Japanese scene text poses challenges that multilingual benchmarks often fail to capture, including mixed scripts, frequent vertical writing, and a character inventory far larger than the Latin alphabet. Although Japanese is included in several multilingual benchmarks, these resources do not adequately capture the language-specific complexities. Meanwhile, existing Japanese visual text datasets have primarily focused on scanned documents, leaving in-the-wild scene text underexplored. To fill this gap, we introduce JaWildText, a diagnostic benchmark for evaluating vision-language models (VLMs) on Japanese scene text understanding. JaWildText contains 3,241 instances from 2,961 images newly captured in Japan, with 1.12 million annotated characters spanning 3,643 unique character types. It comprises three complementary tasks that vary in visual organization, output format, and writing style: (i) Dense Scene Text Visual Question Answering (STVQA), which requires reasoning over multiple pieces of visual text evidence; (ii) Receipt Key Information Extraction (KIE), which tests layout-aware structured extraction from mobile-captured receipts; and (iii) Handwriting OCR, which evaluates page-level transcription across various media and writing directions. We evaluate 14 open-weight VLMs and find that the best model achieves an average score of 0.64 across the three tasks. Error analyses show recognition remains the dominant bottleneck, especially for kanji. JaWildText enables fine-grained, script-aware diagnosis of Japanese scene text capabilities, and will be released with evaluation code.
Taihei Shiotani, Masahiro Kaneko, Ayana Niwa
et al.
Social biases reflected in language are inherently shaped by cultural norms, which vary significantly across regions and lead to diverse manifestations of stereotypes. Existing evaluations of social bias in large language models (LLMs) for non-English contexts, however, often rely on translations of English benchmarks. Such benchmarks fail to reflect local cultural norms, including those found in Japanese. For instance, Western benchmarks may overlook Japan-specific stereotypes related to hierarchical relationships, regional dialects, or traditional gender roles. To address this limitation, we introduce Japanese cUlture adversarial BiAs benchmarK Under handcrafted creation (JUBAKU), a benchmark tailored to Japanese cultural contexts. JUBAKU uses adversarial construction to expose latent biases across ten distinct cultural categories. Unlike existing benchmarks, JUBAKU features dialogue scenarios hand-crafted by native Japanese annotators, specifically designed to trigger and reveal latent social biases in Japanese LLMs. We evaluated nine Japanese LLMs on JUBAKU and on three other benchmarks adapted from English. All models clearly exhibited biases on JUBAKU, performing below the random baseline of 50% with an average accuracy of 23% (ranging from 13% to 33%), despite higher accuracy on the other benchmarks. Human annotators achieved 91% accuracy in identifying unbiased responses, confirming JUBAKU's reliability and its adversarial nature to LLMs.
In this paper, we design a signalling game-based emergent communication environment to generate state-of-the-art emergent languages in terms of similarity to human language. This is done with hyperparameter optimization, using XferBench as the objective function. XferBench quantifies the statistical similarity of emergent language to human language by measuring its suitability for deep transfer learning to human language. Additionally, we demonstrate the predictive power of entropy on the transfer learning performance of emergent language as well as corroborate previous results on the entropy-minimization properties of emergent communication systems. Finally, we report generalizations regarding what hyperparameters produce more realistic emergent languages, that is, ones which transfer better to human language.
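The entropy analyses described above can be made concrete with a small sketch (illustrative only, not the paper's code): one simple entropy measure for an emergent-language corpus is the Shannon entropy of its unigram symbol distribution, where each message is treated as a sequence of integer symbol ids.

```python
import math
from collections import Counter

def token_entropy(corpus):
    """Shannon entropy (bits per symbol) of the unigram symbol distribution."""
    counts = Counter(sym for message in corpus for sym in message)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy emergent-language corpus: each message is a list of symbol ids.
corpus = [[0, 1, 2], [0, 1, 3], [0, 2, 3]]
print(round(token_entropy(corpus), 3))
```

Under an entropy-minimization account, this quantity would tend to decrease over training; the paper's actual similarity metric (XferBench) is computed via deep transfer learning to human language rather than from this unigram statistic.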
Matheus Belarmino, Rackel Coelho, Roberto Lotudo
et al.
Large Language Models (LLMs) have been increasingly used to optimize the analysis and synthesis of legal documents, enabling the automation of tasks such as summarization, classification, and retrieval of legal information. This study aims to conduct a systematic literature review to identify the state of the art in prompt engineering applied to LLMs in the legal context. The results indicate that models such as GPT-4, BERT, Llama 2, and Legal-Pegasus are widely employed in the legal field, and techniques such as Few-shot Learning, Zero-shot Learning, and Chain-of-Thought prompting have proven effective in improving the interpretation of legal texts. However, challenges such as biases in models and hallucinations still hinder their large-scale implementation. It is concluded that, despite the great potential of LLMs for the legal field, there is a need to improve prompt engineering strategies to ensure greater accuracy and reliability in the generated results.
Modern large-scale networks introduce significant complexity in understanding network behaviors, increasing the risk of misconfiguration. Prior work proposed to understand network behaviors by mining network configurations, typically relying on domain-specific languages interfaced with formal models. While effective, these approaches suffer from a steep learning curve and limited flexibility. In contrast, natural language (NL) offers a more accessible and interpretable interface, motivating recent research on NL-guided network behavior understanding. Recent advances in large language models (LLMs) further enhance this direction, leveraging their extensive prior knowledge of network concepts and strong reasoning capabilities. However, three key challenges remain: 1) numerous router devices with lengthy configuration files strain LLMs' long-context understanding; 2) heterogeneity across devices and protocols impedes scalability; and 3) complex network topologies and protocols demand reasoning abilities beyond the current capabilities of LLMs. To tackle these challenges, we propose NetMind, a novel framework for querying networks using NL. Our approach introduces a tree-based configuration chunking strategy that preserves semantic coherence while enabling efficient partitioning. We then construct a unified fact graph as an intermediate representation to normalize vendor-specific configurations. Finally, we design a hybrid imperative-declarative language to reduce the reasoning burden on LLMs and enhance precision. We contribute a benchmark of NL question-answer pairs paired with network configurations. Experiments demonstrate that NetMind achieves accurate and scalable network behavior understanding, outperforming existing baselines.
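As a rough illustration of what structure-aware chunking buys over fixed-size splitting (a hypothetical sketch, not NetMind's actual algorithm), the function below splits an indentation-structured router configuration only at top-level stanza boundaries, so no chunk ever cuts a block in half:

```python
def chunk_config(text, max_lines=4):
    """Split an indentation-structured config into chunks without ever
    cutting across a top-level stanza, preserving semantic coherence."""
    blocks, current = [], []
    for line in text.splitlines():
        if line and not line.startswith(" ") and current:
            blocks.append(current)  # a new top-level stanza begins
            current = []
        if line.strip():
            current.append(line)
    if current:
        blocks.append(current)
    # Greedily pack whole stanzas into chunks under the size budget.
    chunks, buf = [], []
    for block in blocks:
        if buf and len(buf) + len(block) > max_lines:
            chunks.append("\n".join(buf))
            buf = []
        buf.extend(block)
    if buf:
        chunks.append("\n".join(buf))
    return chunks

config = """interface eth0
 ip address 10.0.0.1/24
router bgp 65000
 neighbor 10.0.0.2 remote-as 65001
"""
chunks = chunk_config(config, max_lines=3)
print(len(chunks))
```

A real system would build a full parse tree and recurse into oversized subtrees; the greedy packing here is only meant to show the invariant (chunk boundaries align with configuration blocks) that a tree-based strategy preserves.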
The Japanese word "好きだ" (suki da, like) is a な-adjective (adjectival noun) used to express preferences. It typically appears in the structure "~が好きだ" to indicate liking someone or something. However, in practical Japanese usage, the structure "~を好きだ" is also used in certain contexts, raising questions about its grammatical variability and acceptability in modern Japanese. With advances in large-scale corpora, language research has increasingly transitioned into the era of big data, allowing for more systematic and empirical analysis of linguistic patterns. This study explores the grammatical role, frequency, and acceptance of this alternative structure, offering insights into language theory and language education.
Masaru Tateno, Ryotaro Shimomura, Eri Shiraishi
et al.
The association between autism spectrum disorder (ASD) and gender dysphoria (GD) has attracted a great deal of interest among child and adolescent psychiatrists in Japan. In clinical settings, it is common to hear complaints or concerns related to GD from adolescents with ASD. In the past few years, several review articles have been published on ASD and GD. The Initial Clinical Guidelines for co-occurring ASD and GD in adolescents were published in 2018, suggesting the increasing need for intervention for these conditions worldwide. Although a large amount of evidence has been accumulated regarding the co-occurrence of ASD and GD, all review articles were based solely on case reports and articles published in English. In this article, we performed a bilingual literature review using English- and Japanese-language literature databases. We found 13 case reports in English and 11 case reports in Japanese. The Japanese literature included articles on gender-related symptoms in ASD, not limited to ASD with comorbid GD. Wattel and her colleagues proposed 15 theories on the link between ASD and GD. We classified the reported cases into one or more of these 15 theories. These theories seemed useful in understanding the co-occurrence of ASD and GD, especially in AMAB (assigned male at birth) cases. Wattel’s 15 theories fall into biological, psychological, and social categories. With regard to the social factors, we discussed Japanese school culture and the psychological burden among gender-dysphoric students. Further studies are awaited.
This research critically navigates the intricate landscape of AI deception, concentrating on deceptive behaviours of Large Language Models (LLMs). My objective is to elucidate this issue, examine the discourse surrounding it, and subsequently delve into its categorization and ramifications. The essay opens with an evaluation of the AI Safety Summit 2023 (ASS) and an introduction to LLMs, emphasising the multidimensional biases that underlie their deceptive behaviours. The literature review covers four categories of deception: Strategic Deception, Imitation, Sycophancy, and Unfaithful Reasoning, along with the social implications and risks they entail. Lastly, I take an evaluative stance on various aspects of navigating the persistent challenges of deceptive AI, encompassing international collaborative governance, the reconfigured engagement of individuals with AI, proposals for practical adjustments, and specific elements of digital education.
No two authors write alike. Personal flourishes invoked in written narratives, from lexicon to rhetorical devices, imply a particular author: what literary theorists label the implied or virtual author, distinct from the real author or narrator of a text. Early large language models trained on unfiltered training sets drawn from a variety of discordant sources yielded incoherent personalities, problematic for conversational tasks but useful for sampling literature from multiple perspectives. Successes in alignment research in recent years have allowed researchers to impose subjectively consistent personae on language models via instruction tuning and reinforcement learning from human feedback (RLHF), but whether aligned models retain the ability to model an arbitrary virtual author has received little scrutiny. By studying 4,374 stories sampled from three OpenAI language models, we show successive versions of GPT-3 suffer from increasing degrees of "mode collapse" whereby overfitting the model during alignment constrains it from generalizing over authorship: models suffering from mode collapse become unable to assume a multiplicity of perspectives. Our method and results are significant for researchers seeking to employ language models in sociological simulations.
Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. At this point, datasets have been released which are 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, and 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the L+M-24 dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, L+M-24 is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.
Jafar Isbarov, Kavsar Huseynova, Elvin Mammadov
et al.
The emergence of multilingual large language models has enabled the development of language understanding and generation systems in Azerbaijani. However, most of the production-grade systems rely on cloud solutions, such as GPT-4. While there have been several attempts to develop open foundation models for Azerbaijani, these works have not found their way into common use due to a lack of systemic benchmarking. This paper encompasses several lines of work that promote open-source foundation models for Azerbaijani. We introduce (1) a large text corpus for Azerbaijani, (2) a family of encoder-only language models trained on this dataset, (3) labeled datasets for evaluating these models, and (4) extensive evaluation that covers all major open-source models with Azerbaijani support.
As a result of population ageing, shifts in household structure, and changes in the labor market over the last two to three decades, Japan has seen a gradual strengthening of initial income disparity, as well as some increase in the relative poverty rate. However, through the mechanisms of income redistribution embedded in the social security and tax systems, as well as through material support for the least protected segments of the population, the state has managed to restrain these processes. The measures taken in recent years to support families with children have been particularly important: they made it possible to reduce the rate of relative child poverty and keep income inequality among this group of families at a relatively low level.

Obviously, current income, by which the levels of relative poverty and income inequality are measured, cannot by itself indicate that a family or a particular person lives in poverty. In addition to current income, the standard of living also depends on the amount of financial savings, the availability of real estate, the possession of securities, etc. For example, older citizens, who are among the least well-off in terms of current income, hold the largest share of the country’s accumulated financial assets. The results of opinion polls conducted annually by the Prime Minister’s Cabinet Office show that the changes taking place in the country have not led to the transformation of the Japanese “middle-class society” into a “divided society”.

For more than half a century, the absolute majority of the Japanese, about 90 percent, when asked how they would rate their families’ standard of living, have chosen the answer “middle level”. At the same time, within the middle class there has been a shift towards an increasing share of the more well-off (middle and higher) strata. In general, Japanese society remains healthy and prosperous.
As for poverty, unlike Russia, where it is a consequence of blatant social injustice and extreme social contrasts, in Japan, in our opinion, it is not systemic and arises as a result of some particularly unfavorable, exceptional circumstances in which a family or a person find themselves.
Language models (LMs) have been argued to overlap substantially with human beings in grammaticality judgment tasks. But when humans systematically make errors in language processing, should we expect LMs to behave like cognitive models of language and mimic human behavior? We answer this question by investigating LMs' more subtle judgments associated with "language illusions" -- sentences that are vague in meaning, implausible, or ungrammatical but receive unexpectedly high acceptability judgments by humans. We looked at three illusions: the comparative illusion (e.g. "More people have been to Russia than I have"), the depth-charge illusion (e.g. "No head injury is too trivial to be ignored"), and the negative polarity item (NPI) illusion (e.g. "The hunter who no villager believed to be trustworthy will ever shoot a bear"). We found that probabilities represented by LMs were more likely to align with human judgments of being "tricked" by the NPI illusion which examines a structural dependency, compared to the comparative and the depth-charge illusions which require sophisticated semantic understanding. No single LM or metric yielded results that are entirely consistent with human behavior. Ultimately, we show that LMs are limited both in their construal as cognitive models of human language processing and in their capacity to recognize nuanced but critical information in complicated language materials.
Contrastive learning is widely used for sentence representation learning. Despite this prevalence, most studies have focused exclusively on English, and few concern domain adaptation for domain-specific downstream tasks, especially for low-resource languages like Japanese, which are characterized by insufficient target-domain data and the lack of a proper training strategy. To overcome this, we propose a novel Japanese sentence representation framework, JCSE (derived from “Contrastive learning of Sentence Embeddings for Japanese”), that creates training data by generating sentences and synthesizing them with sentences available in a target domain. Specifically, a pre-trained data generator is finetuned to a target domain using our collected corpus. It is then used to generate contradictory sentence pairs that are used in contrastive learning for adapting a Japanese language model to a specific task in the target domain. Another problem of Japanese sentence representation learning is the difficulty of evaluating existing embedding methods due to the lack of benchmark datasets. Thus, we establish a comprehensive Japanese Semantic Textual Similarity (STS) benchmark on which various embedding models are evaluated. Based on this benchmark result, multiple embedding methods are chosen and compared with JCSE on two domain-specific tasks, STS in a clinical domain and information retrieval in an educational domain. The results show that JCSE achieves significant performance improvement surpassing direct transfer and other training strategies. This empirically demonstrates JCSE's effectiveness and practicability for downstream tasks of a low-resource language.
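Contrastive sentence-embedding methods of the kind JCSE builds on typically optimize an InfoNCE objective over in-batch negatives. The sketch below is a generic NumPy illustration of that loss (the batch shapes, temperature, and random data are assumptions for the example, not the paper's settings):

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE over a batch: row i of `positives` is the positive for row i
    of `anchors`; every other row serves as an in-batch negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = a @ p.T / temperature  # cosine similarities, shape (batch, batch)
    # Cross-entropy with the diagonal (matched pairs) as the gold labels.
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 16))
low = info_nce_loss(batch, batch)                       # matched pairs
high = info_nce_loss(batch, rng.normal(size=(8, 16)))   # unrelated pairs
print(low < high)
```

In the paper, the generated contradictory sentence pairs supply the training signal (e.g., as hard negatives) rather than random vectors; the point here is only the shape of the objective.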
Information Extraction (IE) stands as a cornerstone of natural language processing, traditionally segmented into distinct sub-tasks. The advent of Large Language Models (LLMs) heralds a paradigm shift, suggesting the feasibility of a single model addressing multiple IE subtasks. In this vein, we introduce the General Information Extraction Large Language Model (GIELLM), which integrates Text Classification, Sentiment Analysis, Named Entity Recognition, Relation Extraction, and Event Extraction using a uniform input-output schema. This innovation marks the first instance of a model simultaneously handling such a diverse array of IE subtasks. Notably, GIELLM leverages the Mutual Reinforcement Effect (MRE), enhancing performance on integrated tasks compared to their isolated counterparts. Our experiments demonstrate state-of-the-art (SOTA) results on five out of six Japanese mixed datasets, significantly surpassing GPT-3.5-Turbo. Further, an independent evaluation using the novel Text Classification Relation and Event Extraction (TCREE) dataset corroborates the synergistic advantages of MRE in text and word classification. This breakthrough paves the way for most IE subtasks to be subsumed under a single LLM framework; specialized task-specific fine-tuned models are no longer needed.
This year’s conference was held in Denpasar, Bali, after three years of online conferences since COVID-19 emerged in 2020. It was a precious and wonderful experience for researchers from various regions to gather together. The joy of sharing research results and of further developing awareness of issues through discussion was great, and new encounters took place as we deepened our friendships over Balinese cuisine at the banquet. The special theme of this year’s conference was “Glocal Culture in Modern Asia and Intersectionality: Representation, Conflict, and Symbiosis,” building on the 8th conference’s theme of “Glocal Culture in Modern Asia: Identity, History, and Transformation,” with the key concept of “intersectionality” newly assimilated into that theme. Listening to the stimulating presentations on a wide range of research subjects from a broad range of regions, I was reminded that the concept of “intersectionality” is necessary in order to theorize individual experience as a complex nexus of power and resistance, a phenomenon that cannot be captured by one-sided views or simple oppositional schemes. It was a meaningful conference, and I came away with much homework and many perspectives to assimilate in order to further my own research.
Proverbs in Japanese are known as kotowaza, defined as wise sayings or ancient proverbs with many practical benefits. This study aims to examine the understanding of Japanese-language students in the city of Bandung regarding the meaning of proverbs and their equivalents in Indonesian. The research method is descriptive-quantitative, and the analysis applies Ogden and Richards's triangle of meaning theory. This paper is expected to offer insight into understanding the meaning of proverbs. Although each word in these proverbs has its own distinct meaning, when the words are formed into a proverb, in both Indonesian and Japanese, a new meaning emerges that conveys sarcasm or advice, or simply reinforces the meaning of each utterance. The word elements used to form proverbs in Japanese and Indonesian are the same, but the meaning differs depending on the culture of the speakers. The results of the analysis showed that out of 41 students of Japanese-language study programs in Bandung, 29 understood the meaning of some of the Japanese proverbs given, even though the proverbs had not been covered in their formal education. As for the Indonesian equivalents, each of these Japanese proverbs has two Indonesian equivalents.
After the 1979 Islamic Revolution, Iran did not change its policy towards Japan, which was not only perceived separately from the West, but was also a major customer of its raw resources and a potential economic donor. In turn, Japan had to pursue a policy of unstable balancing towards Iran, coordinating its Middle Eastern policy with its strategic ally, the United States, while simultaneously trying to preserve close ties with Iran, which were important from the point of view of maintaining energy security. The article shows how Japan tried to maneuver between these two countries: it took steps to realize large-scale economic projects in Iran’s energy sector, to limit sanctions, which were minimal on its side, etc., and simultaneously participated in the anti-Iran campaign pursued by Western countries under the pressure of the US, or under the influence of major crises in the region (for example, the Iran-Iraq War). Japan continues to face this “Middle Eastern dilemma” in present times as well. It could partly have been solved by the signing of the Joint Comprehensive Plan of Action on the Iranian nuclear program in 2015, were it not for the fact that, two years later, under the US administration of Donald Trump, US-Iran relations worsened significantly. Nevertheless, in this period, Japan successfully tested its mediatory potential, making use of its good relations with both parties to the conflict: on the one hand, it refused to participate in the US operation in the Strait of Hormuz; on the other, it halted the unfreezing of Iran’s funds in its banks, which had been blocked after the introduction of sanctions by the Trump administration.
Recently, the growing activity of China in the region has become a risk factor for Japan’s Middle Eastern policy, as China tries to exploit the weakening position of the US and to fill the vacuum left in its wake, as illustrated by the signing of the Comprehensive Strategic Partnership agreement between Beijing and Tehran in March 2021. This forces Japan to deliberately strengthen its positions and expand its presence in Iran. It appears that Japan’s approaches to solving the “Middle Eastern dilemma” will deserve special attention in the future as well.