Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are trained primarily on English datasets, leading to inadequate representation of Indian languages. To address this gap, we introduce the Chitrakshara dataset series, covering 11 Indian languages sourced from Common Crawl. It comprises (1) Chitrakshara-IL, a large-scale interleaved pretraining dataset with 193M images, 30B text tokens, and 50M multilingual documents, and (2) Chitrakshara-Cap, which includes 44M image-text pairs with 733M tokens. This paper details the data collection pipeline, including curation, filtering, and processing methodologies. Additionally, we present a comprehensive quality and diversity analysis to assess the dataset's representativeness across Indic languages and its potential for developing more culturally inclusive VLMs.
Elizabeth A. Newnham, Enrique L. P. Mergelsberg, Yanyu Chen
et al.
The patterns of long-term psychological response after disasters and pandemics remain unclear. We aimed to determine the trajectories of post-traumatic stress symptoms (PTSS), depression and anxiety prevalence following disaster and pandemic exposure, and to identify associated risk and protective factors. A systematic review of the English, Chinese, and Japanese longitudinal mental health literature was conducted. We searched Cochrane, MEDLINE, ProQuest, PsycINFO, PubMed, Web of Science, and CINAHL (English), CNKI and SINOMED (Chinese) and CiNii (Japanese) for studies published between January 2000 and May 2022. Following a pre-specified protocol (PROSPERO: CRD42020206424), conditional linear growth curve models and ANOVA analyses were conducted. The search identified 77,891 papers, with a final sample of 234: 206 English, 24 Chinese, and 4 Japanese-language papers. PTSS rates improved for all ages (p = .018, eta2 = 0.035). In contrast, depression and anxiety prevalence remained elevated for years following exposure (p = .424, eta2 = 0.019 and p = .051, eta2 = 0.064, respectively), with significantly higher rates for children and adolescents (p < .056). Earthquakes and pandemics were associated with higher prevalence of PTSS (p < .019). Multi-level risk and protective factors were identified. The chronicity of mental health outcomes highlights a critical need for tailored, sustainable mental health services, particularly for children and adolescents, in disaster- and pandemic-affected settings.
Yuichiro Hirano, S. Hanaoka, Takahiro Nakao
et al.
To assess the performance of GPT-4 Turbo with Vision (GPT-4TV), OpenAI’s latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4 T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE). The dataset comprised questions from JDRBE 2021 and 2023. A total of six board-certified diagnostic radiologists discussed the questions and provided ground-truth answers by consulting relevant literature as necessary. The following questions were excluded: those lacking associated images, those with no unanimous agreement on answers, and those including images rejected by the OpenAI application programming interface. The inputs for GPT-4TV included both text and images, whereas those for GPT-4 T were entirely text. Both models were deployed on the dataset, and their performance was compared using McNemar’s exact test. The radiological credibility of the responses was assessed by two diagnostic radiologists through the assignment of legitimacy scores on a five-point Likert scale. These scores were subsequently used to compare model performance using Wilcoxon's signed-rank test. The dataset comprised 139 questions. GPT-4TV correctly answered 62 questions (45%), whereas GPT-4 T correctly answered 57 questions (41%). A statistical analysis found no significant performance difference between the two models (P = 0.44). The GPT-4TV responses received significantly lower legitimacy scores from both radiologists than the GPT-4 T responses. No significant enhancement in accuracy was observed when using GPT-4TV with image input compared with using the text-only GPT-4 T for JDRBE questions.
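The paired accuracy comparison above rests on McNemar's exact test, which looks only at the discordant pairs (questions one model got right and the other got wrong). A minimal sketch follows; the discordant-pair counts used in the example are illustrative placeholders, since the abstract reports only the accuracies and P = 0.44:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value: a binomial test (p = 0.5)
    on the b + c discordant pairs of a paired 2x2 comparison."""
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0
    # probability of a tail at least as extreme as the observed split
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# b = questions only GPT-4TV answered correctly,
# c = questions only GPT-4 T answered correctly (both counts hypothetical).
p_value = mcnemar_exact(17, 12)
```

Note that the concordant cells (both right, both wrong) never enter the statistic, which is why a paired test can differ sharply from comparing the two raw accuracies.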
Large language models exhibit strong multilingual capabilities despite limited exposure to non-English data. Prior studies show that English-centric large language models map multilingual content into English-aligned representations at intermediate layers and then project them back into target-language token spaces in the final layer. From this observation, we hypothesize that this cross-lingual transition is governed by a small and sparse set of dimensions, which occur at consistent indices across the intermediate to final layers. Building on this insight, we introduce a simple, training-free method to identify and manipulate these dimensions, requiring only as few as 50 sentences of either parallel or monolingual data. Experiments on a multilingual generation control task reveal the interpretability of these dimensions, demonstrating that the interventions in these dimensions can switch the output language while preserving semantic content, and that it surpasses the performance of prior neuron-based approaches at a substantially lower cost.
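The training-free identification step can be sketched roughly as follows. The selection criterion used here (largest gap in mean activation between source- and target-language sentences) and the intervention (overwriting those dimensions with the target language's means) are assumptions for illustration, not necessarily the paper's exact procedure:

```python
import numpy as np

def find_language_dims(h_src, h_tgt, k=16):
    """Rank hidden dimensions by the gap in mean activation between two
    languages. h_src, h_tgt are (n_sentences, hidden_size) arrays of
    mean-pooled intermediate-layer hidden states (e.g. from ~50 sentences)."""
    gap = np.abs(h_src.mean(axis=0) - h_tgt.mean(axis=0))
    return np.argsort(gap)[::-1][:k]  # indices of the k largest gaps

def steer(h, dims, target_means):
    """Overwrite only the selected dimensions with the target language's
    means, leaving the rest of the representation untouched."""
    h = h.copy()
    h[..., dims] = target_means[dims]
    return h
```

Because only a small, fixed set of indices is modified, the intervention is cheap and, per the abstract's claim, can flip the output language while the untouched dimensions carry the semantic content.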
Automatic speech recognition systems have undoubtedly advanced with the integration of multilingual and multitask models such as Whisper, which have shown a promising ability to understand and process speech across a wide range of languages. Despite their robustness, these models often fall short in handling the linguistic distinctions of minority languages. This study addresses this gap by integrating traditional and novel language models with fine-tuned Whisper models to improve their performance in less commonly studied languages. Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. Our approach not only takes advantage of the extensive data Whisper was pre-trained on, but also complements its linguistic adaptability by incorporating language models. We obtained improvements up to 51% for in-distribution datasets and up to 34% for out-of-distribution sentences using statistical language models, while large language models provided moderate but consistently robust improvement across diverse linguistic contexts. The findings reveal that, while the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters. Finally, we emphasize the importance of selecting appropriate evaluation parameters when reporting results with transformer-based ASR models. In summary, this research paves the way for more inclusive ASR technologies that perform better across languages by enriching their linguistic knowledge. For further implementation details of this study, the technical documentation and source code are available at http://www.github.com/hitz-zentroa/whisper-lm.
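The core integration idea, interpolating the ASR model's score with an external language model's score over an n-best list, can be sketched as below. The interpolation weight `lam` and the toy hypotheses are illustrative assumptions; the abstract itself notes that such language-model parameters must be optimized:

```python
def rescore_nbest(nbest, lm_logprob, lam=0.5):
    """Return the hypothesis maximizing asr_logprob + lam * lm_logprob.

    nbest:      list of (text, asr_logprob) pairs from the ASR decoder
    lm_logprob: callable mapping text to a language-model log-probability
    lam:        interpolation weight balancing acoustic and linguistic evidence
    """
    return max(nbest, key=lambda h: h[1] + lam * lm_logprob(h[0]))

# Toy example: the LM's preference is strong enough to overturn the
# ASR model's own ranking of two acoustically similar hypotheses.
toy_lm = lambda text: {"recognize speech": -1.0,
                       "wreck a nice beach": -9.0}[text]
best = rescore_nbest([("wreck a nice beach", -2.0),
                      ("recognize speech", -2.5)], toy_lm)
```

With `lam = 0` the function simply returns the decoder's top hypothesis, which is why tuning the weight per language and model size matters.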
Natural language-based assessment (NLA) is an approach to second language assessment that applies instructions originally intended for human examiners - expressed in the form of can-do descriptors - with the aim of determining whether large language models (LLMs) can interpret and apply them in ways comparable to human assessment. In this work, we explore the use of such descriptors with an open-source LLM, Qwen 2.5 72B, to assess responses from the publicly available S&I Corpus in a zero-shot setting. Our results show that this approach - relying solely on textual information - achieves competitive performance: while it does not outperform state-of-the-art speech LLMs fine-tuned for the task, it surpasses a BERT-based model trained specifically for this purpose. NLA proves particularly effective in mismatched task settings, is generalisable to other data types and languages, and offers greater interpretability, as it is grounded in clearly explainable, widely applicable language descriptors.
The evolution of large language models (LLMs) toward artificial superhuman intelligence (ASI) hinges on data reproduction, a cyclical process in which models generate, curate and retrain on novel data to refine capabilities. Current methods, however, risk getting stuck in a data reproduction trap: optimizing outputs within fixed human-generated distributions in a closed loop leads to stagnation, as models merely recombine existing knowledge rather than explore new frontiers. In this paper, we propose language games as a pathway to expanded data reproduction, breaking this cycle through three mechanisms: (1) \textit{role fluidity}, which enhances data diversity and coverage by enabling multi-agent systems to dynamically shift roles across tasks; (2) \textit{reward variety}, embedding multiple feedback criteria that can drive complex intelligent behaviors; and (3) \textit{rule plasticity}, iteratively evolving interaction constraints to foster learnability, thereby injecting continual novelty. By scaling language games into global sociotechnical ecosystems, human-AI co-evolution generates unbounded data streams that drive open-ended exploration. This framework redefines data reproduction not as a closed loop but as an engine for superhuman intelligence.
Hisami Suzuki, Satoru Katsumata, Takashi Kodama
et al.
In this paper we present AnswerCarefully, a dataset for promoting the safety and appropriateness of Japanese LLM outputs. The dataset consists of 1,800 pairs of questions and reference answers, where the questions require special attention in answering. It covers a wide range of risk categories established in prior English-language datasets, but the data samples are original in that they are manually created to reflect the socio-cultural context of LLM usage in Japan. We show that using this dataset for instruction fine-tuning of a Japanese LLM led to improved output safety without compromising the utility of general responses. We also report the results of a safety evaluation of 12 Japanese LLMs using this dataset as a benchmark. Finally, we describe the latest update on the dataset, which provides English translations and annotations of the questions, aimed at facilitating the derivation of similar datasets in different languages and regions.
We introduce a novel paradigm of emergent local memory. It is a continuously learning, completely parallel content-addressable memory encoding global order. It demonstrates how local constraints on uncoordinated learning can produce topologically protected memories realizing emergent symbolic order. It is therefore a neuro-symbolic bridge. It further has the ability to produce human language without data, by exploiting its own self-organizing dynamics. It teaches us that words arise as a side-effect of emergent symbolic order, and that human language patterns at all structural levels reflect a universal mechanism of word formation (which is subregular). This work answers essential questions about the existence and origin of all the human language data.
Multilingual large language models (LLMs) are known to more frequently generate non-faithful output in resource-constrained languages (Guerreiro et al., 2023 - arXiv:2303.16104), potentially because these typologically diverse languages are underrepresented in their training data. To mitigate unfaithfulness in such settings, we propose using computationally light auxiliary models to rescore the outputs of larger architectures. As proof of the feasibility of such an approach, we show that monolingual 4-layer BERT models pretrained from scratch on less than 700 MB of data without fine-tuning are able to identify faithful summaries with a mean accuracy of 88.33% in three genetically unrelated languages that differ in their morphological complexity - Vietnamese, Polish and Georgian. The same hyperparameter combination moreover generalises well to three other tasks, suggesting applications for rescoring beyond improving faithfulness. In order to inform typologically aware model selection, we also investigate how morphological complexity interacts with regularisation, model depth and training objectives, ultimately demonstrating that morphologically complex languages are more likely to benefit from dropout, while, across languages, downstream performance is enhanced most by shallow architectures and by training with the standard BERT objectives.
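One plausible way a small masked language model can rank candidate outputs without any fine-tuning is pseudo-log-likelihood scoring (masking each position in turn and summing the log-probabilities of the true tokens). The abstract does not specify the scoring criterion, so the sketch below is an assumption; `mask_fill_prob` stands in for whatever the auxiliary MLM would return:

```python
import math

def pseudo_log_likelihood(tokens, mask_fill_prob):
    """Sum over positions of log P(tokens[i] | context with i masked),
    where mask_fill_prob(tokens, i) is the probability the masked LM
    assigns to the true token at position i."""
    return sum(math.log(mask_fill_prob(tokens, i)) for i in range(len(tokens)))

def pick_faithful(candidates, mask_fill_prob):
    """Pick the candidate summary the auxiliary model scores highest,
    length-normalized so longer candidates are not unfairly penalized."""
    return max(candidates,
               key=lambda t: pseudo_log_likelihood(t, mask_fill_prob) / len(t))
```

Because the scorer is a shallow 4-layer model, rescoring k candidates this way stays far cheaper than a single forward pass through the large generator, which is the appeal of the auxiliary-model setup.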
The article is devoted to the problems of Russian-Japanese relations on the Chinese Eastern Railway at the initial stage of its history. Workers from Japan were not invited to build the road, but the Japanese were already among the first settlers in the city of Harbin, founded in 1898 by the Chinese Eastern Railway Society. After a short break caused by the Russo-Japanese War, bilateral relations were quickly restored. In 1906, as the Russian troops withdrew from Manchuria, the Japanese began to return to the Chinese Eastern Railway. From 1907, Japanese officials, entrepreneurs, cultural and sports figures from Japan began to take an active part in the socio-political, economic, and cultural life of the city of Harbin. It was here that the first “Russian-Japanese Society” was created. The Japanese in Harbin were mainly engaged in such areas as trade, medicine, entertainment, and domestic service. An important component of bilateral interaction was cooperation in the railway sector. Due to tradition and the inaccessibility of sources, the history of the Japanese population of Harbin is poorly studied in Russian historiography. The aim of this work is to reconstruct the historical picture and identify the problems of the Japanese presence in Harbin and of Russian-Japanese interaction on the Chinese Eastern Railway in the period between the Russo-Japanese and the First World Wars. The study is based on materials from the Harbin periodical press, supplemented by archival materials collected by E. K. Nilus, a researcher of the Chinese Eastern Railway. The information and analytical materials published in the newspapers Harbinskii Vestnik (“Harbin Bulletin”), Harbin, Novaya Zhizn’ (“New Life”), and Man'chzhurskii Kur'er (“Manchurian Courier”), especially advertising, allow one to get an idea of the composition of the Japanese community, the occupations of the Japanese, and some problems of Russian-Japanese relations in 1906–1914.
The study of development of the Chinese Eastern Railway by the Japanese, their coexistence and cooperation with Russians in Harbin will expand knowledge of the history of the Chinese Eastern Railway and achieve a more complete and objective picture of the history of Russian-Japanese relations in the 20th century.
The article focuses on examining the evolution of the memory of the Tokyo Tribunal in historical narratives championed by conservative non-governmental organizations in Japan and the impact these organizations had on historical memory in the country. Among the organizations examined are the Association of Shinto Shrines, the Japan Association of War-Bereaved Families, the Association to Answer the Departed War Heroes, the Society for History Textbook Reform, and the Japan Conference. As a theoretical framework, the research applies the approach of O. Malinova, who considers the politics of memory as a variation of symbolic politics, a struggle between various actors for the dominance of their historical narratives in the public sphere.
The study traces the evolution of perceptions of the Tokyo Tribunal within three periods: 1945—1960, 1960—1985, and 1985—present. In the early stages, right-wing conservative organizations avoided direct criticism of the Tokyo Trials, while developing a policy of remembrance that implicitly contradicted the foundations of the order established by the tribunal and the occupation reforms. The efforts of these groups, primarily Shinto associations and veterans’ and war-bereaved families’ associations, were aimed at achieving certain “special interests”: restoring state funding and raising their social status.
In the second period, criticism of the tribunal became more overt. Due to the need to establish their ideological and historical views among the younger generation, and amidst growing national pride due to economic success, right-wing conservatives consolidated around propelling the cult of fallen war heroes (eirei), which was ritualized at the Yasukuni Shrine. During the same period, 14 individuals recognized by the Tokyo Tribunal as Class A war criminals were enshrined there.
The third period is characterized by the open expression of revisionist views, criticism of the “Tokyo Tribunal's view of history.” New influential right-wing organizations emerged, promoting the reform of school textbooks and educational standards along “patriotic lines.” In recent decades, a new trend has gained momentum: against the backdrop of the decline of the once powerful Association of War-Bereaved Families, owing to the ageing of its members, the efforts of right-wing conservative organizations are increasingly focused on the revision of anti-war articles in the Constitution, while the issues of the Tokyo Tribunal are taking a back seat.
Compass Japanese 1 Interactive Workbook and its accompanying Japanese Writing Practice Book for Novice Learners mark an important contribution to Japanese language instruction. Drawing on the Global Competence framework, the workbook integrates reflection, collaboration, and authentic communication into a learner-centered design. With its diverse representation, inclusive visuals, and wide range of interactive tasks, the series encourages students not only to acquire Japanese but also to explore cultural and social themes. While instructors may need to guide learners through some vocabulary and pitch accent nuances, the workbook’s emphasis on inclusivity, creativity, and intercultural awareness makes it a valuable and inspiring resource for novice-level classrooms.
This chapter elaborates on the general thought process behind Society 5.0 and lists the relevant nomenclature. As per the Japanese government literature, Society 5.0 should be one that, “through the high degree of merging between cyberspace and physical space, will be able to balance economic advancement with the resolution of social problems by providing goods and services that granularly address manifold latent needs regardless of locale, age, sex, or language.”
Dialogue datasets are crucial for deep learning-based task-oriented dialogue system research. While numerous English-language multi-domain task-oriented dialogue datasets have been developed and have contributed to significant advancements in task-oriented dialogue systems, no such dataset exists in Japanese, and research in this area is limited compared to that in English. In this study, towards the advancement of research and development of task-oriented dialogue systems in Japanese, we constructed JMultiWOZ, the first Japanese-language large-scale multi-domain task-oriented dialogue dataset. Using JMultiWOZ, we evaluated the dialogue state tracking and response generation capabilities of both the state-of-the-art methods from the major existing English benchmark dataset MultiWOZ2.2 and the latest large language model (LLM)-based methods. Our evaluation results demonstrated that JMultiWOZ provides a benchmark that is on par with MultiWOZ2.2. In addition, through evaluation experiments of interactive dialogues with the models and human participants, we identified limitations in the task completion capabilities of LLMs in Japanese.
Yui Tatsumi, Shoko Tanaka, Shunsuke Akamatsu
et al.
Sign language is a visual language expressed through hand movements and non-manual markers. Non-manual markers include facial expressions and head movements. These expressions vary across different nations. Therefore, specialized analysis methods for each sign language are necessary. However, research on Japanese Sign Language (JSL) recognition is limited due to a lack of datasets. The development of recognition models that consider both manual and non-manual features of JSL is crucial for precise and smooth communication with deaf individuals. In JSL, sentence types such as affirmative statements and questions are distinguished by facial expressions. In this paper, we propose a JSL recognition method that focuses on facial expressions. Our proposed method utilizes a neural network to analyze facial features and classify sentence types. In experiments, our method achieves a classification accuracy of 96.05%, confirming its effectiveness.
User-generated data sources have gained significance in uncovering Adverse Drug Reactions (ADRs), with an increasing number of discussions occurring in the digital world. However, the existing clinical corpora predominantly revolve around scientific articles in English. This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, 4 attribute types, and 13 relation types. It contributes to the development of real-world multilingual language models for healthcare. We provide statistics to highlight certain challenges associated with the corpus and conduct preliminary experiments resulting in strong baselines for extracting entities and relations between these entities, both within and across languages.
Researchers note that Japanese national identity has been shaped by the nihonjinron ideology, which defines the country as culturally unique and mono-ethnic, thus constituting a divide between Japan and the rest of the world. However, in the context of globalization, the Japanese government, following the global trend, has engaged in a national branding program called “Cool Japan,” which aims to attract foreigners and therefore differs in its message from nihonjinron. Previous research suggests that, in reality, Cool Japan is not so much in conflict with nihonjinron narratives as it inherits their rhetoric. However, little attention has been paid to how foreigners are represented within this program. Therefore, our research question is this: How is the interaction between the nation/national subjects and foreigners portrayed in the Cool Japan discourse, and what characteristics are attributed to both groups?
Through narrative analysis and critical discourse analysis, this study examines government-affiliated media: articles from the “Friends of Japan” section in a government magazine and episodes from the NHK documentary series “Japanology Plus.” The findings show that the analyzed materials reproduce the boundaries between Japanese people and foreigners, rearticulate Japan’s cultural uniqueness, and implicitly reinforce the narrative of mono-ethnicity. In “Friends of Japan,” foreigners are portrayed as enthusiasts of Japanese culture, expected to mediate between Japan and the world without being fully integrated. Similarly, in “Japanology Plus,” foreigners are depicted as “Japanophiles,” helping to bridge Japan and the world by sharing Japan’s uniqueness.
The position of foreigners in this discourse is ambivalent: they appear subordinate to Japanese nationals but are also seen as capable of revitalizing the country’s economy and are thus valued. This duality reflects Japan’s struggle to balance globalization with affirming national identity in the logic of nihonjinron, highlighting the challenges of integrating foreigners into Japanese society.
The purpose of this paper is to use natural language processing to explore how the terms “art,” “literature,” and “novel” were employed during the period from 1910 to 1945. In Japanese literary criticism, “art” served as the central analytical concept in the 1910s and 1920s. From the 1920s onwards, “literature” began to be used in a similar context to “art,” while the term “novel” gradually gained prominence. In the early 1930s, the meanings of “art” and “literature” became closer, but after 1935, the usage of “art” declined. Conversely, “novel” began to be used as an analytical concept more closely aligned with “literature.” The term “art” was primarily used by critics active during the late Meiji and Taisho periods. Proletarian writers considered “art” and “literature” to be nearly synonymous, while dismissing “novel” as a form of popular entertainment. However, with the novel debate as a catalyst, writers began to use “novel” as a term representing legitimate literary works, and this usage was subsequently adopted by a new generation of critics who emerged in the 1930s.