Japanese finance combines agglutinative, head-final linguistic structure, mixed writing systems, and high-context communication norms that rely on indirect expression and implicit commitment, posing a substantial challenge for LLMs. We introduce Ebisu, a benchmark for native Japanese financial language understanding, comprising two linguistically and culturally grounded, expert-annotated tasks: JF-ICR, which evaluates implicit commitment and refusal recognition in investor-facing Q&A, and JF-TE, which assesses hierarchical extraction and ranking of nested financial terminology from professional disclosures. We evaluate a diverse set of open-source and proprietary LLMs spanning general-purpose, Japanese-adapted, and financial models. Results show that even state-of-the-art systems struggle on both tasks. While increased model scale yields limited improvements, language- and domain-specific adaptation does not reliably improve performance, leaving substantial gaps unresolved. Ebisu provides a focused benchmark for advancing linguistically and culturally grounded financial NLP. All datasets and evaluation scripts are publicly released.
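To make the JF-ICR task concrete, the following minimal Python sketch shows what an evaluation item and accuracy loop for implicit commitment and refusal recognition could look like; the field names, label set, and the example Q&A pair are illustrative assumptions, not Ebisu's released schema.

```python
# Hypothetical JF-ICR-style item: an indirect investor-relations answer whose
# pragmatic force (commitment vs. refusal) must be classified.
items = [
    {"question": "増配の予定はありますか。",    # "Do you plan to raise the dividend?"
     "answer": "前向きに検討してまいります。",  # "We will consider it positively."
     "label": "implicit_commitment"},
]

def accuracy(predict, items):
    # predict: callable mapping (question, answer) to one of the task labels
    hits = sum(predict(i["question"], i["answer"]) == i["label"] for i in items)
    return hits / len(items)
```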
Abstract
The present study was conducted within the framework of Systemic Functional Grammar and on the basis of systemic functional typology, specifically Matthiessen's typological generalizations. Drawing on examples taken from various written documents in Azeri Turkic, such as grammar books and a series of stories, as well as constructed examples, this study aims to describe the typological behavior of the MOOD TYPE system in the clause structure of Azeri Turkic in terms of Matthiessen's typological generalizations regarding the MOOD system. The results indicate that the Azeri Turkic MOOD TYPE system (1) has all three of the declarative, polar interrogative, and imperative moods, (2) uses negative polar interrogatives to indicate the speaker's positive bias, (3) belongs to the type of languages that have the Wh-interrogative category, (4) queries only participants and circumstantial adjuncts in Wh-interrogatives, (5) belongs to the type of Wh-in-situ languages, and (6) differentiates the imperative mood from the other mood types.

Introduction
This study, conducted within the framework of Systemic Functional Grammar and based on Matthiessen's (2004) typology of the MOOD TYPE system, a subsystem of the interpersonal metafunction, aims to investigate and describe the MOOD TYPE system in Azeri Turkic, which belongs to the Southwestern branch of the Turkic languages, also known as the Western Oghuz group. Grounded in systemic functional typology, this research seeks empirical generalizations applicable across languages. Matthiessen (2004) developed descriptive generalizations through comparative analysis of the experiential, logical, interpersonal, and textual systems of various languages, identifying typological universals and variations. Following Matthiessen's claim that these generalizations can be applied to any language within a Systemic Functional framework, this study explores the realization of the MOOD TYPE system in Azeri Turkic. Data were collected from diverse sources, including short story collections, academic articles, grammar books on Azeri Turkic, and original examples provided by the researcher. The paper is structured into five sections: introduction, review of related literature, theoretical framework, analysis of MOOD TYPE in Azeri Turkic, and concluding remarks presenting the findings.

Literature Review
This section reviews several studies on the clause type system, including Mirahmadi's (2004) Systemic Functional analysis of Persian mood types, Pahlavannajhad & Vazirnejad's (2004) stylistic study of mood types in Zoya Pirzad's novel, Najm's (2008) cross-linguistic comparison of English and Arabic imperatives and exclamatives, Figuerdo's (2010) description of Portuguese mood types, and traditional grammatical studies on Azerbaijani Turkish by Li (1996), Ahmadi Givi (2004), Dehqani (2000), and Zahedi & Bayan (2008). Unlike previous studies, this research contributes to Systemic Functional Typology by analyzing mood types in Azeri Turkic through a functional lens, aiming to determine whether Matthiessen's descriptive generalizations can be effectively applied to this language.

Results
The findings of the present study show that Azeri Turkic has four major mood types: declarative, polar interrogative, content interrogative (Wh-question), and imperative.
This confirms Matthiessen's generalization that declarative, polar interrogative, and imperative clauses are universal, while the presence of content interrogatives places Azeri Turkic among the languages that distinguish this category. Polar interrogatives in Azeri Turkic appear in both biased and unbiased forms, marked by particles such as ɒjɒ 'whether' for neutral questions and mæjær, mæjæ, bæjæ, or bæ 'don't/doesn't, didn't' for biased questions. These markers typically appear at the beginning of the clause, contradicting Matthiessen's generalization that such particles occur at the end in SOV languages. Additionally, polar interrogatives may be unmarked, distinguished from declaratives only by falling intonation. Content interrogatives are used to inquire about specific elements and are marked by Wh-words such as cim 'who', hɒrɒ 'where', and nijæ 'why'. These clauses are clearly differentiated from declaratives through question words and rising intonation, aligning Azeri Turkic with typologically similar languages like English and Japanese, where content and polar interrogatives form a distinct mood type separate from declaratives. In terms of word order, Azeri Turkic keeps Wh-elements in their canonical position within the clause rather than fronting them, placing it in the typological category of "Wh-in-situ" languages alongside Persian, Chinese, and Japanese, as opposed to English, French, and German.

In Azeri Turkic, as in Persian and English, the imperative mood is marked by the absence of an overt subject, which is usually implied. Unlike in languages such as Mandarin Chinese and Hebrew, where negation in imperatives differs morphologically from declaratives, Azeri Turkic, like Persian and English, uses the same negative form across both imperative and non-imperative clauses. This indicates a syntactic independence between the imperative mood and the system of polarity in Azeri Turkic. Another typological feature of imperative clauses is the realization of speech functions relative to the speaker-listener relationship. Azeri Turkic, like German, Persian, and English, offers more delicate choices within the imperative mood to express politeness.
For instance, all three languages can use polar interrogatives to express polite commands.

Conclusion
This study set out to describe the MOOD TYPE system in Azeri Turkic within a Systemic Functional Typological framework. Drawing on diverse sources, including short story collections, academic articles, grammar books on Azeri Turkic, and the researcher's linguistic intuition, the study demonstrates that the MOOD TYPE system in Azeri Turkic:
(1) includes the three universal mood types: declarative, polar interrogative, and imperative;
(2) uses negative polar interrogatives to express the speaker's positive bias;
(3) allows polar interrogatives to be expressed in declarative structure, with no formal marking other than intonation;
(4) belongs to the typological category of languages that distinguish content interrogatives (Wh-questions);
(5) questions only participants and circumstantial adjuncts, not processes, as interrogative elements in Wh-questions;
(6) is classified typologically as a "Wh-in-situ" language, in which question words remain in their canonical position rather than being fronted;
(7) clearly distinguishes the imperative mood from other mood types;
(8) typically omits the addressee (second person) as an unmarked feature of imperative clauses;
(9) treats the imperative mood and the system of polarity independently, with no morphological distinction between negative forms in imperatives and non-imperatives;
(10) can metaphorically express the speech function of command through polar interrogatives, depending on the social relationship between speaker and listener.
Overall, Azeri Turkic exhibits a well-differentiated MOOD TYPE system that aligns with broader systemic functional typological generalizations regarding mood types and their variation across languages.
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. This adaptation process presents two major limitations. First, ALLMs often suffer from catastrophic forgetting, where crucial textual capabilities like instruction-following are lost after training on audio data. In some cases, models may even hallucinate sounds that are not present in the input audio, raising concerns about reliability. Second, achieving cross-modal alignment between audio and language typically relies on large collections of task-specific question-answer pairs for instruction tuning, making it resource-intensive. To address these issues, previous works have leveraged the backbone LLMs to synthesize general-purpose, caption-style alignment data. In this paper, we propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds. We further extend our approach to multi-audio scenarios, enabling the model to either explain differences between audio inputs or produce unified captions that describe all inputs, thereby enhancing audio-language alignment. We refer to the entire ALLM training framework as bootstrapping audio-language alignment via synthetic data generation from backbone LLMs (BALSa). Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance on audio understanding and reasoning benchmarks, as well as instruction-following skills. Moreover, incorporating multi-audio training further enhances the model's comprehension and reasoning capabilities. Overall, BALSa offers an efficient and scalable approach to developing ALLMs.
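As a hedged illustration of the contrastive idea, the sketch below pairs a "present" sound question with an "absent" sound question for the same caption; the vocabulary, field names, and template questions are assumptions for exposition (BALSa itself synthesizes such data with the backbone LLM rather than fixed templates).

```python
# Build a present/absent question pair from a caption with known sound events,
# so the trained ALLM must learn to deny sounds that are not in the audio.
import random

SOUND_VOCAB = ["dog barking", "rain", "car horn", "piano", "applause"]

def make_contrastive_pair(caption: str, present: list[str]) -> list[dict]:
    absent = random.choice([s for s in SOUND_VOCAB if s not in present])
    return [
        {"caption": caption, "question": f"Is there {present[0]} in the audio?",
         "answer": "Yes."},
        {"caption": caption, "question": f"Is there {absent} in the audio?",
         "answer": "No."},  # contrastive negative: sound absent from the caption
    ]

print(make_contrastive_pair("A dog barks while rain falls.", ["dog barking", "rain"]))
```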
Large Language Models (LLMs) possess extensive knowledge and commonsense reasoning capabilities, making them valuable for creating powerful agents. However, existing LLM agent frameworks have not fully utilized past experiences for improvement. This work introduces a new LLM-based agent framework called Retrospex, which addresses this challenge by analyzing past experiences in depth. Unlike previous approaches, Retrospex does not directly integrate experiences into the LLM's context. Instead, it combines the LLM's action likelihood with action values estimated by a Reinforcement Learning (RL) Critic, which is trained on past experiences through an offline ''retrospection'' process. Additionally, Retrospex employs a dynamic action rescoring mechanism that increases the importance of experience-based values for tasks that require more interaction with the environment. We evaluate Retrospex in ScienceWorld, ALFWorld and Webshop environments, demonstrating its advantages over strong, contemporary baselines.
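A minimal sketch of the rescoring idea, assuming each candidate action carries an LLM log-probability and a critic Q-value; the linear weight schedule below is an illustrative stand-in for the paper's dynamic rescoring mechanism, not its exact form.

```python
# Blend LLM likelihood with experience-based value; the experience weight w
# grows with the number of environment steps already taken.
def rescore(logp_llm: float, q_value: float, step: int, max_steps: int) -> float:
    w = 0.5 * min(1.0, step / max_steps)  # more interaction -> trust the critic more
    return (1.0 - w) * logp_llm + w * q_value

def pick_action(candidates, step, max_steps):
    # candidates: iterable of (action, llm_logprob, critic_q) triples
    return max(candidates, key=lambda c: rescore(c[1], c[2], step, max_steps))[0]
```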
Mutsumi Sasaki, Go Kamoda, Ryosuke Takahashi
et al.
Temporal reasoning and knowledge are essential capabilities for language models (LMs). While much prior work has analyzed and improved temporal reasoning in LMs, most studies have focused solely on the Gregorian calendar. However, many non-Gregorian systems, such as the Japanese, Hijri, and Hebrew calendars, are in active use and reflect culturally grounded conceptions of time. Whether, and how well, current LMs can accurately handle such non-Gregorian calendars has not yet been evaluated. Here, we present a systematic evaluation of how well language models handle one such non-Gregorian system: the Japanese wareki. We create datasets that require temporal knowledge and reasoning with wareki dates. Evaluating open and closed LMs, we find that some models can perform calendar conversions, but GPT-4o, DeepSeek V3, and even Japanese-centric models struggle with Japanese calendar arithmetic and knowledge involving wareki dates. Error analysis suggests the corpus frequency of Japanese calendar expressions and a Gregorian bias in the models' knowledge as possible explanations. Our results show the importance of developing LMs that are better equipped for culture-specific tasks such as calendar understanding.
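For readers unfamiliar with wareki, the core year arithmetic the benchmark probes is easy to state in code; the sketch below is simplified in that it ignores mid-year era transitions (for example, 1989 was Shōwa 64 until January 7 and Heisei 1 thereafter).

```python
# Era start years for the modern Japanese calendar (wareki).
ERA_START = {"明治": 1868, "大正": 1912, "昭和": 1926, "平成": 1989, "令和": 2019}

def wareki_to_gregorian(era: str, year: int) -> int:
    return ERA_START[era] + year - 1  # era year 1 (gannen) falls in the start year

def gregorian_to_wareki(year: int) -> str:
    era, start = max(((e, s) for e, s in ERA_START.items() if s <= year),
                     key=lambda x: x[1])  # latest era that has begun
    n = year - start + 1
    return f"{era}{'元' if n == 1 else n}年"

assert wareki_to_gregorian("令和", 5) == 2023
print(gregorian_to_wareki(1989))  # 平成元年 (ignoring the January transition)
```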
In this paper, we offer an overview of Indigenous languages, identifying the causes of their devaluation and the need for legislation on language rights. We review the technologies used to revitalize these languages, finding that when they come from outside a community, they often have the opposite of their intended effect; however, when developed from within communities, they become powerful instruments of expression. We propose that the inclusion of Indigenous knowledge in large language models (LLMs) will enrich the technological landscape, but it must be done in a participatory environment that encourages the exchange of knowledge.
Carolina Minami Oguchi, Leo Wei, Koyo Kobayashi
et al.
Post-training methods have improved performance and enhanced reasoning capability in mainstream large language models (LLMs), but achieving the same for Japanese LLMs is challenging due to the amount of resources required. Inspired by task vectors, which extract the change in weights before and after training for a specific task, we obtain reasoning vectors from reasoning LLMs and apply them to Japanese LLMs to boost their performance. While limited resources make it challenging to improve Japanese LLMs, we present a simple and effective way to obtain substantial improvements and hope to inspire similar work for other languages.
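A minimal sketch of the weight arithmetic this describes, assuming the reasoning model and the Japanese model were both derived from the same base checkpoint; the model names and scaling coefficient are placeholders, not the paper's actual choices.

```python
# reasoning vector = weights(reasoning model) - weights(base model);
# add it, scaled, to a Japanese-adapted model with the same architecture.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-llm")                # placeholder
reasoner = AutoModelForCausalLM.from_pretrained("base-llm-reasoning")  # placeholder
japanese = AutoModelForCausalLM.from_pretrained("base-llm-japanese")   # placeholder
alpha = 1.0  # scaling coefficient for the reasoning vector

base_sd, rsn_sd, ja_sd = base.state_dict(), reasoner.state_dict(), japanese.state_dict()
merged = {k: (ja_sd[k] + alpha * (rsn_sd[k] - base_sd[k]))
          if ja_sd[k].is_floating_point() else ja_sd[k]  # skip integer buffers
          for k in ja_sd}
japanese.load_state_dict(merged)
japanese.save_pretrained("base-llm-japanese-plus-reasoning")
```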
We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and achieves competitive performance with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, codes, and datasets are released at https://github.com/EQUES-Inc/pharma-LLM-eval.
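To make the SogoCheck task concrete, here is a hedged sketch of a paired-statement consistency item and an exact-match scoring loop; the field names, label set, and example statements are illustrative assumptions, not the released data format.

```python
# Hypothetical SogoCheck-style item: decide whether two statements can both hold.
items = [
    {"statement_a": "Drug X is contraindicated during pregnancy.",
     "statement_b": "Drug X may be safely prescribed to pregnant patients.",
     "label": "inconsistent"},
]

def exact_match_score(predict, items):
    # predict: callable mapping (statement_a, statement_b) to a label string
    hits = sum(predict(i["statement_a"], i["statement_b"]) == i["label"]
               for i in items)
    return hits / len(items)
```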
As translated Japanese literature crosses over to become world literature, Orientalist assumptions about modern Japanese literature persist across the global academy at large: that it is of limited scope, rejects fictionalization, and explores states of mind. The objective of this essay is to counter such essentialist and dated assumptions by highlighting the actual breadth and diversity of English translations that negate these stereotypes. I even question postulating any homogeneous genre under the rubric "Japanese literature." I begin with a broad survey of English-language reviews by non-area-specialists of three contemporary texts of Japanese literature in translation: Mizumura Minae's (b. 1951) A True Novel (Honkaku shōsetsu, 2002), Kirino Natsuo's (b. 1951) OUT (AUTO, 1997), and Kaneshiro Kazuki's (b. 1968) Go (Gō, 2000). I follow that survey by revisiting and questioning the dominant literary discourse among Japan specialists regarding just what constitutes "modernity" in Japanese literary studies. I then return my focus to the three contemporary Japanese novels already introduced, taking up in turn the complexities of each novel's plot, narration strategies, focalization, issues of ethnicity and race, the relation of the individual to social conflicts and issues, and degrees of fictionalization versus realism. Finally, I show how these various aspects of each of the three can moreover be seen as complementary to some of the most highly regarded fictions of the belles lettres traditions of modern Japanese literature, and hence belong in an alternative genealogy of modern Japanese literature.
Compass Japanese 1 Interactive Workbook and its accompanying Japanese Writing Practice Book for Novice Learners mark an important contribution to Japanese language instruction. Drawing on the Global Competence framework, the workbook integrates reflection, collaboration, and authentic communication into a learner-centered design. With its diverse representation, inclusive visuals, and wide range of interactive tasks, the series encourages students not only to acquire Japanese but also to explore cultural and social themes. While instructors may need to guide learners through some vocabulary and pitch accent nuances, the workbook’s emphasis on inclusivity, creativity, and intercultural awareness makes it a valuable and inspiring resource for novice-level classrooms.
English language education in Japan has long been criticized for its traditional methods, which emphasize grammar and reading at the expense of communicative competence. This article explores the potential of Artificial Intelligence in Education (AIEd) to address this issue. A review of the literature identified key challenges faced by Japanese EFL learners, including Japanese teachers' low English proficiency and attitudes toward English teaching, a heavy focus on entrance examinations in high school, an overemphasis on grammar in EFL curricula and textbooks, a lack of authentic communicative practice, and differences in cultural values. An analysis of technology integration in Japanese education revealed that while many institutions have begun incorporating technology, its widespread adoption has been gradual. Several case studies support the use of AI to address the psychological barrier to speaking by offering a safe and engaging learning environment, thus boosting confidence and fluency. Furthermore, in the Japanese language context, the use of AI can lower anxiety, promote creativity, and offer personalized learning. Beyond these individual benefits, AI empowers institutions to tailor learning to students' needs, teachers to shift their role from instructors to facilitators, and students to become independent critical thinkers. Finally, challenges and limitations are addressed, including ethical considerations surrounding data privacy, overreliance, authenticity, watermarking, and academic integrity. Despite potential drawbacks, the benefits of AIEd merit a deeper exploration of its adoption in EFL curricula. AI tools can be a practical way to prepare Japanese EFL students to communicate effectively and confidently in English and thus participate in the global landscape.
Naoaki Okazaki, Kakeru Hattori, Hirai Shota
et al.
Open Japanese large language models (LLMs) have been trained on the Japanese portions of corpora such as CC-100, mC4, and OSCAR. However, these corpora were not curated with the quality of Japanese text in mind. This study builds a large Japanese web corpus by extracting and refining text from the Common Crawl archive (21 snapshots comprising approximately 63.4 billion pages crawled between 2020 and 2023). The resulting corpus consists of approximately 312.1 billion characters (approximately 173 million pages), the largest of all available training corpora for Japanese LLMs, surpassing CC-100 (approximately 25.8 billion characters), mC4 (approximately 239.7 billion characters), and OSCAR 23.10 (approximately 74 billion characters). To confirm the quality of the corpus, we performed continual pre-training on Llama 2 7B, 13B, 70B, Mistral 7B v0.1, and Mixtral 8x7B Instruct as base LLMs and obtained consistent improvements (6.6-8.1 points) on Japanese benchmark datasets. We also demonstrate that the improvement on Llama 2 13B brought by the presented corpus was the largest among those from other existing corpora.
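A hedged sketch of the kind of extraction-and-filtering pass such a pipeline applies to crawled pages; the libraries (trafilatura, langdetect) and the length threshold are illustrative choices, not the paper's actual toolchain.

```python
# Keep a page only if main-content extraction succeeds, the text is long
# enough, and it is detected as Japanese.
import trafilatura
from langdetect import detect, LangDetectException

def keep_japanese_page(html: str, min_chars: int = 400) -> str | None:
    text = trafilatura.extract(html)  # strip navigation/boilerplate
    if not text or len(text) < min_chars:
        return None
    try:
        if detect(text) != "ja":
            return None
    except LangDetectException:
        return None
    return text
```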
Large language models (LLMs) are now widely used in various fields, including finance. However, no Japanese financial-specific LLM has been proposed yet. Hence, this study aims to construct a Japanese financial-specific LLM through continual pre-training. Before tuning, we constructed Japanese financial-focused datasets for continual pre-training. As a base model, we employed a Japanese LLM that achieved state-of-the-art performance on Japanese financial benchmarks among models in the 10-billion-parameter class. After continual pre-training using the datasets and the base model, the tuned model performed better than the original model on the Japanese financial benchmarks. Moreover, a comparison of outputs reveals that the tuned model's outputs tend to be better than the original model's in terms of the quality and length of the answers. These findings indicate that domain-specific continual pre-training is effective for LLMs as well. The tuned model is publicly available on Hugging Face.
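A minimal continual pre-training sketch using the Hugging Face Trainer, assuming a plain-text Japanese financial corpus; the model name, file name, and hyperparameters are placeholders, not the study's actual settings.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "base-japanese-llm"  # placeholder for the 10B-class base model
tok = AutoTokenizer.from_pretrained(model_name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

ds = load_dataset("text", data_files={"train": "ja_finance_corpus.txt"})
ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=2048),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ckpt", per_device_train_batch_size=1,
                           gradient_accumulation_steps=32, num_train_epochs=1,
                           learning_rate=1e-5, bf16=True),
    train_dataset=ds["train"],
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```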
This report presents a small language model (SLM) for Japanese clinical and medical text, named NCVC-slm-1. This 1B-parameter model was trained on Japanese text classified as high quality, augmented with clinical and medical content covering a variety of diseases, drugs, and examinations. Using carefully designed pre-processing and a specialized morphological analyzer and tokenizer, this small, lightweight model not only generates text but also demonstrates the feasibility of understanding clinical and medical text. In comparison with other large language models, a fine-tuned NCVC-slm-1 achieved the highest scores on 6 of the 8 tasks in JMED-LLM. This result indicates that SLMs can feasibly perform several downstream tasks in the clinical and medical field. We hope NCVC-slm-1 will contribute to developing and accelerating the field of clinical and medical NLP.
This article explores works by five authors who deal with the financial costs of trying to get pregnant, being pregnant, and giving birth in Japan today. It examines new reproductive technologies and the problems that arise when people other than married women try to access them. The underlying message is that if a woman wants to try something her own way, there are steep costs associated with it; alleviating these costs could perhaps contribute to increasing the birthrate.
The article examines Fumio Kishida's first year as Prime Minister of Japan with respect to his government's economic policy vision and its implementation. The latter is assessed by comparing it to the policy pursued over the past ten years by PM Kishida's predecessors in office, on the one hand, and to the commitments he announced during the 2021 election campaign, on the other. The paper notes that, in its essentials, the economic policy of the cabinet of the new leader of the ruling Liberal Democratic Party continues the course previously maintained by the team of then-PM Shinzō Abe, known as Abenomics. This manifests itself in the use of the basic tools of macroeconomic policy, monetary and tax instruments first among them, as well as in the choice of priorities for stimulating consumption and economic activity. The continuity of the course is particularly illustrated by loose monetary policy; a positive view of the depreciation of the Japanese national currency; moderately expansionary fiscal policy relying on domestic borrowing; and the preference given to the interests of the corporate sector. The legacy inherited from previous administrations also includes government efforts to revitalize private demand and to promote investment in R&D and in venture and innovative enterprises, with particular stress on regional economies. Kishida's particular emphasis on invigorating redistributive mechanisms and increasing labor's share of national income has not yet led to material decisions and actions by the government. Nor has Kishida made progress on his commitment to widen the circle of beneficiaries of the capitalist market system, part of his idea of "new capitalism".
In recent months, the attention of the government and its economic team has been largely focused on overcoming the consequences of the disruption of transnational supply chains in production, trade, and logistics caused by the coronavirus pandemic and rising geopolitical tensions. In addressing this issue, they have prioritized subsidizing businesses and households to partially offset energy and food price hikes, and securing the diversification or localization of critical links in trade and production chains, as well as the exclusion of politically undesirable or unstable locations from them. At this stage, however, the actions of the government are limited to setting goals and plans to finance programs whose effectiveness is yet to be proven.
Due to the expiration of the classification period of a number of previously secret materials related to the activities of the Soviet special services during the Second World War, and also due to the especially high public interest in the "case of Richard Sorge" over the last five years, the military historian and Doctor of Historical Sciences Mikhail A. Alekseyev has introduced into scholarly discourse a large number of previously unknown Russian-language documents on this case. These documents are of paramount interest to researchers. For example, this is the first publication of materials that definitively answer the questions of whether Sorge was a double agent and of what the real reason was for the failure of the Soviet military intelligence network in Japan in 1941, along with many other pieces of important documentary evidence of the activities of the Soviet secret services in that country.

Soon after that, the authors of this article made the first complete translation of the memoirs of Ishii Hanako, Sorge's Japanese wife, which they analyzed in detail and commented on in the book Another Sorge. The Story of Ishii Hanako. Ishii Hanako's memoirs offer a chance to take a fresh look at Richard Sorge's personality, his goals in studying Japan, and his approaches to this task, and to form a more personal and, at the same time, more objective picture of his character. Together with the file on perpetuating the memory of Richard Sorge from the "Special Folder" of the Central Committee of the CPSU, declassified in 2020, these materials allow us, for the first time, to fully evaluate Ishii Hanako's efforts to preserve the memory of Sorge in Japan from 1945 to 1964. By comparing the memoirs with the documents of the Soviet side and carrying out research and analytical work, the authors have reached a new level of understanding of the "case of Richard Sorge."

In 2017–2022, a number of new materials devoted to the same case were published in Japan. They reveal the level of awareness not only of Japan's government and law enforcement agencies, but also of the emperor himself, as well as the division of powers among the special services in the liquidation of Sorge's intelligence network.
Takanori Ashihara, Takafumi Moriya, Kohei Matsuura
et al.
Self-supervised learning (SSL) has been dramatically successful not only in monolingual but also in cross-lingual settings. However, since the two settings have generally been studied separately, there has been little research on how effective a cross-lingual model is in comparison with a monolingual one. In this paper, we investigate this fundamental question empirically on Japanese automatic speech recognition (ASR) tasks. First, we compare the ASR performance of cross-lingual and monolingual models on two different language tasks while keeping the acoustic domain as close to identical as possible. Then, we examine how much unlabeled Japanese data is needed to achieve performance comparable to a cross-lingual model pre-trained on tens of thousands of hours of English and/or multilingual data. Finally, we extensively investigate the effectiveness of SSL in Japanese and demonstrate state-of-the-art performance on multiple ASR tasks. Since there is no comprehensive SSL study for Japanese, we hope this study will guide Japanese SSL research.
We evaluate four state-of-the-art instruction-tuned large language models (LLMs) -- ChatGPT, Flan-T5 UL2, Tk-Instruct, and Alpaca -- on a set of 13 real-world clinical and biomedical natural language processing (NLP) tasks in English, such as named-entity recognition (NER), question-answering (QA), and relation extraction (RE). Our overall results demonstrate that the evaluated LLMs approach the performance of state-of-the-art models in zero- and few-shot scenarios for most tasks, performing particularly well on QA, even though they have never seen examples from these tasks before. However, we observed that on the classification and RE tasks they fall below what can be achieved with a model specifically trained for the medical field, such as PubMedBERT. Finally, we noted that no LLM outperforms all the others on all the studied tasks, with some models being better suited to certain tasks than others.
The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown promise in solving a wide variety of language and vision understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capability. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter that extends LLMs and VLMs to the audio domain by injecting audio embeddings into the input of the LLM, namely soft prompting. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both the input text and sounds, as inputs to the language model. To mitigate data scarcity in the audio domain, a curriculum learning strategy is proposed that formulates diverse audio tasks in a sequential manner. Moreover, we improve the audio language model by using interleaved audio-text embeddings as the input sequence. In this improved model, no constraints are imposed on the input format, so it is capable of tackling diverse modelling tasks, such as few-shot audio classification and audio comparison. To further evaluate the advanced abilities of audio networks, we introduce natural language audio reasoning (NLAR), a new task that analyses two audio clips through comparison and summarisation. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve competitive results compared with expert models (i.e., networks trained on the target datasets) across various tasks. We finally demonstrate APT's ability to extend frozen VLMs to the audio domain without fine-tuning, achieving promising results in audio-visual question answering. Our code and model weights will be released at https://github.com/JinhuaLiang/APT
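The soft-prompting mechanism can be sketched as a small aligner that projects audio features into the LLM's embedding space and prepends them to the text token embeddings; the dimensions and module structure below are illustrative assumptions (APT's actual aligner is instruction-aware, conditioning on the input text as well as the audio).

```python
import torch
import torch.nn as nn

class AudioSoftPrompt(nn.Module):
    """Map pooled audio features to n_prompt soft tokens in the LLM space."""
    def __init__(self, audio_dim: int = 768, llm_dim: int = 4096, n_prompt: int = 8):
        super().__init__()
        self.aligner = nn.Sequential(
            nn.Linear(audio_dim, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, n_prompt * llm_dim))
        self.n_prompt, self.llm_dim = n_prompt, llm_dim

    def forward(self, audio_feats: torch.Tensor, text_embeds: torch.Tensor):
        # audio_feats: (B, audio_dim); text_embeds: (B, T, llm_dim)
        prompts = self.aligner(audio_feats).view(-1, self.n_prompt, self.llm_dim)
        return torch.cat([prompts, text_embeds], dim=1)  # fed to the frozen LLM
```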