A Japanese Benchmark for Evaluating Social Bias in Reasoning Based on Attribution Theory
Taihei Shiotani, Masahiro Kaneko, Naoaki Okazaki
In enhancing the fairness of Large Language Models (LLMs), evaluating social biases rooted in the cultural contexts of specific linguistic regions is essential. However, most existing Japanese benchmarks rely heavily on translating English data, which does not necessarily provide an evaluation suitable for Japanese culture. Furthermore, they only evaluate bias in the conclusion, failing to capture biases lurking in the reasoning. In this study, based on attribution theory in social psychology, we constructed a new dataset, "JUBAKU-v2," which evaluates bias in attributing behaviors to in-groups and out-groups within reasoning while fixing the conclusion. This dataset consists of 216 examples reflecting cultural biases specific to Japan. Experimental results verified that it can detect performance differences across models more sensitively than existing benchmarks.
Predicting Talent Breakout Rate using Twitter and TV data
Bilguun Batsaikhan, Hiroyuki Fukuda
Early detection of rising talents is of paramount importance in the field of advertising. In this paper, we define a concept of talent breakout and propose a method to detect Japanese talents before their rise to stardom. The main focus of the study is to determine the effectiveness of combining Twitter and TV data on predicting time-dependent changes in social data. Although traditional time-series models are known to be robust in many applications, the success of neural network models in various fields (e.g., Natural Language Processing, Computer Vision, Reinforcement Learning) continues to spark an interest in the time-series community to apply new techniques in practice. Therefore, in order to find the best modeling approach, we have experimented with traditional, neural network and ensemble learning methods. We observe that ensemble learning methods outperform traditional and neural network models based on standard regression metrics. However, by utilizing the concept of talent breakout, we are able to assess the true forecasting ability of the models, where neural networks outperform traditional and ensemble learning methods in terms of precision and recall.
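The evaluation idea above, converting continuous forecasts into discrete breakout events and scoring them with precision and recall, can be sketched as follows (the jump threshold and the series are hypothetical illustrations, not the paper's data or definition):

```python
# Hedged sketch: score forecasts by the "breakout" events they imply,
# rather than by regression error. All numbers are made up.
actual    = [10, 12, 45, 11, 60, 13, 14, 55]   # e.g., weekly mention counts
predicted = [11, 13, 40, 30, 58, 12, 13, 20]   # a model's forecasts

def breakouts(series, threshold=2.0):
    # A talent "breaks out" at step i when the value jumps past
    # threshold x the previous value (hypothetical event definition).
    return {i for i in range(1, len(series))
            if series[i] > threshold * series[i - 1]}

true_events = breakouts(actual)
pred_events = breakouts(predicted)

tp = len(true_events & pred_events)
precision = tp / len(pred_events) if pred_events else 0.0
recall = tp / len(true_events) if true_events else 0.0
```

A model with low regression error can still miss most breakout events, which is why event-level precision and recall expose forecasting ability that standard regression metrics hide.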
Enhancing Small Language Models for Cross-Lingual Generalized Zero-Shot Classification with Soft Prompt Tuning
Fred Philippy, Siwen Guo, Cedric Lothritz
et al.
In NLP, Zero-Shot Classification (ZSC) has become essential for enabling models to classify text into categories unseen during training, particularly in low-resource languages and domains where labeled data is scarce. While pretrained language models (PLMs) have shown promise in ZSC, they often rely on large training datasets or external knowledge, limiting their applicability in multilingual and low-resource scenarios. Recent approaches leveraging natural language prompts reduce the dependence on large training datasets but struggle to effectively incorporate available labeled data from related classification tasks, especially when these datasets originate from different languages or distributions. Moreover, existing prompt-based methods typically rely on manually crafted prompts in a specific language, limiting their adaptability and effectiveness in cross-lingual settings. To address these challenges, we introduce RoSPrompt, a lightweight and data-efficient approach for training soft prompts that enhance cross-lingual ZSC while ensuring robust generalization across data distribution shifts. RoSPrompt is designed for small multilingual PLMs, enabling them to leverage high-resource languages to improve performance in low-resource settings without requiring extensive fine-tuning or high computational costs. We evaluate our approach on multiple multilingual PLMs across datasets covering 106 languages, demonstrating strong cross-lingual transfer performance and robust generalization capabilities over unseen classes.
Self-Cognition in Large Language Models: An Exploratory Study
Dongping Chen, Jiawen Shi, Yao Wan
et al.
While Large Language Models (LLMs) have achieved remarkable success across various applications, they also raise concerns regarding self-cognition. In this paper, we perform a pioneering study to explore self-cognition in LLMs. Specifically, we first construct a pool of self-cognition instruction prompts to evaluate whether an LLM exhibits self-cognition, along with four well-designed principles to quantify LLMs' self-cognition. Our study reveals that 4 of the 48 models on Chatbot Arena, specifically Command R, Claude3-Opus, Llama-3-70b-Instruct, and Reka-core, demonstrate some level of detectable self-cognition. We observe a positive correlation between model size, training data quality, and self-cognition level. Additionally, we explore the utility and trustworthiness of LLMs in the self-cognition state, revealing that this state enhances performance on some specific tasks, such as creative writing and exaggeration. We believe that our work can serve as an inspiration for further research on self-cognition in LLMs.
Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models
Adam Karvonen
Language models have shown unprecedented capabilities, sparking debate over the source of their performance. Is it merely the outcome of learning syntactic patterns and surface-level statistics, or do they extract semantics and a world model from the text? Prior work by Li et al. investigated this by training a GPT model on synthetic, randomly generated Othello games and found that the model learned an internal representation of the board state. We extend this work into the more complex domain of chess, training on real games and investigating our model's internal representations using linear probes and contrastive activations. The model is given no a priori knowledge of the game and is solely trained on next character prediction, yet we find evidence of internal representations of board state. We validate these internal representations by using them to make interventions on the model's activations and edit its internal board state. Unlike Li et al.'s prior synthetic dataset approach, our analysis finds that the model also learns to estimate latent variables like player skill to better predict the next character. We derive a player skill vector and add it to the model, improving the model's win rate by up to 2.6 times.
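The linear-probe methodology mentioned above can be illustrated with a minimal sketch: a probe is just a linear classifier trained on frozen model activations. Here synthetic clustered vectors stand in for hidden states extracted from the chess model; all shapes, class counts, and data are hypothetical, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 512-dim activations, 3 classes per board square
# (e.g., empty / own piece / opponent piece), 2000 probe-training examples.
D, C, N = 512, 3, 2000

# Synthetic "activations": each class clusters around a distinct mean,
# standing in for hidden states that linearly encode board state.
means = rng.normal(0, 1, (C, D))
labels = rng.integers(0, C, N)
acts = means[labels] + rng.normal(0, 0.5, (N, D))

# A linear probe is multinomial logistic regression on the activations.
W = np.zeros((D, C))
for _ in range(200):  # plain full-batch gradient descent
    logits = acts @ W
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    onehot = np.eye(C)[labels]
    W -= 0.1 * acts.T @ (p - onehot) / N

acc = (np.argmax(acts @ W, axis=1) == labels).mean()
```

High probe accuracy on held-out activations is the usual evidence that the probed information is linearly represented; the interventions described in the abstract go further by editing activations along the learned directions.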
Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks
Zhifan Sun, Antonio Valerio Miceli-Barone
Large Language Models (LLMs) are increasingly becoming the preferred foundation platforms for many Natural Language Processing tasks such as Machine Translation, owing to their quality often comparable to or better than task-specific models, and the simplicity of specifying the task through natural language instructions or in-context examples. Their generality, however, opens them up to subversion by end users who may embed into their requests instructions that cause the model to behave in unauthorized and possibly unsafe ways. In this work we study these Prompt Injection Attacks (PIAs) on multiple families of LLMs on a Machine Translation task, focusing on the effects of model size on the attack success rates. We introduce a new benchmark data set and we discover that on multiple language pairs and injected prompts written in English, larger models under certain conditions may become more susceptible to successful attacks, an instance of the Inverse Scaling phenomenon (McKenzie et al., 2023). To our knowledge, this is the first work to study non-trivial LLM scaling behaviour in a multi-lingual setting.
Native vs Non-Native Language Prompting: A Comparative Analysis
Mohamed Bayan Kmainasi, Rakif Khan, Ali Ezzat Shahroor
et al.
Large language models (LLMs) have shown remarkable abilities in different fields, including standard Natural Language Processing (NLP) tasks. To elicit knowledge from LLMs, prompts, which consist of natural language instructions, play a key role. Most open- and closed-source LLMs are trained on available labeled and unlabeled resources--digital content such as text, images, audio, and videos. Hence, these models have better knowledge of high-resource languages but struggle with low-resource languages. Since prompts play a crucial role in understanding their capabilities, the language used for prompts remains an important research question. Although there has been significant research in this area, it is still limited, and little has been explored for medium- to low-resource languages. In this study, we investigate different prompting strategies (native vs. non-native) on 11 different NLP tasks associated with 12 different Arabic datasets (9.7K data points). In total, we conducted 197 experiments involving 3 LLMs, 12 datasets, and 3 prompting strategies. Our findings suggest that, on average, the non-native prompt performs the best, followed by mixed and native prompts.
Leveraging Large Language Models to Measure Gender Representation Bias in Gendered Language Corpora
Erik Derner, Sara Sansalvador de la Fuente, Yoan Gutiérrez
et al.
Large language models (LLMs) often inherit and amplify social biases embedded in their training data. A prominent social bias is gender bias. In this regard, prior work has mainly focused on gender stereotyping bias - the association of specific roles or traits with a particular gender - in English and on evaluating gender bias in model embeddings or generated outputs. In contrast, gender representation bias - the unequal frequency of references to individuals of different genders - in the training corpora has received less attention. Yet such imbalances in the training data constitute an upstream source of bias that can propagate and intensify throughout the entire model lifecycle. To fill this gap, we propose a novel LLM-based method to detect and quantify gender representation bias in LLM training data in gendered languages, where grammatical gender challenges the applicability of methods developed for English. By leveraging the LLMs' contextual understanding, our approach automatically identifies and classifies person-referencing words in gendered language corpora. Applied to four Spanish-English benchmarks and five Valencian corpora, our method reveals substantial male-dominant imbalances. We show that such biases in training data affect model outputs, but can surprisingly be mitigated by leveraging small-scale training on datasets that are biased towards the opposite gender. Our findings highlight the need for corpus-level gender bias analysis in multilingual NLP. We make our code and data publicly available.
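The frequency-based notion of gender representation bias described above reduces to a simple count ratio once person-referencing words have been classified. In this sketch, tiny hand-made Spanish word lists stand in for the LLM classifier, and the corpus snippet is purely illustrative:

```python
# Hedged sketch: quantify gender representation bias as the ratio of
# masculine to feminine person references. Word lists are illustrative
# stand-ins for the paper's LLM-based classification step.
masculine = {"profesor", "niño", "él", "hombre"}
feminine  = {"profesora", "niña", "ella", "mujer"}

corpus = "el profesor y el niño hablaron con la profesora ; él y el hombre salieron"
tokens = corpus.split()

m = sum(t in masculine for t in tokens)
f = sum(t in feminine for t in tokens)
ratio = m / f   # > 1 indicates a male-dominant corpus
```

The point of using an LLM rather than fixed word lists is context: in gendered languages, whether a word refers to a person (and with what gender) often depends on the sentence, which static lists cannot capture.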
Cross-lingual Named Entity Corpus for Slavic Languages
Jakub Piskorski, Michał Marcińczuk, Roman Yangarber
This paper presents a corpus manually annotated with named entities for six Slavic languages - Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017-2023 as a part of the Workshops on Slavic Natural Language Processing. The corpus consists of 5,017 documents on seven topics. The documents are annotated with five classes of named entities. Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits - single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture with the pre-trained multilingual models - XLM-RoBERTa-large for named entity mention recognition and categorization, and mT5-large for named entity lemmatization and linking.
Mapping 'when'-clauses in Latin American and Caribbean languages: an experiment in subtoken-based typology
Nilo Pedrazzini
Languages can encode temporal subordination lexically, via subordinating conjunctions, and morphologically, by marking the relation on the predicate. Systematic cross-linguistic variation among the former can be studied using well-established token-based typological approaches to token-aligned parallel corpora. Variation among different morphological means is instead much harder to tackle, and therefore more poorly understood, despite being predominant in several language groups. This paper explores variation in the expression of generic temporal subordination ('when'-clauses) among the languages of Latin America and the Caribbean, where morphological marking is particularly common. It presents probabilistic semantic maps computed on the basis of the languages of the region alone, thus avoiding bias towards the many languages of the world that exclusively use lexified connectors, and it incorporates associations between character n-grams and English 'when'. The approach makes it possible to capture morphological clause-linkage devices in addition to lexified connectors, paving the way for larger-scale, strategy-agnostic analyses of typological variation in temporal subordination.
Yūjo in the Off Hours: Female Intimacy in Chikamatsu's Contemporary Life Plays
Jyana S. Browne
-
Language and Literature, Japanese language and literature
Antarlekhaka: A Comprehensive Tool for Multi-task Natural Language Annotation
Hrishikesh Terdalkar, Arnab Bhattacharya
One of the primary obstacles in the advancement of Natural Language Processing (NLP) technologies for low-resource languages is the lack of annotated datasets for training and testing machine learning models. In this paper, we present Antarlekhaka, a tool for manual annotation of a comprehensive set of tasks relevant to NLP. The tool is Unicode-compatible, language-agnostic, Web-deployable and supports distributed annotation by multiple simultaneous annotators. The system sports user-friendly interfaces for 8 categories of annotation tasks. These, in turn, enable the annotation of a considerably larger set of NLP tasks. The task categories include two linguistic tasks not handled by any other tool, namely, sentence boundary detection and deciding canonical word order, which are important tasks for text that is in the form of poetry. We propose the idea of sequential annotation based on small text units, where an annotator performs several tasks related to a single text unit before proceeding to the next unit. The research applications of the proposed mode of multi-task annotation are also discussed. Antarlekhaka outperforms other annotation tools in objective evaluation. It has also been used for two real-life annotation tasks on two different languages, namely, Sanskrit and Bengali. The tool is available at https://github.com/Antarlekhaka/code.
Beyond Classification: Financial Reasoning in State-of-the-Art Language Models
Guijin Son, Hanearl Jung, Moonjeong Hahm
et al.
Large Language Models (LLMs), consisting of 100 billion or more parameters, have demonstrated remarkable ability in complex multi-step reasoning tasks. However, the application of such generic advancements has been limited to a few fields, such as clinical or legal, with the field of financial reasoning remaining largely unexplored. To the best of our knowledge, the ability of LLMs to solve financial reasoning problems has never been systematically examined, and at what scale it emerges remains unknown. To address this knowledge gap, this research presents a comprehensive investigation into the potential application of LLMs in the financial domain. The investigation includes a detailed exploration of a range of subjects, including task formulation, synthetic data generation, prompting methods, and capability evaluation. Furthermore, the study benchmarks various GPT variants with parameter scales ranging from 2.8B to 13B, with and without instruction tuning, on diverse dataset sizes. By analyzing the results, we reveal that the ability to generate coherent financial reasoning first emerges at 6B parameters, and continues to improve with better instruction-tuning or larger datasets. Additionally, the study provides a publicly accessible dataset named sFIOG (Synthetic-Financial Investment Opinion Generation), consisting of 11,802 synthetic investment thesis samples, to support further research in the field of financial reasoning. Overall, this research seeks to contribute to the understanding of the efficacy of language models in the field of finance, with a particular emphasis on their ability to engage in sophisticated reasoning and analysis within the context of investment decision-making.
Simple is Better and Large is Not Enough: Towards Ensembling of Foundational Language Models
Nancy Tyagi, Aidin Shiri, Surjodeep Sarkar
et al.
Foundational Language Models (FLMs) have advanced natural language processing (NLP) research. Current researchers are developing larger FLMs (e.g., XLNet, T5) to enable contextualized language representation, classification, and generation. While developing larger FLMs has been of significant advantage, it is also a liability concerning hallucination and predictive uncertainty. Fundamentally, larger FLMs are built on the same foundations as smaller FLMs (e.g., BERT); hence, one must recognize the potential of smaller FLMs, which can be realized through an ensemble. In the current research, we perform a reality check on FLMs and their ensembles on benchmark and real-world datasets. We hypothesize that ensembling FLMs can influence the individualistic attention of FLMs and unravel the strength of coordination and cooperation among different FLMs. We utilize BERT and define three ensemble techniques, Shallow, Semi, and Deep, wherein the Deep-Ensemble introduces a knowledge-guided reinforcement learning approach. We find that the proposed Deep-Ensemble BERT outperforms its larger variant, BERT-large, many times over on datasets that show the usefulness of NLP in sensitive fields, such as mental health.
Analysis of the Regret Speech Act in Japanese
Azila Dinda Amalia, Nuria Haristiani
This study examines how characters in Japanese animation (anime) express regret. Anime has served as material in many prior studies because it lends itself to analysis from a variety of perspectives. The data for this analysis come from 14 episodes of the 24-minute anime series Golden Time, which depicts the daily lives of Japanese college students, whose many struggles are conveyed through expressions of regret. The data, drawn from transcripts of the characters' conversations, comprise 54 regret speech act utterances, which were examined using a qualitative descriptive approach and classified into the varieties of regret identified by Pink. The analysis shows that the noni form dominates the regret speech acts performed in the anime, and that the most common type concerns regret over an action or opportunity that someone should have taken. The study also finds that Japanese speakers tend to express regret by stating facts that differ from what they expected, and may also blame themselves for their actions when doing so.
Japanese language and literature
The Effects of Individual-Level and Group-Level Trust on Willingness to Communicate in Group Language Learning
Takehiko Ito, Mariko Furuyabu, Jennifer Toews-Shimizu
This study examined the effects of individual-level and group-level trust on willingness to communicate (WTC) in a second language, targeting Japanese university students in a group language (English) learning setting. Although the effects of group language learning on students' learning attitudes and the effects of trust on WTC in a second language have been examined extensively, no study has examined group-level factors in a group language learning setting. A questionnaire survey was conducted three times per semester. Multilevel analysis found that individual-level trust in group members positively influenced individual-level WTC in English, and group-level trust in group members also positively influenced group-level WTC in English repeatedly throughout the semester. Moreover, the degree of group-level WTC in English changed after mid-semester. This study contributes to the literature on group language learning, and has implications for language education: educators must be mindful not only of each student's characteristics but also of each group's characteristics to enhance their performance.
Amidaist Practices in Zoku Honcho Ojoden (“Continuation of the Biographies of Japanese Reborn Into the Pure Land”)
A. A. Petrova
The article discusses practices for reaching rebirth in the Pure Land recounted in Zoku Honchō Ōjōden ("Continuation of the Biographies of Japanese Reborn Into the Pure Land"), composed in 1101-1111 by Ōe-no Masafusa. These practices include those mentioned in the stories as being performed during one's lifetime, intended to show one's strong devotion to Pure Land, as well as death-bed practices: the description of the death hour is the crucial point of every biography. Some of these practices belong to the Pure Land tradition (the most important to be mentioned is nenbutsu, "recollection of Buddha [Amida]"), while others are more likely to be attributed to other traditions (the most important one being reading and reciting the Lotus Sutra): the author obviously does not feel any need to draw a line between them. Normally, these practices are only mentioned in the text and not discussed in detail. This aspect of Zoku Honchō Ōjōden is analyzed in comparison with other important Pure Land texts: Nihon Ōjō Gokuraku-ki ("Japanese Records of Rebirth in the Land of Supreme Joy") by Yoshishige-no Yasutane and Ōjōyōshū ("The Essentials of Rebirth in the Pure Land") by Genshin. As compared to Nihon Ōjō Gokuraku-ki, in Zoku Honchō Ōjōden, much stronger emphasis is placed on the death-bed practices than on the lifetime actions and evidence of rebirth. Often, the text focuses on the state of mind of the dying person, his or her determination in performing death-bed practices. In his work, Ōe-no Masafusa leans on the idea expressed in Ōjōyōshū that it is the last moments of life that are decisive and determine one's rebirth, illustrating it with examples.
Japanese language and literature
Koherensi pada Video Blog Berbahasa Jepang (Coherence in Japanese-Language Video Blogs)
Reny Wiyatasari, Mahardita Hideko
The aim of this research is to describe the elements of coherence in the discourse of a Japanese video blog. The data are taken from a video blog entitled '[Chōkai | Nihongo Kaiwa] Daigaku No Tomodachi to Hanasu Toki <Nihongo Joukyuushamuke>', uploaded to YouTube on January 22, 2020. Data were collected using the observation (simak) method, with tapping (sadap) as its extension technique. In the analysis phase, the distributional (agih) method was applied with the immediate-constituent division technique, and the results are presented using informal presentation methods. The analysis identified five instances of coherence showing interconnected discourse, as seen from the use of cohesion markers; these five instances thus demonstrate the coherence of the discourse.
Japanese language and literature
Personal and Animal Names in Japanese and Russian Texts Translated from a Fantasy Novel
T. Ninomiya
This paper analyzed the Russian and Japanese translations of Harry Potter and the Philosopher's Stone from the viewpoint of Descriptive Translation Studies as proposed by Toury (1995). First, the researchers situated the texts within the target culture systems, looking at their acceptability. The results revealed that although major Japanese and Russian bookstores categorized the translated books as children's literature, the Japanese translator did not target a particular readership, while the Russian translator and publishing company categorized the book as a роман ('novel'); in the Russian-speaking world, the роман is usually read not by children but by adolescents and adults. Second, the authors surveyed the treatment of personal and animal names in the Japanese and Russian translations. Based on the seven translation procedures proposed by Davies (2003) (i.e., Preservation, Addition, Globalization, Omission, Localization, Transformation, and Creation), the researchers classified the names and counted the occurrences of each procedure. The results showed that Preservation was the most frequently used procedure in both texts: about 80% of cases in the Japanese translation, and about 50% in the Russian translation, where Localization accounted for a further 30%. The Russian translator thus used Preservation and Localization frequently. According to Jaleniauskienė and Čičelytė (2009), Preservation is a procedure that emphasizes the source language, while Localization emphasizes the target language. These results indicate that the Russian translation puts more emphasis on the target language than the Japanese translation does.
The zero-address form in the Japanese address system
Yayan Suyana, Suhandano Suhandano, Tatang Hariri
In the Japanese language, there are various forms of address: second-person pronouns such as anata (you) and kimi (you); personal names such as Nakamura and Yamaguchi; kinship terms such as okaasan (mother) and otousan (father); and profession names such as sensei (teacher/doctor). In addition to these, zero forms of address are also known, in which address words are left implicit; examples include greetings (aisatsu) such as ohayou gozaimasu (good morning), irasshaimase (welcome), and sumimasen (excuse me/sorry). This study focuses on the zero-address form, identifying its forms and variations and examining the functions and factors that influence its use from sociolinguistic and pragmatic perspectives. The study finds four variations of the zero-address form: (1) greetings; (2) exclamations or interjections; (3) interrogative sentences; and (4) declarative sentences. Greetings are of two kinds, formal and informal. The functions of the zero-address form are to show respect, to show closeness, to attract attention, and to give notification or make a statement. The factors that influence the use of the zero-address form are social status, social distance, situation, and the identity of the speaker.
Japanese language and literature