Dive Into the Past: Chinese Intangible Cultural Heritage Represented in Chinese Mobile Games
Yixiang Que, Teresa de La Hera, Jeroen Jansz
This study delves into the representation of Chinese intangible cultural heritage in Chinese mobile games. Despite their huge economic potential, Chinese mobile games still face negative comments, especially from parents. To improve their reputation, Chinese mobile games have portrayed a considerable amount of Chinese intangible cultural heritage (ICH). The representation of cultural heritage in video games has long been a subject of discussion, although console and computer games and tangible cultural heritage have been at the centre of academic studies. Hence, we investigated how Chinese ICH is depicted in Chinese mobile games. A thematic analysis was conducted on data collected through analytical gameplay of 30 Chinese mobile games selected from Apple's App Store and the Chinese TapTap platform. Our analysis indicates that Chinese ICH is represented in Chinese mobile games to flesh out the personas of game characters, to create a more engaging and immersive game-world experience, to provide more intense combat moments, to offer visually enjoyable virtual goods, and to deliver culturally driven updates during traditional festivals. Among the various forms of Chinese ICH, Chinese traditional craftsmanship is used most frequently because of the great effort it requires to master, which can be educational for players. Theatrical traditions are depicted across the most game elements because they combine visual, audio, and motion elements. Lastly, folk and classical literature is portrayed in Chinese mobile games to provide a narrative resource that culturally resonates with players.
Communication. Mass media
Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties
Jiyoung Lee, Seungho Kim, Jieun Han
et al.
Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties. This narrow focus may raise fairness concerns as degraded performance on non-standard varieties can lead to unequal benefits for users worldwide. Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-standard English varieties. We introduce Trans-EnV, a framework that automatically transforms SAE datasets into multiple English varieties to evaluate the linguistic robustness. Our framework combines (1) linguistics expert knowledge to curate variety-specific features and transformation guidelines from linguistic literature and corpora, and (2) LLM-based transformations to ensure both linguistic validity and scalability. Using Trans-EnV, we transform six benchmark datasets into 38 English varieties and evaluate seven state-of-the-art LLMs. Our results reveal significant performance disparities, with accuracy decreasing by up to 46.3% on non-standard varieties. These findings highlight the importance of comprehensive linguistic robustness evaluation across diverse English varieties. Each construction of Trans-EnV was validated through rigorous statistical testing and consultation with a researcher in the field of second language acquisition, ensuring its linguistic validity. Our code and datasets are publicly available at https://github.com/jiyounglee-0523/TransEnV and https://huggingface.co/collections/jiyounglee0523/transenv-681eadb3c0c8cf363b363fb1.
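To make the transformation idea concrete, here is a minimal, purely illustrative sketch of rule-based variety rewriting. It is not the authors' pipeline, which additionally uses LLM-based transformations for linguistic validity and scalability; the variety names and rules below are hypothetical simplifications.

```python
import re

# Hypothetical, drastically simplified variety-specific rules; the real
# framework curates features from linguistic literature and corpora.
VARIETY_RULES = {
    "variety_A": [
        (r"\bisn't\b", "ain't"),
        (r"\bdoesn't\b", "don't"),
    ],
    "variety_B": [
        (r"\bdiscuss\b", "discuss about"),
    ],
}

def transform(sentence: str, variety: str) -> str:
    """Apply each curated rule for the target variety, in order."""
    for pattern, replacement in VARIETY_RULES[variety]:
        sentence = re.sub(pattern, replacement, sentence)
    return sentence

print(transform("She doesn't know if it isn't true.", "variety_A"))
# -> "She don't know if it ain't true."
```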
The role of zealots in the spread of linguistic traits
Vivian Dornelas, Celia Anteneodo, Renan Nunes
et al.
We investigate the diffusion of linguistic innovations on a fully connected network in order to understand the emergence of linguistic diversity. We employ an agent-based dynamics based on the Axelrod model, where interactions between agents are driven by homophily and social influence, with the difference that we assume that all agents share a number of common features that ensure a finite probability of pairwise interaction. We start from a homogeneous population and introduce zealots that act like agents spreading linguistic innovations, without being influenced by other agents. We analyze how different factors, such as the degree of cohesion and number of zealots in different linguistic states, determine the linguistic configurations that populations can adopt and contribute to the possible emergence of a multi-linguistic community. The results are compared with those derived within the mean-field approximation.
nlin.AO, physics.soc-ph
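A minimal agent-based sketch of dynamics in the spirit of this abstract, assuming an Axelrod-style update rule in which a shared common feature guarantees a nonzero interaction probability; the parameter values and the exact interaction probability are my assumptions, not the authors' specification.

```python
import random

N, F, Q = 200, 5, 3            # agents, mutable features, traits per feature
N_ZEALOTS, STEPS = 10, 200_000
INNOVATION = [Q] * F           # the zealots' fixed novel linguistic state

agents = [[0] * F for _ in range(N)]   # homogeneous initial population
zealots = set(range(N_ZEALOTS))
for z in zealots:
    agents[z] = INNOVATION[:]

for _ in range(STEPS):
    i = random.randrange(N)
    if i in zealots:                   # zealots are never influenced
        continue
    j = random.randrange(N)
    if i == j:
        continue
    shared = sum(a == b for a, b in zip(agents[i], agents[j]))
    # Homophily with a guaranteed floor: the +1 plays the role of the
    # shared common feature, keeping the interaction probability finite.
    if random.random() < (shared + 1) / (F + 1):
        differing = [k for k in range(F) if agents[i][k] != agents[j][k]]
        if differing:
            k = random.choice(differing)
            agents[i][k] = agents[j][k]   # social influence: copy one trait

adopters = sum(agents[i] == INNOVATION for i in range(N) if i not in zealots)
print(f"non-zealots who fully adopted the innovation: {adopters}/{N - N_ZEALOTS}")
```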
EqualizeIR: Mitigating Linguistic Biases in Retrieval Models
Jiali Cheng, Hadi Amiri
This study finds that existing information retrieval (IR) models show significant biases based on the linguistic complexity of input queries, performing well on linguistically simpler (or more complex) queries while underperforming on linguistically more complex (or simpler) queries. To address this issue, we propose EqualizeIR, a framework to mitigate linguistic biases in IR models. EqualizeIR uses a linguistically biased weak learner to capture linguistic biases in IR datasets and then trains a robust model by regularizing and refining its predictions using the biased weak learner. This approach effectively prevents the robust model from overfitting to specific linguistic patterns in data. We propose four approaches for developing linguistically-biased models. Extensive experiments on several datasets show that our method reduces performance disparities across linguistically simple and complex queries, while improving overall retrieval performance.
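The abstract does not spell out the regularization, so the following is a sketch of one standard recipe for this style of debiasing, a product-of-experts objective in which a frozen biased weak learner absorbs the linguistic shortcuts; EqualizeIR's actual formulation may differ.

```python
import torch
import torch.nn.functional as F

def poe_debiased_loss(robust_logits, biased_logits, labels):
    """Product-of-experts: combine log-probabilities so the robust model
    is only rewarded for signal the frozen biased model cannot supply."""
    combined = (F.log_softmax(robust_logits, dim=-1)
                + F.log_softmax(biased_logits.detach(), dim=-1))
    return F.cross_entropy(combined, labels)

# Toy usage with random tensors standing in for the two models' outputs.
robust_logits = torch.randn(8, 2, requires_grad=True)
biased_logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = poe_debiased_loss(robust_logits, biased_logits, labels)
loss.backward()   # gradients reach only the robust model's parameters
print(loss.item())
```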
PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics
Atharva Naik, Prakam, Yash Mathur
et al.
Although many benchmarks evaluate the reasoning abilities of Large Language Models (LLMs) within domains such as mathematics, coding, or data wrangling, few abstract away from domain specifics to examine reasoning as a capability in and of itself. We contribute a novel type of benchmark evaluating the inductive reasoning capabilities of LLMs that is inspired by the forward reconstruction task from historical linguistics but is formulated in an extremely simple, general way (in the form of Programming by Examples). The task involves generating a cascade of simple string rewrite programs to transform a given list of input strings into a list of desired output strings. We present a fully automated pipeline that programmatically generates problems of this type with controllable difficulty, enabling scalable evaluation of reasoning models while avoiding contamination. Using this approach, we construct two benchmarks: PBEBench-Lite, which efficiently stratifies models of varying capabilities, and PBEBench, which requires models to induce programs similar in complexity to those constructed by historical linguists. Our experiments reveal a substantial performance gap between models that leverage test-time compute or LCoT (long chain-of-thought) reasoning and those that do not. Moreover, although recent models show promise, the solve rate on both hard tiers of the PBEBench dataset (ground-truth cascade lengths of 20 and 30, respectively) drops below 5%, falling well short of realistic historical linguistics requirements even with computationally expensive, popular scaling techniques from the PBE and reasoning literature. Additionally, we study the effectiveness of different scaling strategies and the impact of various hyperparameters on the difficulty of the generated data using gpt-oss-120b, the best-performing open-source model.
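To illustrate the task format, here is a hedged sketch of what a cascade of string rewrite programs could look like; the benchmark's actual rule inventory, modeled on sound-change rules from historical linguistics, is richer than the plain character substitutions used here.

```python
import random

ALPHABET = "abcdefg"

def random_cascade(length: int):
    """Sample an ordered list of rewrite rules; an empty target deletes."""
    return [(random.choice(ALPHABET),
             random.choice(list(ALPHABET) + [""]))
            for _ in range(length)]

def apply_cascade(cascade, s: str) -> str:
    """Apply every rule in order; each rewrites all occurrences."""
    for src, dst in cascade:
        s = s.replace(src, dst)
    return s

# A problem instance pairs inputs with outputs produced by a hidden
# cascade; the model under test must induce a behaviourally equivalent one.
hidden = random_cascade(5)
inputs = ["abacab", "gefed", "cabbage"]
print([(s, apply_cascade(hidden, s)) for s in inputs])
```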
Imperfect Language, Artificial Intelligence, and the Human Mind: An Interdisciplinary Approach to Linguistic Errors in Native Spanish Speakers
Francisco Portillo López
Linguistic errors are not merely deviations from normative grammar; they offer a unique window into the cognitive architecture of language and expose the current limitations of artificial systems that seek to replicate them. This project proposes an interdisciplinary study of linguistic errors produced by native Spanish speakers, with the aim of analyzing how current large language models (LLMs) interpret, reproduce, or correct them. The research integrates three core perspectives: theoretical linguistics, to classify and understand the nature of the errors; neurolinguistics, to contextualize them within real-time language processing in the brain; and natural language processing (NLP), to evaluate how artificial systems handle these errors. A purpose-built corpus of over 500 authentic errors produced by native Spanish speakers will serve as the foundation for empirical analysis. These errors will be tested against AI models such as GPT or Gemini to assess their interpretative accuracy and their ability to generalize patterns of human linguistic behavior. The project contributes not only to the understanding of Spanish as a native language but also to the development of NLP systems that are more cognitively informed and capable of engaging with the imperfect, variable, and often ambiguous nature of real human language.
Tout ce qui n’est pas dit par Madeleine Bourdouxhe. L’affirmation féminine de Sept Nouvelles à travers le silence
Jorge Alcón Borrega
After years in the background, the work of Madeleine Bourdouxhe (1906-1996) has been rediscovered, and critics have found in her narratives a message of affirmation of female identity that was ahead of its time. Sept nouvelles gathers some of the short stories the author wrote over the course of her life, which portray various women, contemporaries of the writer, who suffer from the lack of identity imposed on them by society and which forces them to be conditioned by their husbands.
Through these stories, the author expresses the need of women in the mid-twentieth century to attain an identity of their own. However, this message is not presented explicitly in the work; it is developed through various literary and linguistic devices that convey to the reader the emptiness felt by her characters. The women's brief interventions, their characteristics and, above all, their silences form the focus of our analysis.
Philology. Linguistics, French literature - Italian literature - Spanish literature - Portuguese literature
What fifty-one years of Linguistics and Artificial Intelligence research tell us about their correlation: A scientometric analysis
Mohammed Q. Shormani
There is a strong correlation between linguistics and artificial intelligence (AI), best manifested by deep learning language models. This study provides a thorough scientometric analysis of this correlation, synthesizing the intellectual production over 51 years, from 1974 to 2024. The Web of Science Core Collection (WoSCC) database was the data source. The data were analyzed with two software packages, CiteSpace and VOSviewer, which generated mapping visualizations of the intellectual landscape, trending issues, and (re)emerging hotspots. The results indicate that in the 1980s and 1990s, research on linguistics and AI (AIL) was not robust, with unstable publication output over time. It has since witnessed a remarkable increase, reaching 1478 articles in 2023 and 546 articles in the January-March 2024 timespan, involving emerging issues such as natural language processing, cross-sectional studies, bidirectional encoder representations, and ChatGPT, and hotspots such as novice programmers, prioritization, and artificial intelligence, addressing new horizons and new topics and launching new applications and powerful deep learning language models, including ChatGPT. The study concludes that the linguistics-AI correlation is established at several levels, with research centers, journals, and countries shaping AIL knowledge production and reshaping its future frontiers.
Explainability of machine learning approaches in forensic linguistics: a case study in geolinguistic authorship profiling
Dana Roemling, Yves Scherrer, Aleksandra Miletic
Forensic authorship profiling uses linguistic markers to infer characteristics about an author of a text. This task is paralleled in dialect classification, where a prediction is made about the linguistic variety of a text based on the text itself. While there have been significant advances in recent years in variety classification, forensic linguistics rarely relies on these approaches due to their lack of transparency, among other reasons. In this paper we therefore explore the explainability of machine learning approaches considering the forensic context. We focus on variety classification as a means of geolinguistic profiling of unknown texts based on social media data from the German-speaking area. For this, we identify the lexical items that are the most impactful for the variety classification. We find that the extracted lexical features are indeed representative of their respective varieties and note that the trained models also rely on place names for classifications.
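A generic sketch of how impactful lexical items can be read off a variety classifier, assuming a linear model over lexical features; the paper's actual models and explainability methods are not specified here, and the toy texts and variety labels below are invented placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy data: real work would use large social media corpora.
texts = ["grüezi mitenand", "servus beinand", "moin moin ihr beiden"]
varieties = ["Swiss", "Austrian", "North German"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression().fit(X, varieties)

vocab = np.array(vectorizer.get_feature_names_out())
for i, variety in enumerate(clf.classes_):
    top = np.argsort(clf.coef_[i])[-3:][::-1]   # highest-weight lexical items
    print(variety, vocab[top].tolist())
```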
Análise da dinâmica e especialização das atividades criativas nas capitais brasileiras e cidades criativas da Unesco
Jonas da Silva Henrique
This article aims to discuss the growth of creative activities that interface with tourism in the 27 Brazilian state capitals and in the UNESCO-recognized creative cities of Santos (SP) and Paraty (RJ). To this end, a statistical analysis was carried out with data from the Annual Social Information Report (Relação Anual de Informações Sociais) from 2011 to 2020. To compare local potential across different periods, the locational quotient, shift-share analysis, and its Esteban-Marquillas extension were used. The results show that, from 2017 onwards, the contraction rates of creative activities exceeded those of conventional activities. The locational quotient indicates that there were no abrupt structural changes in the spatial concentration of creative activities, while the shift-share analysis highlights the worsening contraction of the creative economy, coinciding with the periods of economic and institutional crises and of social distancing resulting from COVID-19.
Communication. Mass media, Social sciences (General)
Echoes of Haunted Memories and Nightmares: Understanding Trauma in Gurnah's Afterlives
Harzat Abbas, Asif Abbas, Asim Iqbal
et al.
This research paper investigates the exploration of trauma in Abdulrazak Gurnah's novel "Afterlives" (2020), examining the profound impact of historical and personal traumas on the characters, particularly the protagonist Hamza. The research adopts a qualitative paradigm and incorporates primary and secondary sources to analyze the text comprehensively using trauma theory. Literature, as a dominant medium, reflects human experience, with trauma emerging as a pervasive theme that merges stories of suffering and self-discovery. The analysis moves beyond treating trauma as a mere narrative device, revealing it as a tangible force shaping the characters' lives. Memories and nightmares in the novel are depicted as echoes of a haunting past, challenging Hamza's sense of self and resilience. The study concludes that "Afterlives" stands out as an exceptional portrayal of trauma in literature, emphasizing its long-lasting impact on the human psyche. Suggestions include a comparative analysis with similar works, an exploration of postcolonial perspectives in Gurnah's literature, and an examination of the healing mechanisms portrayed in the aftermath of trauma. Ultimately, the research adds to a broader comprehension of literary trauma, emphasizing its relevance in shaping human experiences and promoting empathy, kindness, and solidarity in adversity.
English literature, Language. Linguistic theory. Comparative grammar
Tributo a Leonor Lopes Fávero: o primeiro momento da linguística textual no Brasil
Dieli Vesaro Palma, Thiago Zilio Passerini
This article aims to detail the main contributions of Leonor Lopes Fávero to linguistic studies in Brazil, more specifically those related to text linguistics. To this end, the time frame of 1980 to 1986 was established, corresponding approximately to the first moment of Brazilian text linguistics as delimited by Koch (1999). As for the analytical perspective adopted, the study drew on the assumptions of linguistic historiography postulated above all by Koerner (2014) and Swiggers (2012). The selected corpus consisted of texts that circulated within the established period, including articles, conference proceedings, books, and book chapters. As epi-historiographical material, the contributions of Bentes (2001), Fávero (2017, 2019, 2021), Galembeck (2015), and Koch (1997, 1999, 2003) were mainly used. The results of the analysis showed the author's relevance for the period in question, with regard both to the introduction and to the development of textual linguistic studies in Brazil.
Language and Literature, English language
Quantifying the Roles of Visual, Linguistic, and Visual-Linguistic Complexity in Verb Acquisition
Yuchen Zhou, Michael J. Tarr, Daniel Yurovsky
Children typically learn the meanings of nouns earlier than the meanings of verbs. However, it is unclear whether this asymmetry is a result of complexity in the visual structure of categories in the world to which language refers, the structure of language itself, or the interplay between the two sources of information. We quantitatively test these three hypotheses regarding early verb learning by employing visual and linguistic representations of words sourced from large-scale pre-trained artificial neural networks. Examining the structure of both visual and linguistic embedding spaces, we find, first, that the representation of verbs is generally more variable and less discriminable within domain than the representation of nouns. Second, we find that if only one learning instance per category is available, visual and linguistic representations are less well aligned in the verb system than in the noun system. However, in parallel with the course of human language development, if multiple learning instances per category are available, visual and linguistic representations become almost as well aligned in the verb system as in the noun system. Third, we compare the relative contributions of factors that may predict learning difficulty for individual words. A regression analysis reveals that visual variability is the strongest factor that internally drives verb learning, followed by visual-linguistic alignment and linguistic variability. Based on these results, we conclude that verb acquisition is influenced by all three sources of complexity, but that the variability of visual structure poses the most significant challenge for verb learning.
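One common way to quantify the visual-linguistic alignment this abstract refers to is representational similarity analysis between the two embedding spaces; the sketch below uses random matrices as stand-ins and is an assumption about method, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words = 50
visual = rng.normal(size=(n_words, 512))      # stand-in visual embeddings
linguistic = rng.normal(size=(n_words, 300))  # stand-in word embeddings

def pairwise_similarities(X):
    """Cosine similarities for all word pairs, upper triangle only."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    return S[np.triu_indices(len(S), k=1)]

# Alignment: how much the two spaces agree on which words are similar.
alignment = np.corrcoef(pairwise_similarities(visual),
                        pairwise_similarities(linguistic))[0, 1]
print(f"visual-linguistic alignment (RSA correlation): {alignment:.3f}")
```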
LFTK: Handcrafted Features in Computational Linguistics
Bruce W. Lee, Jason Hyung-Jong Lee
Past research has identified a rich set of handcrafted linguistic features that can potentially assist various tasks. However, their extensive number makes it difficult to effectively select and utilize existing handcrafted features. This difficulty is compounded by inconsistent implementations across research works and the absence of a categorization scheme or generally accepted feature names, which creates unwanted confusion. Also, most existing handcrafted feature extraction libraries are not open-source or not actively maintained. As a result, a researcher often has to build such an extraction system from the ground up. We collect and categorize more than 220 popular handcrafted features grounded in past literature. Then, we conduct a correlation analysis study on several task-specific datasets and report the potential use cases of each feature. Lastly, we devise a multilingual handcrafted linguistic feature extraction system in a systematically expandable manner. We open-source our system for public access to a rich set of pre-implemented handcrafted features. Our system is coined LFTK and is the largest of its kind. Find it at github.com/brucewlee/lftk.
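A usage sketch based on the project's README at the linked repository; the class name and feature keys shown here ("t_word", "a_word_ps", "n_noun") are assumptions that may differ across versions, so verify them against the current documentation.

```python
import spacy
import lftk

# Assumes the spaCy model en_core_web_sm is installed.
nlp = spacy.load("en_core_web_sm")
doc = nlp("I love research, but it is so difficult at times.")

extractor = lftk.Extractor(docs=doc)
# Assumed feature keys from the README: total words, average words
# per sentence, and total nouns.
feats = extractor.extract(features=["t_word", "a_word_ps", "n_noun"])
print(feats)
```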
A blind spot for large language models: Supradiegetic linguistic information
Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer
et al.
Large Language Models (LLMs) like ChatGPT reflect profound changes in the field of Artificial Intelligence, achieving a linguistic fluency that is impressively, even shockingly, human-like. The extent of their current and potential capabilities is an active area of investigation by no means limited to scientific researchers. It is common for people to frame the training data for LLMs as "text" or even "language". We examine the details of this framing using ideas from several areas, including linguistics, embodied cognition, cognitive science, mathematics, and history. We propose that considering what it is like to be an LLM like ChatGPT, as Nagel might have put it, can help us gain insight into its capabilities in general, and in particular, that its exposure to linguistic training data can be productively reframed as exposure to the diegetic information encoded in language, and its deficits can be reframed as ignorance of extradiegetic information, including supradiegetic linguistic information. Supradiegetic linguistic information consists of those arbitrary aspects of the physical form of language that are not derivable from the one-dimensional relations of context -- frequency, adjacency, proximity, co-occurrence -- that LLMs like ChatGPT have access to. Roughly speaking, the diegetic portion of a word can be thought of as its function, its meaning, as the information in a theoretical vector in a word embedding, while the supradiegetic portion of the word can be thought of as its form, like the shapes of its letters or the sounds of its syllables. We use these concepts to investigate why LLMs like ChatGPT have trouble handling palindromes, the visual characteristics of symbols, translating Sumerian cuneiform, and continuing integer sequences.
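A tiny illustration (my example, not the authors') of why palindromes sit on the supradiegetic side: deciding palindromy requires access to a word's character-level form, which is exactly the information that co-occurrence statistics over opaque tokens do not provide.

```python
def is_palindrome(word: str) -> bool:
    """Needs the word's form (its characters), not its meaning."""
    s = word.lower()
    return s == s[::-1]

print(is_palindrome("level"), is_palindrome("language"))   # True False
```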
Axiology of Imitative Communication: Formulation of the Problem
Anton N. Fortunatov
The article explores the hypothesis that modern communication is taking on a special form today. It can be called imitative communication, in which information processing technologies play a very important role. The author traces the origin of this communication to the commercial, consumer status of social information, which in the early stages of the emergence of media reality contributed to the development of market relations. Thanks to the development of technology, the quality of the imitation of reality becomes an independent factor influencing the deconstruction of the person in the information process. In these conclusions, the author relies on his own long-term studies of the history of television. It was this mass medium that laid the foundations of imitative reality, severing the perception of reality from its real development and laying the groundwork for the technological identification of the person at the expense of traditional personal effort. Imitative reality is connected with the phenomenon of the game. The author carries out a phenomenological rethinking of the game in comparison with its classical definitions. The game appears as a form of detecting the ontological boundaries within which the personality is located. With today's development of technology, however, the game is turning into a way of replacing reality, losing its original status as an identifier of social meanings. Under this interpretation, many currently unresolved problems, for example the problem of computer addiction, acquire the features of distinct social deviations that have specific means of treatment. As a perspective for the development of the existing situation, the author proposes the reditization (return) of the individual to reality, while virtual reality should acquire the status of another utopia in the history of mankind.
Philology. Linguistics, Philosophy (General)
How does the pre-training objective affect what large language models learn about linguistic properties?
Ahmed Alajrami, Nikolaos Aletras
Several pre-training objectives, such as masked language modeling (MLM), have been proposed to pre-train language models (e.g. BERT) with the aim of learning better language representations. However, to the best of our knowledge, no previous work has investigated how different pre-training objectives affect what BERT learns about linguistic properties. We hypothesize that linguistically motivated objectives such as MLM should help BERT to acquire better linguistic knowledge than non-linguistically motivated objectives, for which the association between the input and the label to be predicted is unintuitive or hard for humans to guess. To this end, we pre-train BERT with two linguistically motivated objectives and three non-linguistically motivated ones. We then probe for linguistic characteristics encoded in the representations of the resulting models. We find strong evidence that there are only small differences in probing performance between the representations learned by the two types of objectives. These surprising results question the dominant narrative of linguistically informed pre-training.
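For readers unfamiliar with probing, here is a minimal sketch of the general technique, with random arrays standing in for frozen model representations and labels; it shows the shape of the analysis, not the paper's specific probing tasks or datasets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 768
reps = rng.normal(size=(n, d))          # stand-in for frozen BERT states
labels = rng.integers(0, 2, size=n)     # e.g., a binary syntactic property

# A lightweight probe: if it classifies well from frozen representations,
# the property is (linearly) encoded in them.
probe = LogisticRegression(max_iter=1000).fit(reps[:150], labels[:150])
print("probing accuracy:", probe.score(reps[150:], labels[150:]))
```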
CLSE: Corpus of Linguistically Significant Entities
Aleksandr Chuklin, Justin Zhao, Mihir Kale
One of the biggest challenges of natural language generation (NLG) is the proper handling of named entities. Named entities are a common source of grammar mistakes such as wrong prepositions, wrong article handling, or incorrect entity inflection. Without factoring in linguistic representation, such errors are often underrepresented when evaluating on a small set of arbitrarily picked argument values, or when translating a dataset from a linguistically simpler language, like English, to a linguistically complex language, like Russian. However, for some applications, broad grammatical correctness is critical -- native speakers may find entity-related grammar errors silly, jarring, or even offensive. To enable the creation of more linguistically diverse NLG datasets, we release a Corpus of Linguistically Significant Entities (CLSE) annotated by linguist experts. The corpus includes 34 languages and covers 74 different semantic types to support various applications from airline ticketing to video games. To demonstrate one possible use of CLSE, we produce an augmented version of the Schema-Guided Dialog Dataset, SGD-CLSE. Using the CLSE's entities and a small number of human translations, we create a linguistically representative NLG evaluation benchmark in three languages: French (high-resource), Marathi (low-resource), and Russian (highly inflected language). We establish quality baselines for neural, template-based, and hybrid NLG systems and discuss the strengths and weaknesses of each approach.
Introduction to Linguistics
Aleksandr Blinov, Yuriy Rozhdestvenskiy, Y. Marchuk
et al.
The textbook is based on the lecture course "Fundamentals of Linguistics", taught at the Faculty of Philology of Lomonosov Moscow State University. The book introduces students to the system of concepts and terms used by any philological discipline. The purpose of the textbook is to give students the theoretical grounding needed to learn languages and to help them master languages in practice. The textbook introduces students to the range of problems that are further generalized in the courses "General Linguistics", "Theory of Language", and "History of Linguistic Teachings". It meets the requirements of the latest generation of federal state educational standards of higher education. For students of philological faculties of higher educational institutions studying in the degree programme 45.03.01 "Philology" (bachelor's degree).
Philosophical and methodological bases of cut and sentence interpretation in formal linguistics
M. Bolibekova, Noila A. Abdurakhimova, Adiba S. Toshtemirova
et al.
It remains to answer the question of which of these units - logical units or linguistic units - is scientifically older. Science has already answered this question: linguistics (more broadly, philology) was separated from the structure of philosophy only in the eighteenth century, that is, linguistic definitions were copied from philosophy, from logic. This can be supported by a further fact. This section of our work can be concluded with the following general conclusion: in Uzbek formal linguistics, the interpretation of the position of the cut in the sentence structure was given not on the basis of the internal (native, ontological) features of the Uzbek language, but on the basis of the understanding of the cut in Russian formal linguistics. In formal linguistics itself, the interpretation of the cut is based on the understanding of the logical predicate.