Building a Strong Instruction Language Model for a Less-Resourced Language
Domen Vreš, Tjaša Arčon, Timotej Petrič, et al.
Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to the Slovene language using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pretraining tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena. We show that the described model outperforms 12B Gemma 3 across all three scenarios and performs comparably to the much larger commercial GPT-4o in the Slovene LLM arena, achieving a win rate of over 60%.
Linguistic Changes of the Kurdish Language on Social Media Platforms
Rozhan Qadir Abubakir
This research examines the transformations within the Kurdish language across social media platforms, focusing specifically on Facebook, Instagram, and TikTok. Utilizing a descriptive-analytical approach and drawing upon sociolinguistic frameworks such as linguistic diversity and code-switching, the study reveals that social media has fostered a new digital dialect termed "Digital Kurdish." The findings indicate that 55% of vocabulary is abbreviated and 70% of messages contain emojis, leading to a decline in lexical expressive capacity and a neglect of grammatical rules. Furthermore, the widespread use of foreign loanwords, shorthand writing, and the mixing of scripts (Arabic and Latin) poses a significant challenge to the unity and stability of the standard language. The study confirms that digitalization presents both opportunities and risks: while it revitalizes the language and bridges communication gaps between users, it simultaneously threatens cultural linguistic identity. Consequently, the research proposes the establishment of a "Soft Digital Standard" theory consisting of three levels: a unified foundational level for orthography and basic grammar; a flexible level to accommodate shorthand and emojis; and a technical support level integrated through Unicode fonts and AI-powered keyboards. This framework aims to preserve the Kurdish language within the digital sphere and facilitate its evolution into a global online medium. Keywords: Kurdish Language, Social Media, Linguistic Change, Digital Kurdish, Standardization.
Language Scaffolding Strategies Based on Vygotsky's Zone of Proximal Development in Early Reading Education: A Cross-Cultural Comparison
Lei Wu, Kamariah Abu Bakar
This study presents a comparative analysis of language scaffolding strategies based on Vygotsky's Zone of Proximal Development theory in early reading education across China, the United States, and the United Kingdom. Through a systematic review of 78 studies published between 2014 and 2024, including 27 Chinese studies, 31 American studies, and 20 British studies, the research examines how scaffolding practices are implemented within each country's distinct cultural and educational context. The methodology employed a multi-dimensional analytical framework encompassing theoretical conceptualization, pedagogical implementation, cultural-linguistic adaptation, and effectiveness assessment to ensure comprehensive cross-cultural comparison. The analysis reveals significant variations in scaffolding approaches: Chinese practices emphasize structured teacher guidance tailored to character acquisition and morphological awareness development, reflecting Confucian educational traditions and examination-oriented culture. American strategies prioritize evidence-based methods within multi-tiered support frameworks, incorporating technological integration and individualized assessment protocols. British approaches integrate systematic progression with teacher autonomy in diverse multicultural contexts, demonstrating sustained intervention effectiveness and culturally responsive implementation strategies. The findings demonstrate how educational policy structures, linguistic characteristics, teacher preparation models, and cultural values fundamentally shape scaffolding implementation patterns. Cross-cultural triangulation reveals both universal principles and context-specific adaptations in applying Vygotskian theory to early literacy instruction. The study concludes by proposing evidence-informed adaptations for Chinese early reading education that balance explicit instruction with student-centered approaches while preserving cultural strengths in structured progression and collective learning orientations, contributing to international understanding of culturally responsive scaffolding practices.
NAVIGATING LINGUISTIC LANDSCAPES: THE INTERPLAY OF TRADITIONAL, HISTORICAL, STRUCTURAL, GENERATIVE, AND FUNCTIONAL LINGUISTICS
Donya Nasrollahi, Maryam Beiki
This paper explores the foundational principles and contributions of five major schools of linguistics: Traditional Linguistics, Historical Linguistics, Structural Linguistics, Generative Linguistics, and Functional Linguistics. Traditional Linguistics emphasizes the prescriptive and normative analysis of language, often grounded in classical grammar. Historical Linguistics investigates the diachronic development of languages, providing insights into language change and evolution. Structural Linguistics, pioneered by Ferdinand de Saussure, prioritizes the synchronic study of language structures, introducing the concepts of signifier and signified. Generative Linguistics, initiated by Noam Chomsky, focuses on the innate mechanisms of language acquisition and the formal systems underlying linguistic competence. Finally, Functional Linguistics examines the role of context and pragmatics in shaping linguistic forms and functions. The paper highlights the interconnectivity and distinctions among these paradigms, emphasizing their enduring relevance in linguistic theory and application. Curriculum designers and policymakers might benefit from a multifaceted understanding of linguistics that combines insights from various schools to create more effective language education programs and promote language preservation efforts.
Embodiment in Metaphor and Imagery
Zafar Abbas
This article investigates the construction of poetic meaning through two competing theoretical frameworks: the classical Aristotelian concept of mimesis (imitation) and the contemporary theory of cognitive mimesis. By contrasting these paradigms, we reveal fundamental divergences in their mechanisms of meaning-making. Our analysis begins with concrete examples (embodied metaphor and imaginative projection) to demonstrate how each theory interprets poetic language. We then trace the historical development of meaning in poetic theory, from Aristotle’s emphasis on representation to modern cognitive approaches that prioritize embodied experience. This progression highlights how poetic meaning has evolved into an interdisciplinary nexus, bridging literary criticism, rhetoric, poetics, and cognitive neuroscience. Cognitive poetics, in particular, reframes poetic imagery through models of induction, generalization, and implicature (implied meaning). These models treat the image not as a static symbol but as a dynamic construct shaped by the reader’s imagination and perceptual grounding. Central to this approach is the principle of embodiment, which ties linguistic meaning to sensory-motor experiences and challenges traditional binaries between figurative and literal language. By examining how images emerge from and interact with everyday speech, cognitive poetics reveals the deep cognitive underpinnings of poetic effect.
Language. Linguistic theory. Comparative grammar
The Paradoxical Conservatism of Aki Kaurismäki: Work, Gender, Love, and Vitality in Fallen Leaves (Hojas de otoño)
Nicolas Lema Habash, Giovana Suárez Ortiz
We propose a study of Aki Kaurismäki's film Fallen Leaves (Hojas de otoño, 2023) as a cinematic reflection on how it is possible to persevere affectively in a contemporary world dominated by the logic of extracting surplus value through wage labor. After contextualizing the film within Kaurismäki's body of work, in which a reflection on the contemporary world of labor stands out, we argue that his cinema revalues romantic love. Although certainly staged through traditional heterosexual motifs, it is a form of romantic love that implies a disruptive attempt to find a kind of affective refuge amid a structure of society and the world that tends toward the annihilation of vitality. Hence we propose the idea of a “paradoxical conservatism” in the motifs of romantic love in Kaurismäki: paradoxical, because the traditional forms of love through which Fallen Leaves thinks imply forms of affective empowerment and not simply the maintenance of the status quo.
History of scholarship and learning. The humanities, Philology. Linguistics
The Dialogic Dimension in Lorenzo Valla's Refutation of the Donation of Constantine
Mariano Vilar
This article analyzes Lorenzo Valla's De falso credita et ementita Constantini donatione in terms of its links with the dialogic genre that he cultivated in works such as De vero bono, De libero arbitrio, and De professione religiosorum. Although structured as a forensic oratio, the text incorporates devices characteristic of the humanist disputatio (prosopopoeia, apostrophe, enargeia, concessio) that turn it into a “dialogue in absentia” with historical and imaginary adversaries. The study shows how Valla constructs plausible scenes to highlight the implausibility of the Constitutum Constantini, personifies figures such as the forger “Palea” to expose their ignorance or hypocrisy, and uses concessio as a dialectical trap that reinforces the refutation. These strategies serve more than a stylistic function: they shape a “dialogic criticism” in which philology becomes performative, revives the past, and strips texts and institutions of authority. The article concludes that this work combines forensic rhetoric and dialogic dramatization to transform philological criticism into an act of intellectual emancipation, affirming the primacy of reason and evidence over all hierarchy.
Medieval history, Style. Composition. Rhetoric
Towards Fundamental Language Models: Does Linguistic Competence Scale with Model Size?
Jaime Collado-Montañez, L. Alfonso Ureña-López, Arturo Montejo-Ráez
Large Language Models offer impressive language capabilities but suffer from well-known limitations, including hallucinations, biases, privacy concerns, and high computational costs. These issues are largely driven by the combination of linguistic competence and factual memorization within a single monolithic model. This paper introduces and empirically supports the Fundamental Language Model (FLM) paradigm, which advocates for smaller, linguistically competent models that offload factual retrieval to external tools. We evaluate models ranging from 135M to 32B parameters across three dimensions: linguistic competence, external factual knowledge, and internal factual knowledge. Our findings reveal that while both linguistic competence and factual knowledge improve with scale, internal factual knowledge grows significantly faster, suggesting that model size is more closely tied to memorization than to core language ability. These results support a modular approach to language modeling, where compact, linguistically proficient models serve as the foundation for tool-augmented systems. The FLM paradigm offers a path toward more efficient, interpretable, and sustainable NLP solutions.
Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages
Xabier de Zuazo, Eva Navas, Ibon Saratxaga, et al.
Automatic speech recognition systems have undoubtedly advanced with the integration of multilingual and multitask models such as Whisper, which have shown a promising ability to understand and process speech across a wide range of languages. Despite their robustness, these models often fall short in handling the linguistic distinctions of minority languages. This study addresses this gap by integrating traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages. Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. Our approach not only takes advantage of the extensive data Whisper was pre-trained on, but also complements its linguistic adaptability by incorporating language models. We obtained improvements of up to 51% for in-distribution datasets and up to 34% for out-of-distribution sentences using statistical language models, while large language models provided moderate but consistently robust improvement across diverse linguistic contexts. The findings reveal that, while the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters. Finally, we emphasize the importance of selecting appropriate evaluation parameters when reporting the results using transformer-based ASR models. In summary, this research clears the way for more inclusive ASR technologies that perform better across languages by enriching their linguistic knowledge. For further implementation details of this study, the technical documentation and source code are available at http://www.github.com/hitz-zentroa/whisper-lm.
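For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate (WER) and a relative improvement over a baseline can be computed. It is not taken from the whisper-lm repository: the function names `word_error_rate` and `relative_improvement` are hypothetical, and the abstract does not state whether its reported percentages are relative or absolute reductions, so the relative form here is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which the new WER is lower than the baseline (assumed relative)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

Under this reading, a system that lowers WER from 0.20 to 0.10 shows a relative improvement of 0.5, that is, 50%.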
Are formal and functional linguistic mechanisms dissociated in language models?
Michael Hanna, Yonatan Belinkov, Sandro Pezzelle
Although large language models (LLMs) are increasingly capable, these capabilities are unevenly distributed: they excel at formal linguistic tasks, such as producing fluent, grammatical text, but struggle more with functional linguistic tasks like reasoning and consistent fact retrieval. Inspired by neuroscience, recent work suggests that to succeed on both formal and functional linguistic tasks, LLMs should use different mechanisms for each; such localization could either be built-in or emerge spontaneously through training. In this paper, we ask: do current models, with fast-improving functional linguistic abilities, exhibit distinct localization of formal and functional linguistic mechanisms? We answer this by finding and comparing the "circuits", or minimal computational subgraphs, responsible for various formal and functional tasks. Comparing 5 LLMs across 10 distinct tasks, we find that while there is indeed little overlap between circuits for formal and functional tasks, there is also little overlap between formal linguistic tasks, as exists in the human brain. Thus, a single formal linguistic network, unified and distinct from functional task circuits, remains elusive. However, in terms of cross-task faithfulness - the ability of one circuit to solve another's task - we observe a separation between formal and functional mechanisms, suggesting that shared mechanisms between formal tasks may exist.
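The notion of overlap between circuits can be illustrated with a small sketch. The abstract does not say how overlap is measured, so the choice below (Jaccard similarity over sets of edges) is an assumption, and the function name `jaccard_overlap` and the component labels are hypothetical.

```python
from typing import Set, Tuple

# A circuit is modeled here as a set of directed edges between named
# model components; the labels used below are invented for illustration.
Edge = Tuple[str, str]


def jaccard_overlap(circuit_a: Set[Edge], circuit_b: Set[Edge]) -> float:
    """Shared edges divided by all edges appearing in either circuit."""
    union = circuit_a | circuit_b
    if not union:
        return 0.0  # two empty circuits: define overlap as zero
    return len(circuit_a & circuit_b) / len(union)


formal_task = {("attn0.h1", "mlp1"), ("mlp1", "attn2.h3"), ("attn2.h3", "logits")}
functional_task = {("mlp1", "attn2.h3"), ("attn2.h3", "logits"), ("attn1.h0", "mlp2")}
overlap = jaccard_overlap(formal_task, functional_task)  # 2 shared / 4 total
```

A score near 0 would correspond to the "little overlap" the abstract reports, while a score near 1 would indicate that two tasks rely on largely the same subgraph.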
Linguistic Interpretability of Transformer-based Language Models: a systematic review
Miguel López-Otal, Jorge Gracia, Jordi Bernad, et al.
Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little is known about how their internal computations help them achieve their results. This renders these models, as of today, a type of 'black box' system. There is, however, a line of research -- 'interpretability' -- aiming to learn how information is encoded inside these models. More specifically, there is work dedicated to studying whether Transformer-based models possess knowledge of linguistic phenomena similar to human speakers -- an area we call 'linguistic interpretability' of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models -- including multilingual ones -- that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either does not focus on linguistic knowledge in these models or presents some limitations -- e.g. only studying English-based models. Our survey also focuses on Pre-trained Language Models not further specialized for a downstream task, with an emphasis on works that use interpretability techniques that explore models' internal representations.
LinguaLens: Towards Interpreting Linguistic Mechanisms of Large Language Models via Sparse Auto-Encoder
Yi Jing, Zijun Yao, Hongzhu Guo, et al.
Large language models (LLMs) demonstrate exceptional performance on tasks requiring complex linguistic abilities, such as reference disambiguation and metaphor recognition/generation. Although LLMs possess impressive capabilities, their internal mechanisms for processing and representing linguistic knowledge remain largely opaque. Prior research on linguistic mechanisms is limited by coarse granularity, limited analysis scale, and narrow focus. In this study, we propose LinguaLens, a systematic and comprehensive framework for analyzing the linguistic mechanisms of large language models, based on Sparse Auto-Encoders (SAEs). We extract a broad set of Chinese and English linguistic features across four dimensions (morphology, syntax, semantics, and pragmatics). By employing counterfactual methods, we construct a large-scale counterfactual dataset of linguistic features for mechanism analysis. Our findings reveal intrinsic representations of linguistic knowledge in LLMs, uncover patterns of cross-layer and cross-lingual distribution, and demonstrate the potential to control model outputs. This work provides a systematic suite of resources and methods for studying linguistic mechanisms, offers strong evidence that LLMs possess genuine linguistic knowledge, and lays the foundation for more interpretable and controllable language modeling in future research.
Linguistic Blind Spots of Large Language Models
Jiali Cheng, Hadi Amiri
Large language models (LLMs) are the foundation of many AI applications today. However, despite their remarkable proficiency in generating coherent text, questions linger regarding their ability to perform fine-grained linguistic annotation tasks, such as detecting nouns or verbs, or identifying more complex syntactic structures like clauses in input texts. These tasks require precise syntactic and semantic understanding of input text, and when LLMs underperform on specific linguistic structures, it raises concerns about their reliability for detailed linguistic analysis and whether their (even correct) outputs truly reflect an understanding of the inputs. In this paper, we empirically study the performance of recent LLMs on fine-grained linguistic annotation tasks. Through a series of experiments, we find that recent LLMs show limited efficacy in addressing linguistic queries and often struggle with linguistically complex inputs. We show that the most capable LLM (Llama3-70b) makes notable errors in detecting linguistic structures, such as misidentifying embedded clauses, failing to recognize verb phrases, and confusing complex nominals with clauses. Our results provide insights to inform future advancements in LLM design and development.
Multiple grammars within linguistic populations
M. Polinsky
This paper explores the concept of multiple grammars (MGs) and their implications for linguistic theory, language acquisition, and bilingual language knowledge. Drawing on evidence from phenomena such as scope interactions, verb raising, and agreement patterns, I argue that seemingly identical surface structures can be undergirded by different grammatical analyses that may compete within speaker populations. I then propose a typology of MG distributions, including shared MGs, competing MGs, and partial MGs, each with distinct consequences for acquisition and use. Contrary to expectations of simplification, bilingualism can sometimes lead to an expansion of grammatical analyses and does not always lead to the elimination of MGs. The paper discusses methods for predicting environments conducive to MGs, considering factors such as structural ambiguity and silent elements. The examination of MGs compels us to explore how learners navigate underdetermined input, especially in bilingual contexts, and to examine the interplay between gradient acceptability judgments and categorical grammatical distinctions. The study of MGs offers valuable insights into language variation, change, and the nature of linguistic competence.
THE AGE FACTOR IN LANGUAGE DEVELOPMENT THEORIES AND ITS IMPLICATIONS FOR ACQUISITION AND TEACHING
L. Jovanović
This research investigates the progression of grammatical complexity in English language textbooks designed for various age groups, ranging from young children to adults. By analyzing four textbooks—Smiles (ages 6-7), Wider World 1 (ages 11-12), Solutions (ages 14-15), and English File (ages 18+)—the study explores how linguistic theories, particularly the Innate Hypothesis and the Critical Period Hypothesis, are implemented in practical language teaching. A comparative method was used to evaluate the curricula of these textbooks, focusing on morphology, syntax, and phonology. The findings reveal a clear increase in grammatical complexity, corresponding to the learners’ cognitive and linguistic development stages. For instance, Smiles introduces basic grammatical structures and phonetics suitable for young children, while English File covers advanced grammar and detailed phonological content for adult learners. This study highlights the importance of aligning language instruction with learners’ developmental stages and provides insights for educators to adapt their teaching strategies accordingly. The results support the claim that early exposure to language, as well as the adaptation of teaching methods to different age groups, enhances language acquisition.
"Science of the Russian language": Nikolai Gavrilovich Kurganov and his "Pismovnik" (to mark the tercentenary of the enlightener’s birth)
O. Nikitin
This article analyses the linguistic experiment of the gifted 18th-century thinker N. G. Kurganov, namely "Pismovnik" (language and literature textbook for self-study), which became a breakthrough in linguodidactics. The research examines the structure and unusual content of the book, which combined the genres of a Russian language textbook and a chrestomathy, i.e., "verbal description". The unique composition of the textbook enabled readers to effortlessly master the basics of grammar while simultaneously becoming familiar with the literary, philosophical, and spiritual traditions. The study highlights the author’s original method of interpreting linguistic material and presenting it to readers. Additionally, the techniques N. G. Kurganov employed to describe literary and scientific texts, as well as the distinctive features of his use of lexical means, are characterised. Moreover, the findings indicate the epistemological nature of 18th-century educational knowledge. The study employed descriptive, comparative-historical, textual, stylistic and sociolinguistic analysis. The important practical applications of the "Pismovnik" as a linguistic thought landmark in philological theory and practice are emphasised. The article concludes that N. G. Kurganov’s book has heuristic potential and can be used in schools and universities as a source for historical lexicology and scientific discourse studies.
Conceptualization and Metonymy: Cognitive Mechanisms of Meaning Construction in Language
Dilber Orujova
This paper examines how conceptualization and metonymy jointly underpin meaning construction in language within the framework of Cognitive Linguistics. Conceptualization is treated as the dynamic, construal-based, and embodied process through which linguistic expressions prompt mental representations, while metonymy is argued to be a domain-internal access operation that provides cognitively economical routes to complex conceptual structures. Integrating insights from Cognitive Grammar, Frame Semantics, and Conceptual Integration Theory, the study shows how reference-point relations, frame activation, and local compressions in blended spaces account for familiar patterns such as AUTHOR→WORK (“She reads Shakespeare”) and PLACE→INSTITUTION (“The White House issued a statement”). The analysis highlights the roles of salience, profiling, and cultural models in stabilizing metonymic preferences across languages and registers, and it outlines operational diagnostics for distinguishing metonymy from metaphor in corpus data. We argue that metonymy is not a peripheral stylistic device but a fundamental cognitive mechanism embedded in embodied experience and socio-cultural practice. The paper concludes with implications for linguistic theory, cross-linguistic variation, corpus methodology, and applied domains including translation, lexicography, and natural language understanding.
Restoring the Aboriginal Voice: Language and Identity in The Chant of Jimmie Blacksmith
Pengkang Liu, Norhanim Abdul Samat, Faraha Hamidi
This paper examines how Thomas Keneally's The Chant of Jimmie Blacksmith engages with the restoration of Aboriginal voice and identity within the context of Australian colonial discourse. Specifically, the novel addresses the historical silencing of Aboriginal people and highlights the tension between imposed European language and enduring Aboriginal oral traditions. The central objective of this study is to explore how linguistic hybridity, particularly Aboriginal English and ceremonial chanting, functions as a tool of identity assertion and resistance against colonial authority. Employing close textual analysis informed by postcolonial theory, the paper draws on Said's concept of colonial discourse, Bhabha's notion of hybridity, and Ashcroft's idea of linguistic resistance. This analysis demonstrates how Aboriginal English, with its hybridised grammar and vocabulary, and chants rooted in kinship and cultural memory operate subversively within the English novel form. Furthermore, the study situates Keneally's novel alongside Indigenous-authored works such as Alexis Wright's Carpentaria and Kim Scott's Benang: From the Heart. This comparative perspective underscores both the contributions and the limits of settler-authored attempts to “restore” Aboriginal voice, contrasting them with Indigenous narrative sovereignty that reconstitutes English from within Aboriginal epistemologies. The study concludes that while Keneally's novel functions as a counter-narrative that unsettles colonial silencing, its mediation through settler authorship leaves it marked by ambivalence. Ultimately, the enduring reclamation of Aboriginal voice and identity in literature is most powerfully enacted in Indigenous-authored narratives, which reshape language, memory, and representation on their own terms. Highlights: Aboriginal voice restoration (it examines how The Chant of Jimmie Blacksmith restores Aboriginal voice and identity within Australian colonial discourse). Linguistic hybridity (it demonstrates how linguistic hybridity functions as a means of survival and identity assertion). Counter-narrative discourse (it argues that the novel operates as a counter-narrative, asserting Aboriginal agency and reframing dominant historical accounts).
EXPLORING THE CRITICAL PERIOD HYPOTHESIS: PERSPECTIVES, THEORIES, AND IMPLICATIONS FOR LANGUAGE ACQUISITION
Mirhat Aliu
The purpose of this research is to describe the Critical Period Hypothesis. The researcher considers the views of various authors and what they have emphasized regarding the hypothesis. The Critical Period Hypothesis states that language acquisition becomes significantly more challenging and ultimately less successful after the first few years of life, which is the period when language develops most easily. This typically occurs between the ages of five and puberty. However, this does not imply that every researcher agrees, or should agree, with this theory. Noam Chomsky, one of the most well-known linguists of the twentieth century, founded his linguistic theories on certain philosophical ideas. Transformational Generative Grammar, based on mentalist philosophy, represents his primary contribution to linguistics. According to his Innateness Hypothesis, developed in the context of language learning, children are born with an inherent knowledge of the basic rules of grammar. Despite the complexity of the process, this innate knowledge aids children in acquiring their native language naturally and methodically. During a child's developmental stage, language acquisition is considered one of the most challenging processes. From the first few weeks of life until the mastery of grammar, the process of first language acquisition is carefully examined. The question of whether children learn their native language naturally or through significant effort remains a topic of considerable debate. Despite not being universally accepted, the Innateness Hypothesis remains the most compelling theory to explain how children acquire language. What, then, is the relevance of the critical period concept for adult learners as opposed to children? This, along with other important questions related to the issue, is addressed in the following sections.
Syntax and Emotion in Theatre Performance Dialogue: A Cognitive-Linguistic Analysis
Xiaoyu Liu, Marlenny Bt Deenerwan, Shahnaz Binti Mohd Baldev Shah
Emotion in Chinese opera is expressed through detailed sentence structures, but grammar has received little study in this context. The researcher examines how emotions in Peking and Kunqu operas are influenced by interrogatives, exclamatives, ellipses and modal particles using 200 annotated scripts. The framework uses syntactic parsing (with spaCy), lexicons (NRC and LIWC) that have been adapted for opera, gesture-syntax mapping (from 150 scenes) and conceptual metaphor theory. The analysis shows that syntactic complexity is the strongest predictor of emotional strength (β = 0.58, p < 0.001), with modal particles the second strongest (β = 0.34, p < 0.01). Peking Opera often uses questions (28.7%), which are associated with anger (φ = 0.72), whereas Kunqu most often relies on ellipsis (32.1%), which is associated with sadness (φ = 0.65). In 68.7% of anger scenes, the interrogative is accompanied by finger-pointing. In a comparison with Shakespeare, Peking Opera uses exclamative-surprise much more often, with a residual of 4.81. Conceptual metaphors (for example, “ANGER IS VERTICALITY”) link linguistic expression with embodied experience. The framework offers a way to scale up such analysis, advancing digital heritage, cultural linguistics and emotion-aware NLP, so that Chinese opera becomes a key focus of research connecting language, culture and emotion.