Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Michael Y. Hu, Jackson Petty, Chuan Shi
et al.
Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when two conditions are met: the formal language should capture the dependency structures present in natural language, and it should remain within the computational limitations of the model architecture. We experiment with pre-pretraining (training on formal language before natural languages) on transformers and find that formal languages capturing hierarchical dependencies indeed enable language models to achieve lower loss on natural language and better linguistic generalization compared to other formal languages. We also find modest support for the hypothesis that the formal language should fall within the computational limitations of the architecture. Strikingly, pre-pretraining reduces loss more efficiently than training on a matched amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. Finally, we also give mechanistic evidence of transfer from formal to natural language: attention heads acquired during pre-pretraining remain crucial for the model's performance on syntactic evaluations.
Limited-Resource Adapters Are Regularizers, Not Linguists
Marcell Fekete, Nathaniel R. Robinson, Ernests Lavrinovics
et al.
Cross-lingual transfer from related high-resource languages is a well-established strategy to enhance low-resource language technologies. Prior work has shown that adapters show promise for, e.g., improving low-resource machine translation (MT). In this work, we investigate an adapter souping method combined with cross-attention fine-tuning of a pre-trained MT model to leverage language transfer for three low-resource Creole languages, which exhibit relatedness to different language groups across distinct linguistic dimensions. Our approach improves performance substantially over baselines. However, we find that linguistic relatedness -- or even a lack thereof -- does not covary meaningfully with adapter performance. Surprisingly, our cross-attention fine-tuning approach appears equally effective with randomly initialized adapters, implying that the benefit of adapters in this setting lies in parameter regularization, and not in meaningful information transfer. We provide analysis supporting this regularization hypothesis. Our findings underscore the reality that neural language processing involves many success factors, and that not all neural methods leverage linguistic knowledge in intuitive ways.
Linguistic Predictability and Search Complexity: How Linguistic Redundancy Constraints the Landscape of Classical and Quantum Search
Alessio Di Santo, Gabriella Lanziani
This study examines the quantitative relationship between linguistic regularities and computational search complexity through a hybrid classical-quantum framework applied to Renaissance Italian texts. Using four representative works from the fifteenth and sixteenth centuries, namely Il Principe (Machiavelli), Il Cortegiano (Castiglione), I Ricordi (Guicciardini), and Orlando Furioso (Ariosto), we construct character-based n-gram models under both a historically grounded 25-letter orthography and the full modern Italian alphabet. These models provide corpus-derived probabilistic baselines for evaluating substitution-cipher search processes. Combining classical hill climbing and simulated annealing with Grover-style quantum-inspired estimates and a QUBO annealing formulation, we quantify how the probability that a key produces a linguistically plausible decryption (p_good) relates to expected computational effort. Across cipher lengths from 200 to 1000 characters, empirical results confirm the predicted dependence of Grover oracle calls on 1/sqrt(p_good) and show that longer texts yield sharper score distributions and smaller feasible key regions. Overall, the findings establish a link between linguistic redundancy and search-space contraction, providing an empirical framework for comparing classical, quantum-inspired, and idealized quantum search dynamics under unified corpus-driven constraints.
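The inverse-square-root dependence mentioned in this abstract can be illustrated with a minimal sketch. It assumes the textbook small-probability approximation for Grover search (about (pi/4)/sqrt(p) oracle calls) and a classical random-sampling baseline of 1/p expected trials; the function names and the probability symbol `p_good` are illustrative, and this is not the estimator used by the authors.

```python
import math


def grover_oracle_calls(p_good: float) -> float:
    """Approximate Grover oracle calls, (pi/4)/sqrt(p_good).

    Textbook approximation, valid when p_good is small.
    """
    if not 0.0 < p_good <= 1.0:
        raise ValueError("p_good must lie in (0, 1]")
    return (math.pi / 4.0) / math.sqrt(p_good)


def classical_expected_trials(p_good: float) -> float:
    """Expected number of uniform random key draws until a good key, 1/p_good."""
    if not 0.0 < p_good <= 1.0:
        raise ValueError("p_good must lie in (0, 1]")
    return 1.0 / p_good


if __name__ == "__main__":
    # Hypothetical probabilities, chosen only to show the scaling.
    for p in (1e-2, 1e-4, 1e-6):
        print(p, classical_expected_trials(p), grover_oracle_calls(p))
```

Under these assumptions, shrinking `p_good` by a factor of four doubles the Grover estimate but quadruples the classical one, which is the contrast the abstract points to.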
Out of Style: RAG's Fragility to Linguistic Variation
Tianyu Cao, Neel Bhandari, Akhila Yerukola
et al.
Despite the impressive performance of Retrieval-augmented Generation (RAG) systems across various NLP benchmarks, their robustness in handling real-world user-LLM interaction queries remains largely underexplored. This presents a critical gap for practical deployment, where user queries exhibit greater linguistic variation and can trigger cascading errors across interdependent RAG components. In this work, we systematically analyze how varying four linguistic dimensions (formality, readability, politeness, and grammatical correctness) affects RAG performance. We evaluate two retrieval models and nine LLMs, ranging from 3 to 72 billion parameters, across four information-seeking Question Answering (QA) datasets. Our results reveal that linguistic reformulations significantly impact both retrieval and generation stages, leading to a relative performance drop of up to 40.41% in Recall@5 scores for less formal queries and 38.86% in answer match scores for queries containing grammatical errors. Notably, RAG systems exhibit greater sensitivity to such variations compared to LLM-only generations, highlighting their vulnerability to error propagation due to linguistic shifts. These findings highlight the need for improved robustness techniques to enhance reliability in diverse user interactions. Code is available at https://github.com/Springcty/RAG-fragility-to-linguistic-variation.
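To make the reported quantities concrete, the following is a minimal sketch of how a Recall@k score and a relative performance drop are commonly computed. It assumes Recall@k is the fraction of relevant documents found in the top k results (some benchmarks use a hit-rate variant instead); the function names and the example values are hypothetical and are not taken from the paper or its repository.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int = 5) -> float:
    """Fraction of relevant documents that appear among the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(relevant & set(ranked_ids[:k]))
    return hits / len(relevant)


def relative_drop_percent(original: float, perturbed: float) -> float:
    """Relative drop of the perturbed score with respect to the original, in percent."""
    if original == 0:
        raise ValueError("original score must be non-zero")
    return (original - perturbed) / original * 100.0


if __name__ == "__main__":
    # Hypothetical retrieval result for one query: d2 is in the top 5, d6 is not.
    ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]
    print(recall_at_k(ranked, ["d2", "d6"], k=5))
    # Hypothetical scores for an original query and its rewritten version.
    print(relative_drop_percent(0.8, 0.6))
```

A relative drop is measured against the original score, so a fall from 0.8 to 0.6 counts as a drop of about 25%, not 20 percentage points.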
ON SOME TRENDS IN CONTEMPORARY ENGLISH STUDIES
O. V. Aleksandrova, Natalia B. Gvishiani, A. Lipgart
The academic work of the Department of English Linguistics is organized along several directions, which were defined by the founders of the Department, A. I. Smirnitskij and O. S. Akhmanova. Scholars work in the traditional areas of linguistic study as well as on new approaches based on interdisciplinary data drawn from neighbouring fields of knowledge that remain closely connected with philology and that place the human being, with an objective understanding of the surrounding world and a subjective attitude to it, at the centre of attention. The article deals with only some of the most important aspects of the departmental work: pragmaphonostylistics, functional stylistics, and the cognitive approach to the study of language, including corpus linguistics, a new direction which is indispensable for modern linguistics.
AHEAD OF THE WORD: REVIEW OF THE MONOGRAPH ‘WORD AND THOUGHT: ONE STEP AHEAD’ BY V. K. KHARCHENKO (MOSCOW, INFRA-M PUBL, 2024, 219 P.)
D. A. Romanov
The author reviews a new monograph by V. K. Kharchenko, Professor of Belgorod State National Research University. The monograph discusses philological epistemology from the perspective of the first third of the 21st century. V. K. Kharchenko takes a modern view of the problems of scientific knowledge in philology and attempts to outline the main paths for the further development of the field. The main task of the scholar-philologist of the new millennium is, first of all, to overcome many standardized approaches to language units, text, and style. According to Kharchenko, constancy does not contribute to the productivity of scientific research. For example, academic lexicography shows that the rich polysemy of a word, its ability to develop additional meanings, determines the degree of its stability in the active lexicon of native speakers. Traditional paradigms of philology are currently receiving new impulses for development due to the branching of research lines, greater attention to detail, and the active involvement of the scholar's individual creative vision. V. K. Kharchenko assesses the riskiness of modern philological concepts, paying attention to the possibility of taking into account not only proven facts but also hypotheses from the sphere of scientific irrationality and creative intuition. Using rich illustrative material, the author examines those philological achievements of previous epochs that were initially considered materially impossible and even fantastic. The broad scientific erudition of V. K. Kharchenko allows her to draw convincingly on facts from linguistics (mainly lexicology, lexicography, paremiology), literary criticism (literary and journalistic texts by Russian authors of the 18th – 21st centuries), art history (history of painting, sculpture, architecture, etc.), psychology (various sensory phenomena, synesthesia, effects on consciousness), physiology (including the physiology of higher nervous activity, the brain), etc. V. K. Kharchenko also notes unique cases of the emergence of new thematic areas in philology and new objects of research (for example, the rhyme). The book presents various stylistic phenomena that the author identifies in modern philological research, outlines the connections between philology and ethics, and postulates the importance of linguistic positivity and striving for success.
Derivatives of Latin Origin Going Back to the Indo-European Root *Gene- in the Russian Language
A. Belyakova, A. V. Grigorev
Aim. To establish the derivatives of the Indo-European root *gene- with the meaning of ‘to give birth, to occur, to generate, to produce’ in the Russian language, to describe their semantics, and to determine when these derivatives appeared in the Russian language. Methodology. The main research methods are descriptive, comparative, and comparative-historical; the technique of etymological analysis of derivatives was also used. Results. The identified words, which have phonetic and graphic forms that differ from each other, go back to a single Indo-European root *gene-. The data obtained allow us to trace the development of the meaning of the root *gene- at various stages of the development of the Russian language. Research implications. Studies in the field of etymological analysis allow us to trace the phonetic and semantic changes of words belonging to the same etymological nest. The results of the research will be reflected in the scientific project of the Department of General Linguistics named after I. G. Dobrodomov and the A. F. Losev Center for Russian Language and Culture (Institute of Philology, Moscow Pedagogical State University) – Derivational Dictionary of the Russian Language.
Zero-shot Generative Linguistic Steganography
Ke Lin, Yiyang Luo, Zijian Zhang
et al.
Generative linguistic steganography attempts to hide secret messages in covertext. Previous studies have generally focused on the statistical differences between the covertext and stegotext; however, ill-formed stegotext can readily be identified by humans. In this paper, we propose a novel zero-shot approach based on in-context learning for linguistic steganography to achieve better perceptual and statistical imperceptibility. We also design several new metrics and reproducible language evaluations to measure the imperceptibility of the stegotext. Our experimental results indicate that our method produces $1.926\times$ more innocent and intelligible stegotext than any other method.
Unveiling Linguistic Regions in Large Language Models
Zhihao Zhang, Jun Zhao, Qi Zhang
et al.
Large Language Models (LLMs) have demonstrated considerable cross-lingual alignment and generalization ability. Current research primarily focuses on improving LLMs' cross-lingual generalization capabilities. However, there is still a lack of research on the intrinsic mechanisms of how LLMs achieve cross-lingual alignment. From the perspective of region partitioning, this paper conducts several investigations on the linguistic competence of LLMs. We discover a core region in LLMs that corresponds to linguistic competence, accounting for approximately 1% of the total model parameters. Removing this core region by setting parameters to zero results in a significant performance decrease across 30 different languages. Furthermore, this core region exhibits significant dimensional dependence: perturbations to even a single parameter on specific dimensions lead to a loss of linguistic competence. Moreover, we discover that distinct monolingual regions exist for different languages, and disruption to these specific regions substantially reduces the LLMs' proficiency in those corresponding languages. Our research also indicates that freezing the core linguistic region during further pre-training can mitigate the issue of catastrophic forgetting (CF), a common phenomenon observed during further pre-training of LLMs. Overall, exploring the LLMs' functional regions provides insights into the foundation of their intelligence.
The Creolizing Turn and Its Archipelagic Directions
A. Kabir
Recent years have seen a resurgence of scholarly interest across disciplines around the concept “creolization” even as there has been some pushback against this development in other academic quarters. This article contextualizes this state of the art around “creolization” and presents an analytical overview of the term’s discursive history. First, I discuss the appearance of the term creole in several areas of the world as an epiphenomenon of the first wave of European expansionism from the fifteenth century onward. Second, I track the emergence of “Creole” as an analytical category within nineteenth-century philology and its further development within linguistics. Third, I focus on milestones in the move of “creole” to “creolization” as a category for theorists of culture. Finally, I discuss recuperations of creolization as a theoretical model, including my own work that articulates it together with theoretical approaches to archipelagos.
The Notion of “Fascination” and “Fascinativity” in Linguistic Poetics Discourse
Олена Скоробогатова, Антоніна Золотько
The study of phenomena that have the potential for verbal persuasive influence on the recipient as well as the identification and development of ways to counteract this influence are the challenges of modern linguistics. Poetry is considered to be one of the most ancient forms of verbal communication. The analysis of the nature of poetic text units provides the key not only to a deeper interpretation of the idea of a particular work but also to the explanation of general linguistic phenomena. The poetic discourse as a linguistic and creative environment regularly activates expressive and figurative potential of language units at different levels. Describing the processes of the grammatical level of poetic language, the terms “fascination” and “fascinativity” are often used. In this paper, the authors refer to the theoretical contributions of representatives of the Kharkiv School of Philology and scholars of other linguistic branches whose works are related to the issues of perception and apperception, and fascinativity of texts. Fascination is considered as an aspect of social communication complementary to information, and fascinativity as an ontological feature of a poetic text that is closely connected with the realisation of the author's supertask. Fascinativity is a basic notion, considered as a specific quality of the text – the ability to “enchant” the reader. Fascinativity is created by the author with the help of linguistic and discursive (in this case, poetic) means and affects the perception of poetry by the recipient. The research findings emphasise that the degree of fascinativity is strongly affected by all the immanent elements of a particular poetic text, the existing discursive practice, and tradition. The study of fascinativity features in a poetic text and fascination as an important aspect of social communication becomes convincing when scholars draw on recent research findings of modern linguists in the field of linguistic poetics. 
Combining the communicative and linguistic poetics approaches makes it possible to explain how the mechanism of fascination works in terms of communication and how the fascinativity of a poetic text and discourse is formed.
COMPARATIVE TYPOLOGY OF THE ENGLISH AND UZBEK LANGUAGES
Khusenova Mekhriniso Uktamovna
Comparative linguistics, or comparative-historical linguistics (formerly comparative philology), is a branch of historical linguistics that is concerned with comparing languages to establish their historical relatedness. This article focuses on the comparative typology of English and Uzbek and discusses the formation of comparative typology as a science, its methods of analysis, and its relations with other linguistic subjects. Key words: comparative typology, confrontative linguistics, contrastive linguistics, linguistic characterology, comparativists, notions of a type of a language and a type in a language, linguistic universals, recessives and uncials
Cultivating Intercultural Attitudes among EIL learners: A Case Study of Undergraduates
Muhammad Waqar Ali, Muhammad Iqbal, Sareer Badshah
Teaching English as an international language (EIL) necessitates promoting intercultural communicative competence (ICC) to enable individuals to survive and progress in multicultural societies. It thereby requires learners to develop intercultural attitudes, one of the essential components of ICC. The current paper seeks to gauge learners' intercultural attitudes before and after an intervention, and to measure the impact of the treatment on those attitudes. Based on material designed for promoting intercultural competence, a semester-long course was carried out as the intervention. Unlike the control group, the treatment group was exposed to the treatment. A self-report survey was administered to both groups before and after the treatment. Though both groups reported improvement in their attitudes, the mean differences and effect size values for the cumulative scale and subscales indicate a substantial difference between the attitudes of the two groups; it may thereby be inferred that the treatment proved effective in cultivating intercultural attitudes. Furthermore, female participants slightly surpassed male participants in certain attitudes, but the difference was infinitesimal.
English literature, Language. Linguistic theory. Comparative grammar
A longitudinal study on students’ self-regulated listening during transition to an English-medium transnational university in China
Sihan Zhou, Gene Thompson
Listening to academic content in English medium instruction (EMI) classrooms at university can be a demanding task for students who transition from first language (L1) instructed secondary schools. This longitudinal mixed methods study analyzes data from 316 students collected at the beginning, midterm, and the end of their first semester after entering an EMI transnational university in southeast China. The analysis of questionnaire responses revealed significant variations in students’ listening strategies over time, with a significant decrease in deep processing cognitive strategies at the midterm when content difficulty increased. Conversely, two types of metacognitive strategies (problem solving, plan evaluation) increased significantly during the second half of the semester. Informed by Zimmerman’s (2000) social cognitive self-regulated learning (SRL) model, thematic analysis of semi-structured interviews with 34 participants revealed that students generally developed a more top-down listening approach focusing on content learning over the semester and became more selective in their strategic and self-regulatory processes after the midterm “watershed” moment. Results highlight the importance of structured topic knowledge in EMI curriculum design and the necessity of strategy training in language support programs.
Exploring Linguistic Features for Turkish Text Readability
Ahmet Yavuz Uluslu, Gerold Schneider
This paper presents the first comprehensive study on automatic readability assessment of Turkish texts. We combine state-of-the-art neural network models with linguistic features at lexical, morphological, syntactic and discourse levels to develop an advanced readability tool. We evaluate the effectiveness of traditional readability formulas compared to modern automated methods and identify key linguistic features that determine the readability of Turkish texts.
Research on scheduling strategy for automated storage and retrieval system
Sai Geng, Lei Wang, Dongdong Li
et al.
With the continuous and rapid growth of transport demand, warehouse scheduling strategy has become a key issue in the field of logistics transportation. The structural differences of the warehouse, the automated storage and retrieval system (AS/RS) model and the two‐end dual stackers scheduling model (TDSM) are considered, and a new improved genetic algorithm (NIGA) is proposed. It can adjust the algorithm structure according to the density of population fitness values, and effectively optimize the stacker path. In the TDSM, an improved anti‐collision principle is proposed to avoid collision of the two stackers. In addition, combined with the optimal anti‐collision boundary inspection mechanism, the best working area for the two stackers is allocated by using NIGA. Finally, the NIGA is compared with the GA and the adaptive GA on specific storage and retrieval tasks. The simulation results show that the proposed NIGA clearly outperforms the other GAs in most instances, which indicates that it is an effective approach for the AS/RS and the TDSM scheduling optimization problem.
Computational linguistics. Natural language processing, Computer software
Self-Repair in Elicited Narrative Production in Speakers of Russian as the First (L1), Second (L2), and Heritage (HL) Language
Natalia Bogdanova-Beglarian, Kristina Zaides, Tatiana Verkhovtceva
et al.
The current study investigates self-repairs in the speech of three groups of Russian speakers: monolingual controls (N = 12) residing in the Russian Federation, for whom Russian is their first dominant language (L1); bilingual Russian–Hebrew speaking participants (N = 12), who acquired Russian as their Heritage Language (HL) in contact with the dominant Societal Hebrew in Israel; and bilingual Russian–Chinese speakers (N = 12) residing in the Russian Federation at the time of testing, for whom Russian is their second language (L2). Picture-elicited narratives were coded for instances of self-repairs, split into Conceptualizer Repairs (C-repairs), which imply pragmatic, semantic, or lexical changes, and Formulator Repairs (F-repairs), correcting different types of errors. In addition, self-repair initiators (such as cut-offs, hesitation pauses, and discourse markers) were annotated before each instance of self-repair. The results indicate that L2 speakers, in general, use self-repairs more frequently than L1 and HL speakers. L1 speakers hardly produced F-repairs, while HL and L2 speakers resorted to both C- and F-repairs. L1 speakers mainly used C-repairs for appropriacy, whereas HL and L2 speakers used C-repairs for rephrasing and lexical item change. As for F-repairs, HL speakers tended to change pronunciation and morphology, while L2 speakers implemented more morphological repairs. Lexical initiators of self-repairs were more common in L1 speech; however, in the L2 group we saw much more frequent cut-offs of repaired speech fragments. As such, varying self-repair strategies were employed by different speaker groups, shedding light on the underlying processes of language production.
There was also evidence of cross-linguistic transfer of non-lexical self-repair initiators: HL speakers resorted to prolongations as initiators in HL-Russian (a strategy that is common in their dominant language, Hebrew), whereas L1 speakers used vocalized and silent pauses more frequently.
Learning a second language via print: On the logical necessity of a fluent first language
Catherine L. Caldwell-Harris, Robert J. Hoffmeister
How Deaf children should be taught to read has long been debated. Severely or profoundly Deaf children, who face challenges in acquiring language from its spoken forms, must learn to read a language they do not speak. We refer to this as learning a language via print. How children can learn language via print is not a topic regularly studied by educators, psychologists, or language acquisition theorists. Nonetheless, Deaf children can do this. We discuss how Deaf children can learn a written language via print by mapping print words and phrases to sign language sequences. However, established, time-tested curricula for using a signed language to teach the print forms of spoken languages do not exist. We describe general principles for approaching this task, how it differs from acquiring a spoken language naturalistically, and empirical evidence that Deaf children's knowledge of a signed language facilitates and advances learning a printed language.
Communication. Mass media
Book Review: Brian McFarlane (ed.), The Encyclopaedia of British Film, 5th edition (Manchester: Manchester University Press, 2021)
Jonny Smith
Brian McFarlane (ed.), The Encyclopaedia of British Film, 5th edition (Manchester: Manchester University Press, 2021), pp. 968, ISBN: 978-1526159267, £133.20
Visual arts, Communication. Mass media
RuCoLA: Russian Corpus of Linguistic Acceptability
Vladislav Mikhailov, Tatiana Shamardina, Max Ryabinin
et al.
Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the application scope of LA in languages other than English is limited due to the lack of high-quality resources. To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), built from the ground up under the well-established binary LA approach. RuCoLA consists of $9.8$k in-domain sentences from linguistic publications and $3.6$k out-of-domain sentences produced by generative models. The out-of-domain set is created to facilitate the practical use of acceptability for improving language generation. Our paper describes the data collection protocol and presents a fine-grained analysis of acceptability classification experiments with a range of baseline approaches. In particular, we demonstrate that the most widely used language models still fall behind humans by a large margin, especially when detecting morphological and semantic errors. We release RuCoLA, the code of experiments, and a public leaderboard (rucola-benchmark.com) to assess the linguistic competence of language models for Russian.