Results for "Language. Linguistic theory. Comparative grammar"

Showing 20 of ~4,431,691 results · from DOAJ, arXiv, Semantic Scholar, CrossRef

arXiv Open Access 2026
Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models

Paola Merlo, Chunyang Jiang, Giuseppe Samo et al.

This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and targeted experiments on chunking and systematicity. BLMs are multiple-choice problems, structured at multiple levels: within each sentence, across the input sequence, and within each candidate answer. Because of their rich structure, these curated but naturalistic datasets are key to answering core questions about the abilities of current large language models: do LLMs detect linguistic objects and their properties? Do they detect and use systematic patterns across sentences? Are they more prone to linguistic or reasoning errors, and how do these interact? We show that BLMs, while challenging, can be solved at good levels of performance, in more than one language, with simple baseline models or, at better performance levels, with more tailored models. We show that their representations contain the grammatical objects and attributes relevant to solving a linguistic task. We also show that these solutions are reached by detecting systematic patterns across sentences. The paper supports the view that curated, structured datasets enable multi-faceted investigations of the properties of language and of large language models. Because they present a curated, articulated structure, because they comprise both learning contexts and expected answers, and because they are partly built by hand, BLMs fall into the category of datasets that can support explainability investigations and help answer why large language models behave the way they do.

en cs.CL
arXiv Open Access 2026
ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs

Inês Vieira, Inês Calvo, Iago Paulo et al.

As Large Language Models (LLMs) expand across multilingual domains, evaluating their performance in under-represented languages becomes increasingly important. European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR). To address this, we introduce ALBA, a linguistically grounded benchmark designed from the ground up to assess LLM proficiency on linguistics-related tasks in pt-PT across eight linguistic dimensions: Language Variety, Culture-bound Semantics, Discourse Analysis, Word Plays, Syntax, Morphology, Lexicology, and Phonetics and Phonology. ALBA is manually constructed by language experts and paired with an LLM-as-a-judge framework for scalable evaluation of pt-PT generated language. Experiments on a diverse set of models reveal performance variability across linguistic dimensions, highlighting the need for comprehensive, variety-sensitive benchmarks that support further development of tools in pt-PT.

en cs.CL, cs.AI
S2 Open Access 2026
Some differences of the number representation as the origin of the interference (on the example of Russian-Spanish language contact)

O. A. Gnatyuk

Aim. The aim of this study is to identify the psycholinguistic basis for interference caused by differences in the number category of nouns in Spanish and Russian. Methodology. An integrated approach is applied in the research. The main methods used in the work are continuous sampling, comparative analysis, and qualitative analysis. Results. The study revealed that interference errors caused by significant differences in the category of number can be grouped as follows: incorrect form of the accompanying dependent word (adjective, indefinite pronoun); incorrect form of the verb (notional verb, linking verb). The students' relatively high level of linguistic competence, corresponding to C1, indicates the psycholinguistic nature of the interference errors identified and analyzed during the study. Research implications. The results of the study contribute to translation theory and also support the hypothesis that the linguistic worldview of the native language influences the target language. The data obtained can be used in compiling bilingual dictionaries and grammar manuals.

DOAJ Open Access 2025
Digital language activism for minority languages: the case of Catalan

Sergi Alvarez-Vidal

The rise of the Internet and the evolution of language technologies, in particular machine translation systems and large language models, have radically transformed the conditions of access to knowledge, text production, and the global circulation of information. These innovations have facilitated multilingual communication, but they have also reinforced dynamics of linguistic concentration that can negatively affect the visibility, prestige, and functionality of minority languages, producing what has been called the "digital language divide". This phenomenon puts the sustainability of these languages in the digital ecosystem at risk by limiting their presence on technology platforms and restricting their capacity to adapt to new communicative environments. The article critically analyzes the impact of this divide on linguistic diversity and examines the role of digital language activism as a form of collective action aimed at counteracting these asymmetries. Within this framework, the case of Catalan is studied through two emblematic initiatives: the Viquipèdia and Softcatalà, a non-profit organization that pioneered the development of Catalan language technologies. Both initiatives illustrate how the sustained commitment of language communities can translate into the creation of digital infrastructures, open linguistic resources, and spaces of meaningful online use, beyond government policies.

Translating and interpreting
arXiv Open Access 2025
Can LLMs Solve and Generate Linguistic Olympiad Puzzles?

Neh Majmudar, Elena Filatova

In this paper, we introduce a combination of novel and exciting tasks: the solution and generation of linguistic puzzles. We focus on puzzles used in Linguistic Olympiads for high school students. We first extend the existing benchmark for the task of solving linguistic puzzles. We explore the use of Large Language Models (LLMs), including recent state-of-the-art models such as OpenAI's o1, for solving linguistic puzzles, analyzing their performance across various linguistic topics. We demonstrate that LLMs outperform humans on most puzzle types, except for those centered on writing systems and those involving understudied languages. We use the insights from the puzzle-solving experiments to inform the novel task of puzzle generation. We believe that automating puzzle generation, even for relatively simple puzzles, holds promise for expanding interest in linguistics and introducing the field to a broader audience. This finding highlights the importance of linguistic puzzle generation as a research task: such puzzles can not only promote linguistics but also support the dissemination of knowledge about rare and understudied languages.

en cs.CL
arXiv Open Access 2025
UrBLiMP: A Benchmark for Evaluating the Linguistic Competence of Large Language Models in Urdu

Farah Adeeba, Brian Dillon, Hassan Sajjad et al.

Multilingual Large Language Models (LLMs) have shown remarkable performance across various languages; however, they often include significantly less data for low-resource languages such as Urdu compared to high-resource languages like English. To assess the linguistic knowledge of LLMs in Urdu, we present the Urdu Benchmark of Linguistic Minimal Pairs (UrBLiMP), i.e., pairs of minimally different sentences that contrast in grammatical acceptability. UrBLiMP comprises 5,696 minimal pairs targeting ten core syntactic phenomena, carefully curated using the Urdu Treebank and diverse Urdu text corpora. A human evaluation of UrBLiMP annotations yielded a 96.10% inter-annotator agreement, confirming the reliability of the dataset. We evaluate twenty multilingual LLMs on UrBLiMP, revealing significant variation in performance across linguistic phenomena. While LLaMA-3-70B achieves the highest average accuracy (94.73%), its performance is statistically comparable to that of other top models such as Gemma-3-27B-PT. These findings highlight both the potential and the limitations of current multilingual LLMs in capturing fine-grained syntactic knowledge in low-resource languages.

en cs.CL
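Benchmarks of linguistic minimal pairs such as UrBLiMP are typically scored by checking whether a model assigns a higher probability to the grammatical member of each pair. A minimal sketch of that accuracy computation, assuming sentence log-probabilities have already been obtained from some model; the function name and the numbers below are illustrative, not from the paper:

```python
# Score a minimal-pair benchmark: the model is "correct" on a pair when it
# assigns a higher log-probability to the grammatical sentence than to the
# ungrammatical one. The log-probabilities below are invented stand-ins.

def minimal_pair_accuracy(pairs):
    """pairs: list of (logprob_grammatical, logprob_ungrammatical)."""
    correct = sum(1 for good, bad in pairs if good > bad)
    return correct / len(pairs)

pairs = [
    (-42.1, -45.7),  # model prefers the grammatical sentence
    (-38.0, -37.2),  # model prefers the ungrammatical one -> error
    (-51.3, -55.9),
]

print(minimal_pair_accuracy(pairs))  # 2 of 3 pairs scored correctly
```

The same loop generalizes to per-phenomenon breakdowns by grouping pairs before averaging, which is how variation across syntactic phenomena is usually reported.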
arXiv Open Access 2025
Developing an Open Conversational Speech Corpus for the Isan Language

Adisai Na-Thalang, Chanakan Wittayasakpan, Kritsadha Phatcharoen et al.

This paper introduces the development of the first open conversational speech dataset for the Isan language, the most widely spoken regional dialect in Thailand. Unlike existing speech corpora that are primarily based on read or scripted speech, this dataset consists of natural speech, thereby capturing authentic linguistic phenomena such as colloquialisms, spontaneous prosody, disfluencies, and frequent code-switching with central Thai. A key challenge in building this resource lies in the lack of a standardized orthography for Isan. Current writing practices vary considerably, partly due to the lexical tone differences between Thai and Isan. This variability complicates the design of transcription guidelines and raises questions about consistency, usability, and linguistic authenticity. To address these issues, we establish practical transcription protocols that balance the need for representational accuracy with the requirements of computational processing. By releasing this dataset as an open resource, we aim to contribute to inclusive AI development, support research on underrepresented languages, and provide a basis for addressing the linguistic and technical challenges inherent in modeling conversational speech.

en cs.CL
arXiv Open Access 2025
Game Theory Meets Large Language Models: A Systematic Survey with Taxonomy and New Frontiers

Haoran Sun, Yusen Wu, Peng Wang et al.

Game theory is a foundational framework for analyzing strategic interactions, and its intersection with large language models (LLMs) is a rapidly growing field. However, existing surveys mainly focus narrowly on using game theory to evaluate LLM behavior. This paper provides the first comprehensive survey of the bidirectional relationship between Game Theory and LLMs. We propose a novel taxonomy that categorizes the research in this intersection into four distinct perspectives: (1) evaluating LLMs in game-based scenarios; (2) improving LLMs using game-theoretic concepts for better interpretability and alignment; (3) modeling the competitive landscape of LLM development and its societal impact; and (4) leveraging LLMs to advance game models and to solve corresponding game theory problems. Furthermore, we identify key challenges and outline future research directions. By systematically investigating this interdisciplinary landscape, our survey highlights the mutual influence of game theory and LLMs, fostering progress at the intersection of these fields.

en cs.AI, cs.GT
S2 Open Access 2025
The history of research on parts of speech in the Chinese language under the influence of Europe (up to the 1950s).

C. Chao

The subject of the study is the history of the examination of parts of speech in the Chinese language under European influence (up to the 1950s). The object of the research is the evolution of theories on the classification of Chinese parts of speech. The author examines in detail aspects such as how the system of Latin grammar influenced the studies of Chinese grammar through the work of Ma Jianzhong (1898), how in the first half of the 20th century, under the influence of English grammatical theories, Li Jinsi (1924) proposed the principle of "sentence as a basis," which shifted the focus of research on parts of speech to syntactic functions, and the key role of the Soviet-Chinese discussion of the 1950s. Special attention is paid to how the history of research on parts of speech in the Chinese language (an isolating language) has progressed from simple imitation to integration with the characteristics of the Chinese language. Historical-linguistic and descriptive analysis is applied using a chronological approach, comparative study of European and Chinese grammatical models, analysis of the works of key scholars, and materials from discussions in the 1950s. The main conclusions of the conducted study are: The study of parts of speech in the Chinese language has been heavily influenced by European linguistic paradigms from the very beginning. The history has evolved from straightforward imitation (Ma Jianzhong, 1898) to integration with the features of the Chinese language. Scientific discussions between Soviet and Chinese scholars in the 1950s deepened theories and ultimately formed comprehensive classification standards combining morphological, semantic, and syntactic criteria. The novelty of the study: This article systematically investigates the history of research on parts of speech in the Chinese language under European influence (up to the 1950s). These studies not only continuously deepened the understanding of the typological characteristics of the Chinese language but also provided a theory from a non-Indo-European perspective for general linguistics.

S2 Open Access 2025
From Principles to Practice: The Plan-Generate-Verify-Adapt (PGVA) Framework for AI Integration in Luganda Language Instruction

T. Kakinda, M. Mulumba, F. Masaazi et al.

The integration of artificial intelligence in indigenous language education requires systematic pedagogical frameworks that translate theoretical knowledge into practical implementation. This study developed and validated the Plan-Generate-Verify-Adapt (PGVA) framework, an empirically-grounded operational structure guiding AI-chatbot integration in competency-based indigenous language grammar instruction. Using Design-Based Research methodology across two iterative implementation cycles with 26 preservice Luganda teachers in Uganda, the study identified 15 design principles organized into six thematic clusters: Knowledge Integration, AI Interaction, Pedagogical Structures, Cultural-Linguistic Integrity, Assessment and Metacognition, and Contextual Responsiveness. Analysis of interaction logs, reflective journals, and lesson plans revealed that these principles coalesce into a four-phase cyclical workflow. The PGVA framework positions teachers as cultural-linguistic guardians who systematically plan learning outcomes before AI engagement, collaboratively generate materials with AI through strategic prompting, verify outputs against community-authoritative sources, and adapt verified materials for specific learner contexts. Implementation data demonstrated that adherence to the PGVA sequence correlated strongly with material quality (r=0.73, p<0.001), while phase-skipping or resequencing produced culturally inappropriate or pedagogically weak outputs. The framework addresses critical gaps in existing AI pedagogy models by providing explicit procedural guidance suitable for resource-constrained indigenous language contexts where technological outputs require systematic validation. Comparative analysis positions PGVA's distinctive contributions relative to AI-TEACH, IAPM, and AI-Oriented Pedagogy frameworks. Implications for indigenous language teacher preparation, practical implementation guidance, and transferability to other contexts are discussed. The PGVA framework demonstrates that successful AI integration requires systematic workflow structures grounded in sound pedagogical theory rather than assuming technology alone transforms practice.
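The reported adherence-quality correlation (r=0.73, p<0.001) is a standard Pearson coefficient. A minimal sketch of how such a coefficient is computed, using toy numbers in place of the study's actual scores:

```python
# Pearson correlation coefficient, computed from scratch.
# The adherence/quality values are invented for illustration only.

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# Hypothetical per-teacher scores: PGVA-sequence adherence vs. material quality.
adherence = [1, 2, 3, 4, 5]
quality = [2, 3, 5, 4, 6]

print(round(pearson_r(adherence, quality), 2))  # strong positive correlation
```

A value near 1 indicates that higher adherence goes with higher quality, which is the pattern the study reports.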

S2 Open Access 2024
The impact of computerized dynamic assessment on the explicit and implicit knowledge of grammar

Hossein Kargar Behbahani, Sedigheh Karimpour

Abstract This study investigates the transformative potential of Computerized Dynamic Assessment (CDA) on learners' explicit and implicit language knowledge of grammar, focusing on the past perfect tense. Grounded in sociocultural theory (SCT), CDA integrates assessment and mediation within the Zone of Proximal Development (ZPD), offering personalized support to learners. The interface position in language acquisition theory posits that explicit knowledge can evolve into implicit knowledge through extensive practice. To explore this dynamic, an interventionist CDA was employed, in which 27 learners received tailored mediation during assessment, while the control group (N = 25) did not. The findings challenge the conventional separation between explicit and implicit knowledge, revealing that CDA significantly enhances both forms of language knowledge. This study extends the applicability of SCT in language education and highlights CDA's potential to individualize instruction, informing evidence-based language education policies. Limitations and suggestions for further research are also discussed, emphasizing the need for broader investigations across linguistic features, proficiency levels, and educational contexts.

S2 Open Access 2024
Rich Descriptions and Simple Explanations in Morphosyntax and Language Acquisition

This volume offers new perspectives on the tension between the rich patterns of language variation that emerge from comparative studies and the quest for simple theoretical primitives. The chapters explore the debate between Cartography and Minimalism: on the one hand, the need for detailed and articulated descriptions of the clausal architecture, and on the other, the endeavor to reduce the theoretical apparatus to fundamental computational mechanisms. The first part of the book begins with a reflection on the goals of modern linguistic theory, and investigates the principles of human language, in an effort to subsume the regularities of particular grammars under a small set of morphosyntactic and semantic primitives. The second part examines the clausal structure—both the CP-layer and the IP-layer—from a comparative perspective, which directly relates to the fundamental questions of universality, linguistic variation, and learnability addressed in the first part of the book. With chapters written by world-leading linguists who analyze a wide range of old and new phenomena, the volume will be a valuable resource for researchers and students interested in theoretical linguistics and language development.

S2 Open Access 2024
Grammar intervention using graduated input type variation (GITV) for pre-primary children: A single-case experimental design study

Anita M-Y Wong, Hannah H-I Sou, Lexi H-T Ip et al.

Abstract Purpose This study examined the early efficacy of a new theory-driven principle of grammar intervention, graduated input type variation (GITV). Method Three Cantonese-speaking children, aged between 4;01 and 5;10, with oral language difficulties participated in this single-baseline, within-participant single-case experimental study. The children received a total of 300 teaching episodes of the target serial verb construction via focused stimulation and recast over ten 30- to 45-minute sessions. The 30 exemplars of the target included low type variation of the verbs in each of the first five sessions, followed by high type variation in the remaining sessions. Result Visual analysis revealed that all children improved their performance on the target construction but not on the control vocabulary in the probes, suggesting a treatment effect. Maintenance of treatment effects was also observed in all children. Positive results in across-behaviour generalisation to the untrained construction were observed in all children. Generalisation to other less structured linguistic contexts and to the narrative retell discourse context was minimal and observed in one child only. Conclusion Preliminary evidence suggested early efficacy of GITV as a principle for grammar intervention. Modifications in the research methodology are recommended for future studies involving children with developmental language disorder.

1 citation en Medicine
S2 Open Access 2024
Model of Comparative Analysis of the Structure and Semantics of Causal Constructions in Multi-Structural Languages

Bimbetov Alimjan

This article explores the application of construction grammar, a linguistic approach that integrates cognitive, syntactic, and semantic frameworks, to analyse causative constructions in English and Russian. Construction grammar interprets not only the form and content of linguistic components but also the structure itself, which influences and imposes constraints on these components. The study considers the methodological evolution from traditional syntactic theories to cognitive models, particularly the influence of Ch. Fillmore's work. By focusing on causative constructions, specifically analytical causatives like "The police made him confess," the research draws upon the theories of V.P. Nedyalkov, G.G. Silnitsky, and A.A. Kholodovich, among others, to provide a comparative analysis of causative structures in different languages. Cognitive and generative grammar perspectives, including R. King's spatial metaphor and L. Talmy's force dynamics, are integrated to establish a comprehensive framework for understanding the universal and language-specific properties of causative expressions. The findings highlight the importance of the cognitive interpretation of syntactic structures, the role of energy transfer between agents and patients, and the diverse semantic types of causative situations. The study concludes by suggesting future research directions in the syntactic classification and formal expression of causative components.

1 citation en
S2 Open Access 2024
Multimodal English Language Pedagogy and Social Justice: A Sociohistorical Perspective from South Korea

Kenneth Burke

This research investigates English language education in South Korea, situated within the broader sociohistorical and global influences that shape the education system. The theory-based study sets out to critique the prevailing grammar-intensive and rote learning methodologies, proposing instead a pedagogical paradigm that foregrounds holistic, inclusive practices and purposeful student agency. By employing a qualitative analysis, it explores multimodality while endorsing the fusion of interdisciplinary insights to enhance both the learning process and students' communicative competencies. Findings underscore the pivotal role of educators in fostering significant pedagogical shifts, necessitating deep introspection and an informed grasp of sociohistorical and cultural contexts. The research reveals a need for an approach to learning that harmonises linguistic skills with broad sociocultural understanding, advocating for a teaching model that seamlessly integrates historical consciousness, creativity and innovative strategies, and a commitment to principles of social justice. The significance of this study lies in its potential to influence policy and practice, encouraging a move towards more engaged, reflective and socially responsible English language instruction within the region.

1 citation en
arXiv Open Access 2024
Manipulating language models' training data to study syntactic constraint learning: the case of English passivization

Cara Su-Yi Leong, Tal Linzen

Grammatical rules in natural languages are often characterized by exceptions. How do language learners learn these exceptions to otherwise general patterns? Here, we study this question through the case study of English passivization. While passivization is in general quite productive, there are cases where it cannot apply (cf. the following sentence is ungrammatical: *One hour was lasted by the meeting). Using neural network language models as theories of language acquisition, we explore the sources of indirect evidence that a learner can leverage to learn whether a verb can be passivized. We first characterize English speakers' judgments of exceptions to the passive, and confirm that speakers find some verbs more passivizable than others. We then show that a neural network language model's verb passivizability judgments are largely similar to those displayed by humans, suggesting that evidence for these exceptions is available in the linguistic input. Finally, we test two hypotheses as to the source of evidence that language models use to learn these restrictions: frequency (entrenchment) and semantics (affectedness). We do so by training models on versions of the corpus that have had sentences of the types implicated by each hypothesis removed, altered, or introduced. We find support for both hypotheses: entrenchment and affectedness make independent contributions to a verb's passivizability. From a methodological point of view, this study highlights the utility of altering a language model's training data for answering questions where complete control over a learner's input is vital.

en cs.CL
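The corpus-manipulation method described above depends on removing or altering sentences that contain particular verb forms before retraining a language model. A minimal sketch of the removal step, assuming a sentence-segmented corpus; the corpus, verb list, and function name here are toy illustrations, not the paper's materials:

```python
import re

# Counterfactual corpus editing in the spirit of the entrenchment test:
# drop every sentence containing a target verb form, so a model retrained
# on the filtered corpus sees that verb less often (or not at all).

def filter_corpus(sentences, verb_forms):
    """Return only the sentences that contain none of the given verb forms."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, verb_forms)) + r")\b",
        re.IGNORECASE,
    )
    return [s for s in sentences if not pattern.search(s)]

corpus = [
    "The meeting lasted one hour.",
    "She read the report.",
    "The concert lasts two hours.",
]

filtered = filter_corpus(corpus, ["last", "lasts", "lasted"])
print(len(filtered))  # only the sentence without any 'last' form remains
```

In practice one would also lemmatize rather than enumerate surface forms, but the word-boundary regex keeps the sketch self-contained.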
arXiv Open Access 2024
Linguistic Analysis, Description, and Typological Exploration with Categorial Grammar (TheBench Guide)

Cem Bozsahin

TheBench is a tool to study monadic structures in natural language. It is for writing monadic grammars to explore analyses, compare diverse languages through their categories, and train models of grammar from form-meaning pairs where syntax is a latent variable. Monadic structures are binary combinations of elements that employ semantics of composition only. TheBench is essentially old-school categorial grammar used to syntacticize the idea, with the implication that although syntax is autonomous (recall \emph{colorless green ideas sleep furiously}), the treasure is in the baggage it carries at every step, viz. semantics, more narrowly, predicate-argument structures indicating the choice of categorial reference and its consequent placeholders for decision in such structures. There is some new thought in the old school. Unlike traditional categorial grammars, application is turned into composition in monadic analysis. Moreover, every correspondence requires specifying two command relations, one on syntactic command and the other on semantic command. A monadic grammar in TheBench contains only synthetic elements (called `objects' in the category theory of mathematics) that are shaped by this analytic invariant, viz. composition. Both ingredients (command relations) of any analytic step must therefore be functions (`arrows' in category theory). TheBench is one implementation of the idea for iterative development of such functions along with a grammar of synthetic elements.

en cs.CL
S2 Open Access 2023
Spanish Resource Grammar Version 2023

Olga Zamaraeva, Carlos Gómez-Rodríguez

We present the latest version of the Spanish Resource Grammar (SRG), a grammar of Spanish implemented in the HPSG formalism. Such grammars encode a complex set of hypotheses about syntax, making them a resource for empirical testing of linguistic theory. They also encode a strict notion of grammaticality, which makes them a resource for natural language processing applications in computer-assisted language learning. This version of the SRG uses the recent version of the Freeling morphological analyzer and is released along with an automatically created, manually verified treebank of 2,291 sentences. We explain the treebanking process, emphasizing how it differs from treebanking with manual annotation and how it contributes to empirically driven development of syntactic theory. The treebank's high level of consistency and detail makes it a resource for training high-quality semantic parsers and, more generally, systems that benefit from precise and detailed semantics. Finally, we present the grammar's coverage and overgeneration on 100 sentences from a learner corpus, a new research line related to developing methodologies for robust empirical evaluation of hypotheses in second language acquisition.

1 citation en Computer Science

Page 10 of 221,585