Imperfect Language, Artificial Intelligence, and the Human Mind: An Interdisciplinary Approach to Linguistic Errors in Native Spanish Speakers
Francisco Portillo López
Linguistic errors are not merely deviations from normative grammar; they offer a unique window into the cognitive architecture of language and expose the current limitations of artificial systems that seek to replicate them. This project proposes an interdisciplinary study of linguistic errors produced by native Spanish speakers, with the aim of analyzing how current large language models (LLMs) interpret, reproduce, or correct them. The research integrates three core perspectives: theoretical linguistics, to classify and understand the nature of the errors; neurolinguistics, to contextualize them within real-time language processing in the brain; and natural language processing (NLP), to evaluate how models interpret these errors. A purpose-built corpus of more than 500 authentic errors produced by native Spanish speakers will serve as the foundation for empirical analysis. These errors will be tested against AI models such as GPT or Gemini to assess their interpretative accuracy and their ability to generalize patterns of human linguistic behavior. The project contributes not only to the understanding of Spanish as a native language but also to the development of NLP systems that are more cognitively informed and capable of engaging with the imperfect, variable, and often ambiguous nature of real human language.
Zero-Shot Grammar Competency Estimation Using Large Language Model Generated Pseudo Labels
Sourya Dipta Das, Shubham Kumar, Kuldeep Yadav
Grammar competency estimation is essential for assessing linguistic proficiency in both written and spoken language; however, the spoken modality presents additional challenges due to its spontaneous, unstructured, and disfluent nature. Developing accurate grammar scoring models further requires extensive expert annotation, making large-scale data creation impractical. To address these limitations, we propose a zero-shot grammar competency estimation framework that leverages unlabeled data and Large Language Models (LLMs) without relying on manual labels. During training, we employ LLM-generated predictions on unlabeled data by using grammar competency rubric-based prompts. These predictions, treated as pseudo labels, are utilized to train a transformer-based model through a novel training framework designed to handle label noise effectively. We show that the choice of LLM for pseudo-label generation critically affects model performance and that the ratio of clean-to-noisy samples during training strongly influences stability and accuracy. Finally, a qualitative analysis of error intensity and score prediction confirms the robustness and interpretability of our approach. Experimental results demonstrate the efficacy of our approach in estimating grammar competency scores with high accuracy, paving the way for scalable, low-resource grammar assessment systems.
Inductive Linguistic Reasoning with Large Language Models
Raghav Ramji, Keshav Ramji
Evaluating large language models (LLMs) on their linguistic reasoning capabilities is an important task to understand the gaps in their skills that may surface during large-scale adoption. In this work, we investigate the abilities of such models to perform abstract multilingual reasoning through the lens of linguistic puzzles on extremely low-resource languages. As these translation tasks involve inductive and deductive reasoning from reference instances, we examine whether diverse auxiliary demonstrations can be automatically induced from seed exemplars through analogical prompting. We employ a two-stage procedure, first generating analogical exemplars with a language model, and then applying them in-context along with provided target language exemplars. Our results on the modeLing dataset show that analogical prompting is effective in eliciting models' knowledge of language grammar similarities, boosting the performance of GPT-4o by as much as 8.1% and Llama-3.1-405B-Instruct by 5.9% over chain-of-thought approaches. These gains are attributable to the analogical demonstrations, both when self-generated and when produced by weaker multilingual models. Furthermore, we demonstrate that our method generalizes to other tasks present in Linguistics Olympiad competitions, achieving sizable improvements across all problem types and difficulty levels included in the LINGOLY dataset with GPT-4o. We also report several findings about interesting phenomena which drive linguistic reasoning performance, suggesting that such puzzles are a valuable benchmark for new reasoning methods.
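The two-stage procedure described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `analogical_prompting`, the `generate` callable (a stand-in for any text-in, text-out language model call), and the prompt wording are hypothetical and are not the authors' actual prompts or code.

```python
from typing import Callable, List


def analogical_prompting(
    puzzle: str,
    target_exemplars: List[str],
    generate: Callable[[str], str],  # hypothetical model call: prompt in, text out
    n_analogies: int = 2,
) -> str:
    """Two-stage sketch: induce analogical exemplars, then solve in-context."""
    exemplar_block = "\n".join(target_exemplars)

    # Stage 1: ask the model for analogical demonstrations based on the seed exemplars.
    stage1_prompt = (
        "Here are translation exemplars from a low-resource language:\n"
        f"{exemplar_block}\n"
        f"Write {n_analogies} analogous worked examples from other languages "
        "that illustrate similar grammatical patterns."
    )
    analogies = generate(stage1_prompt)

    # Stage 2: place the generated analogies in context with the original
    # target-language exemplars and ask for the solution to the puzzle.
    stage2_prompt = (
        f"Analogical demonstrations:\n{analogies}\n\n"
        f"Target-language exemplars:\n{exemplar_block}\n\n"
        f"Now solve:\n{puzzle}"
    )
    return generate(stage2_prompt)
```

Because `generate` is passed in as a parameter, stage 1 and stage 2 could in principle be served by different models, which matches the abstract's remark that the analogical demonstrations may also come from a weaker multilingual model; the sketch uses a single callable for brevity.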
Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)
Hansi Hettiarachchi, Tharindu Ranasinghe, Paul Rayson
et al.
The first Workshop on Language Models for Low-Resource Languages (LoResLM 2025) was held in conjunction with the 31st International Conference on Computational Linguistics (COLING 2025) in Abu Dhabi, United Arab Emirates. This workshop mainly aimed to provide a forum for researchers to share and discuss their ongoing work on language models (LMs) focusing on low-resource languages, following the recent advancements in neural language models and their linguistic biases towards high-resource languages. LoResLM 2025 attracted notable interest from the natural language processing (NLP) community, resulting in 35 accepted papers from 52 submissions. These contributions cover a broad range of low-resource languages from eight language families and 13 diverse research areas, paving the way for future possibilities and promoting linguistic inclusivity in NLP.
Large language models and linguistic intentionality
Jumbly Grindrod
Do large language models like Chat-GPT or LLaMa meaningfully use the words they produce? Or are they merely clever prediction machines, simulating language use by producing statistically plausible text? There have already been some initial attempts to answer this question by showing that these models meet the criteria for entering meaningful states according to metasemantic theories of mental content. In this paper, I will argue for a different approach - that we should instead consider whether language models meet the criteria given by our best metasemantic theories of linguistic content. In that vein, I will illustrate how this can be done by applying two such theories to the case of language models: Gareth Evans' (1982) account of naming practices and Ruth Millikan's (1984, 2004, 2005) teleosemantics. In doing so, I will argue that it is a mistake to think that the failure of LLMs to meet plausible conditions for mental intentionality thereby renders their outputs meaningless, and that a distinguishing feature of linguistic intentionality - dependency on a pre-existing linguistic system - allows for the plausible result that LLM outputs are meaningful.
Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages
Poulami Ghosh, Raj Dabre, Pushpak Bhattacharyya
Pre-trained language models (PLMs) are known to be susceptible to perturbations to the input text, but existing works do not explicitly focus on linguistically grounded attacks, which are subtle and more prevalent in nature. In this paper, we study whether PLMs are agnostic to linguistically grounded attacks or not. To this end, we offer the first study addressing this, investigating different Indic languages and various downstream tasks. Our findings reveal that although PLMs are susceptible to linguistic perturbations, when compared to non-linguistic attacks, PLMs exhibit a slightly lower susceptibility to linguistic attacks. This highlights that even constrained attacks are effective. Moreover, we investigate the implications of these outcomes across a range of languages, encompassing diverse language families and different scripts.
Minderheitensprache im didaktischen Kontext am Beispiel des Deutschunterrichts in Polen
Ryszard Ziaja
In the following article, an attempt is made to define basic approaches for a future successful design of German minority teaching in Poland. On the one hand, reference is made to the thirty-year history and current situation of teaching in Poland and, on the other hand, to the already researched fields of activity of German minority teaching in Eastern Europe. Linguistic as well as cultural and methodological aspects are explored. The contribution should be understood as a scientific suggestion to redefine the term minority language in a didactic context.
Language. Linguistic theory. Comparative grammar
Translators who own it: A case study on how doxa and psychological ownership impact translators’ engagement and job satisfaction
Esther Monzó-Nebot
This paper explores the job satisfaction of translators working for an international intergovernmental organization. The extant literature on translators’ job satisfaction has explored a number of constructs. Based on developments in the field of organizational theory and the complexity of translation as a job, it is argued that psychological ownership may prove an adequate framework to explain translators’ job satisfaction and instrumental in establishing a dialogue between the various analyses of different workplaces in the field of translation and interpreting studies. The study focuses on a specific multilingual intergovernmental organization and draws on interviews with 17 Spanish-native translators of different nationalities. Their feelings of ownership are analyzed, and variations become apparent in how they relate to the constructs of psychological ownership (feelings of control, intimacy of knowledge, and self-investment). Exploring patterns shows those variations to be related to translators’ differing translation doxas, that is, their divergent, competing, and sometimes conflicting understandings of what translation is and should be. Furthermore, relationships between psychological ownership, translation doxa, and translators’ efforts to advance their own doxas in the organization are examined with a view towards creating means to engage professional translators in advancing a doxa shaped by and for translators across workplaces.
Translating and interpreting
GraMMaR: Ground-aware Motion Model for 3D Human Motion Reconstruction
Sihan Ma, Qiong Cao, Hongwei Yi
et al.
Demystifying complex human-ground interactions is essential for accurate and realistic 3D human motion reconstruction from RGB videos, as it ensures consistency between the humans and the ground plane. Prior methods have modeled human-ground interactions either implicitly or in a sparse manner, often resulting in unrealistic and incorrect motions when faced with noise and uncertainty. In contrast, our approach explicitly represents these interactions in a dense and continuous manner. To this end, we propose a novel Ground-aware Motion Model for 3D Human Motion Reconstruction, named GraMMaR, which jointly learns the distribution of transitions in both pose and interaction between every joint and ground plane at each time step of a motion sequence. It is trained to explicitly promote consistency between the motion and distance change towards the ground. After training, we establish a joint optimization strategy that utilizes GraMMaR as a dual-prior, regularizing the optimization towards the space of plausible ground-aware motions. This leads to realistic and coherent motion reconstruction, irrespective of the assumed or learned ground plane. Through extensive evaluation on the AMASS and AIST++ datasets, our model demonstrates good generalization and discriminating abilities in challenging cases including complex and ambiguous human-ground interactions. The code will be available at https://github.com/xymsh/GraMMaR.
Re/Deconstructing voices of (female) translators: The case of Bolesława Kopelówna (1897-1961)
Joanna Sobesto
The article presents the life and work of Bolesława Kopelówna, a Polish literary translator who was especially active (and widely criticised) in the interwar years in Poland, and is now almost completely forgotten. The article attempts to answer the following questions: why was Kopelówna so intensely criticised? Why has she disappeared from the collective memory? Why was she so active in the field of translation? And, no less crucially, who was this enigmatic figure of Bolesława Kopelówna? Through an application of microhistorical tools to fragments of Kopelówna’s life and work, I will re/deconstruct her seemingly non-existing archive. Combining interdisciplinary tools from literary history, history and feminist studies, my aim is not only to bring back the voice of a silenced, overlooked, and underestimated translator, but also to encourage other researchers to attempt to fill blank spaces in translation history.
Translating and interpreting
Methodological Challenges in Audiovisual Translation: Experimenting New Software for Multimodal Corpus-based Analysis
Dora Renna
Audiovisual translation has long struggled to strike a balance between corpus-based analysis of large amounts of text and the need to systematically integrate multimodality in its research scope, in order to fully acknowledge the complex nature of the audiovisual product. This paper reports on an experiment currently being conducted at Ca’ Foscari University of Venice, in collaboration with the University of Basel (Switzerland): using existing software (created for pragmatics) for audiovisual translation. The main aim is to make sure this software is able to support the researcher in transcribing, annotating, adding metadata, managing and querying text and video files. The software under test has not previously been used for parallel aligned audiovisual text. The contribution first briefly describes a framework developed to analyse language variation and multimodality (with a focus on character design) in audiovisual translation, which was then implemented in the software. Subsequently, the software itself is described in detail, with specific attention to its potential and limits within the field of Translation Studies and audiovisual translation. This is done by showing examples from a pilot study that belongs to a broader corpus currently under construction.
Language. Linguistic theory. Comparative grammar, Communication. Mass media
Aproximación al estudio de las locuciones adjetivas en la comunidad de habla de la ciudad de Barranquilla
Luis Fernando Nieto Ruiz
This article falls within sociolinguistics, a discipline that makes it possible to identify a community's way of speaking and its bearing on the processes of change that take place within the language, as reflected in linguistic variety. The social variables are gender, level of education, and age; as for the linguistic variants, phraseology was taken as the general fixed variant and adjectival locutions as the specific variants. The main objective is to find out whether this type of lexical resource is used to a significant extent in the speech community of Barranquilla. To this end, a corpus was compiled on the basis of the Diccionario Fraseológico Documentado del Español Actual (2009). The sample comprised 48 men and 48 women, three levels of education, and four age groups. The methodology is sociolinguistic in nature, with a research approach based on the analysis of the information collected. In particular, the result derived from the social variables, in line with the linguistic variant prepositional phrase plus adjective, showed that women make greater use of adjectival locutions. The identities they carry tend to imply a boundary-marking function through a particular use of language and a highly marked internal coherence, thanks to the command of certain linguistic features.
Philology. Linguistics, Language. Linguistic theory. Comparative grammar
A Gynocritical Study of The Color Purple by Alice Walker: A corpus-based Analysis of Adjectives
Zunaira Zafar, Haleema Majeed, Tehseen Zahra
Since the beginning of scholarly work on women, most of the research has been carried out by the opposite gender. As a result, limited work has been done on how a woman portrays another woman in her writing. Moreover, limited research has used corpus tools for gynocritical analysis. The present research therefore examines The Color Purple by Alice Walker, employing a corpus-based approach to investigate the author's representation of the female character in the novel. The positive or negative portrayal of women in the novel was investigated through the author’s use of adjectives. Showalter's (2009) theory of gynocriticism served as the theoretical framework for the research. A corpus-based methodology was then employed to analyze how Alice Walker portrays the female character in her novel through the use of adjectives. An in-depth analysis shows that Alice Walker depicts woman as a helpless and sidelined being who can become resilient after suffering frightful circumstances. The research also opens new avenues for researchers to analyze texts from a gynocritical perspective using corpus techniques.
Keywords: Adjectives; Corpus-based analysis; Gynocriticism; The Color Purple; Woman
Language. Linguistic theory. Comparative grammar, Oral communication. Speech
An Open Natural Language Processing Development Framework for EHR-based Clinical Research: A case demonstration using the National COVID Cohort Collaborative (N3C)
Sijia Liu, Andrew Wen, Liwei Wang
et al.
Despite recent advances in clinical natural language processing (NLP), the clinical and translational research community has shown some resistance to adopting NLP models because of their limited transparency, interpretability, and usability. In this study, we propose an open natural language processing development framework. We evaluated it through the implementation of NLP algorithms for the National COVID Cohort Collaborative (N3C). Based on the interest in information extraction from COVID-19 related clinical notes, our work includes 1) an open data annotation process using COVID-19 signs and symptoms as the use case, 2) a community-driven ruleset composing platform, and 3) a synthetic text data generation workflow to generate texts for information extraction tasks without involving human subjects. The corpora were derived from texts from three different institutions (Mayo Clinic, University of Kentucky, University of Minnesota). The gold standard annotations were tested with a single institution's (Mayo) ruleset. This resulted in F-scores of 0.876, 0.706, and 0.694 for the Mayo, Minnesota, and Kentucky test datasets, respectively. The study, a consortium effort of the N3C NLP subgroup, demonstrates the feasibility of creating a federated NLP algorithm development and benchmarking platform to enhance multi-institution clinical NLP study and adoption. Although we use COVID-19 as a use case in this effort, our framework is general enough to be applied to other domains of interest in clinical NLP.
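For readers unfamiliar with the metric, the following is a minimal sketch of how an F-score can be computed from annotation counts. It assumes the balanced F1 variant (the harmonic mean of precision and recall), which the abstract does not specify; the function name and the counts in the usage note are made up for illustration and are not taken from the study.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-score (F1) from counts of extracted vs. gold annotations."""
    if true_pos == 0:
        # No correct extractions: precision and recall are both 0 (or undefined).
        return 0.0
    precision = true_pos / (true_pos + false_pos)  # share of extractions that are correct
    recall = true_pos / (true_pos + false_neg)     # share of gold annotations recovered
    return 2 * precision * recall / (precision + recall)
```

With illustrative counts of 8 true positives, 2 false positives, and 2 false negatives, precision and recall are both 0.8, so `f1_score(8, 2, 2)` is approximately 0.8.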
Uma nova tradução para Ovídio, Amores 1.8
Luiza dos Santos Souza
The present translation is part of a first revision of the translation work presented in a master's dissertation (SOUZA, 2016). The aim was to translate Ovid's elegiac couplets in a form that would echo the original rhythm of the Latin poem. To that end, the starting point was Carlos Alberto Nunes's proposal, the replacement of long syllables in princeps position with stressed syllables. In this translation, however, unlike Nunes's, the stressed syllable was also allowed to be followed by a single unstressed syllable, forming a trochee that can be performed as a spondee. In this way, the possibility of metrical variation was preserved, but dactyls were kept fixed in the fifth foot of the hexameter and in the second hemistich of the pentameter. The obligatory caesura of the dactylic pentameter was rendered by bringing two stressed syllables together and emphasized with obligatory spacing, which prompts the reader to restart the rhythm. To maintain the pace of the meter, devices such as stress shift, elision between the end of the hexameter and the beginning of the pentameter, and elisions in general were used. The present revision introduces changes to the hexameter, seeking to render its caesuras, though with the result of several feminine caesuras. In addition, the tone of the poem aims for greater lightness and fluency, through the elimination of radical hyperbata and the rearrangement of information within the couplet, which is the unit within the poem. The translated poem, the eighth of Book 1, presents the figure of the procuress, a stock character in Latin New Comedy, who is presented here as a witch who influences the beloved to extort her lovers and to scorn the poetic I, who, being a poet, has no valuable gifts to give other than his own poems.
Language and Literature, Translating and interpreting
Las consecuencias de la traducción automática neuronal en las lenguas de llegada
Pilar Sánchez-Gijón, Ramon Piqué Huerta
This article addresses the consequences of using neural machine translation trained on corpora obtained from translations of specific text genres produced in particular professional contexts. Phenomena such as translationese and post-editese, linked to the use of machine translation, point to the need to steer research towards new ways of approaching quality in professional translation. From the perspective of safeguarding languages as intangible cultural heritage, several questions are raised about the effect that the use of machine translation may have on target cultures and languages, in terms of perception and quality, and about how its use may come to alter the standard of the language. The article points to the need to define translation quality thresholds that go beyond aspects related to error detection, which is the type of correction with the greatest economic viability.
Translating and interpreting
Towards the Study of Morphological Processing of the Tangkhul Language
Mirinso Shadang, Navanath Saharia, Thoudam Doren Singh
There is little or no work on natural language processing for the Tangkhul language. The current work is a modest first step towards morphological processing of this language using an unsupervised approach. We use a small corpus collected from different sources: textbooks, short stories, and articles on other topics. In the experiments carried out, the morpheme identification task using Morfessor gives reasonable and interesting output despite the small corpus.
Experience Grounds Language
Yonatan Bisk, Ari Holtzman, Jesse Thomason
et al.
Language understanding research is held back by a failure to relate language to the physical world it describes and to the social interactions it facilitates. Despite the incredible effectiveness of language processing models to tackle tasks after being trained on text alone, successful linguistic communication relies on a shared experience of the world. It is this shared experience that makes utterances meaningful. Natural language processing is a diverse field, and progress throughout its development has come from new representational theories, modeling techniques, data collection paradigms, and tasks. We posit that the present success of representation learning approaches trained on large, text-only corpora requires the parallel tradition of research on the broader physical and social context of language to address the deeper questions of communication.
Legal Discourse: Text Analysis and Translation Strategies
Velykodska Olena
The article deals with the basic requirements for translation for specific purposes, namely legal translation. The problem posed here is defining the object and theoretical basis of legal translation. The question of the necessity of information search as an integral part of translation strategy is raised. Detailed analysis revealed that the requirements for professional translators include knowledge of the lexical and grammatical peculiarities of both languages in the legal sphere, and a deep understanding of the concepts employed by specialists in a particular field and of the specialist terms used to express these concepts and their relationships in the source and target languages. It is recommended that translations be evaluated on the following principles: communicative-pragmatic norms of translation; equivalence norms of translation; and absence of contextual, cultural, functional, and lexico-grammatical mistakes.
Language. Linguistic theory. Comparative grammar, Comparative law. International uniform law
Comparing morphological complexity of Spanish, Otomi and Nahuatl
Ximena Gutierrez-Vasques, Victor Mijangos
We use two small parallel corpora to compare the morphological complexity of Spanish, Otomi and Nahuatl. These languages belong to different linguistic families, and the latter two are low-resourced. We take into account two quantitative criteria: on the one hand, the distribution of types over tokens in a corpus; on the other, perplexity and entropy as indicators of word structure predictability. We show that a language can be complex in terms of how many different morphological word forms it can produce, yet less complex in terms of the predictability of the internal structure of its words.
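The two criteria named in this abstract can be illustrated with a small sketch. It is a deliberate simplification and not the authors' method: the type-token ratio is computed over whitespace-separated tokens, and entropy is taken over a plain unigram distribution of symbols, whereas the paper's entropy and perplexity figures may come from a more elaborate model of word structure. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def type_token_ratio(tokens: Sequence[str]) -> float:
    """Distinct word forms (types) divided by total word occurrences (tokens)."""
    return len(set(tokens)) / len(tokens)


def unigram_entropy(symbols: Sequence[Hashable]) -> float:
    """Shannon entropy, in bits per symbol, of the empirical unigram distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def perplexity(entropy_bits: float) -> float:
    """Perplexity corresponding to an entropy measured in bits."""
    return 2 ** entropy_bits


# Toy corpus (invented for illustration): 11 tokens, 5 distinct types.
tokens = "the cat saw the dog and the dog saw the cat".split()
ttr = type_token_ratio(tokens)

# Character-level entropy over the same toy corpus, ignoring spaces.
char_entropy = unigram_entropy([ch for tok in tokens for ch in tok])
```

A higher type-token ratio on comparable parallel text suggests more distinct word forms, while lower entropy or perplexity over sub-word symbols suggests more predictable internal word structure; the abstract's point is that these two measures need not agree for a given language.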