Convergent Representations of Linguistic Constructions in Human and Artificial Neural Systems
Pegah Ramezani, Thomas Kinfe, Andreas Maier
et al.
Understanding how the brain processes linguistic constructions is a central challenge in cognitive neuroscience and linguistics. Recent computational studies show that artificial neural language models spontaneously develop differentiated representations of Argument Structure Constructions (ASCs), generating predictions about when and how construction-level information emerges during processing. The present study tests these predictions in human neural activity using electroencephalography (EEG). Ten native English speakers listened to 200 synthetically generated sentences across four construction types (transitive, ditransitive, caused-motion, resultative) while neural responses were recorded. Analyses using time-frequency methods, feature extraction, and machine learning classification revealed construction-specific neural signatures emerging primarily at sentence-final positions, where argument structure becomes fully disambiguated, and most prominently in the alpha band. Pairwise classification showed reliable differentiation, especially between ditransitive and resultative constructions, while other pairs overlapped. Crucially, the temporal emergence and similarity structure of these effects mirror patterns in recurrent and transformer-based language models, where constructional representations arise during integrative processing stages. These findings support the view that linguistic constructions are neurally encoded as distinct form-meaning mappings, in line with Construction Grammar, and suggest convergence between biological and artificial systems on similar representational solutions. More broadly, this convergence is consistent with the idea that learning systems discover stable regions within an underlying representational landscape - recently termed a Platonic representational space - that constrains the emergence of efficient linguistic abstractions.
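As a rough illustration of the classification step described above (a hedged sketch on synthetic data, not the authors' actual EEG pipeline), pairwise decoding of construction types from band-power features can be approximated with a leave-one-out nearest-centroid classifier:

```python
# Toy sketch of pairwise construction decoding. All data are synthetic:
# "alpha-band power" features with a small class-specific offset standing in
# for a construction-specific neural signature at sentence-final position.
import numpy as np

rng = np.random.default_rng(0)

ditransitive = rng.normal(0.0, 1.0, size=(40, 16))  # 40 trials x 16 channels
resultative = rng.normal(0.6, 1.0, size=(40, 16))

X = np.vstack([ditransitive, resultative])
y = np.array([0] * 40 + [1] * 40)

def loo_nearest_centroid(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i  # hold out trial i
        c0 = X[mask & (y == 0)].mean(axis=0)
        c1 = X[mask & (y == 1)].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        correct += pred == y[i]
    return correct / len(y)

acc = loo_nearest_centroid(X, y)
print(f"pairwise decoding accuracy: {acc:.2f}")
```

With a genuine class difference in the features, accuracy rises well above the 0.5 chance level, which is the logic behind reading above-chance pairwise classification as evidence for construction-specific neural signatures.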
A fotografia na produção de narrativas: entre o verbal e o não-verbal, entre o visível e o não-visível [Photography in the Production of Narratives: Between the Verbal and the Non-Verbal, Between the Visible and the Non-Visible]
Alexandre Bergamo
The aim of this article is to understand the place that photography has come to occupy in the production of memories and narratives today. To that end, it draws on an articulation between research on photography developed in the Brazilian social sciences and more recent research on memory and narrative. The different supports (paper and screen), the forms of archiving and circulation used in both private and institutional contexts, and differences of class and gender are also of fundamental importance for understanding the use of photography as an essential part of the construction of memories and narratives, whether individual or collective. Starting from the articulation between the verbal and the non-verbal, between the visible and the non-visible, the article analyzes the forms of use and cultural transmission of photography; the strategies employed for its selection, archiving, and disposal; and, with that, the different demands surrounding photographic production and record, as well as the spaces of celebration used to legitimize photographers' work: museums, galleries, publishers, prizes, and the forms of exhibition made possible by the internet. Between private and collective uses of photography, the most varied "places of memory" are articulated and, consequently, the most diverse disputes over a "narrative" applied to the photographic image.
Reframing History through Discourse: A French Discourse Analysis of Vox’s Political Language
Nicola Riccardi
Abstract
This article investigates the epistemological convergence between French historiography and French Discourse Analysis (FDA), emphasizing their shared focus on language as a vehicle for historical meaning and ideological formation. Tracing the evolution from the Annales School to post-structuralist thinkers such as Foucault and LaCapra, the study highlights how history has increasingly embraced discourse as both a methodological tool and an object of analysis. Through a case study of Santiago Abascal, leader of Spain’s far-right party Vox, the article illustrates how FDA concepts – such as lexical worlds, interdiscursivity, and discursive ethos – can be applied to political rhetoric in the digital age. Using lexicometric methods (IraMuTeQ, SketchEngine) and qualitative interpretation, the study analyzes a corpus of Abascal’s tweets (2019-2022), focusing particularly on the term golpista as a strategic signifier of ideological antagonism. The findings show how far-right discourse reactivates historically loaded terms to construct binary oppositions and mobilize collective memory. Ultimately, the article argues that FDA offers historians a powerful framework to interrogate the linguistic production of identity, power, and historical continuity in contemporary political narratives.
Anthropology, Language. Linguistic theory. Comparative grammar
Humor and Resistance in Pakistani Corporate Memes
Aneeqa Ahmad, Musarrat Azher, Muhammad Asim Mahmood
This study investigates how humor is constructed in corporate memes through cognitive mechanisms, and how such memes serve as a collective voice through which Pakistani employees express discontent with their work culture and organizational practices. Based on Conceptual Blending Theory (Fauconnier & Turner, 2002), the paper analyzes Pakistani corporate memes selected from widely used platforms such as Instagram, Facebook, Reddit, and Pinterest. Drawing on Bakhtin's conception of the carnivalesque (Bakhtin, 1984), it highlights the significance of workplace humor in challenging dominant narratives, questioning power imbalances, and facilitating collective acknowledgement. The findings suggest that dynamic cognitive processes are responsible for the successful interpretation of humor in corporate memes, depending on the integration of distinct mental spaces evoked by the ideas present in the targeted memes. Further, workplace memes are shown to have the potential to express critique and assert new perspectives through techniques such as grotesque exaggeration, visual metaphor, imagery, hyperbole, and sarcasm. This article contributes to new debates on internet memes and their potential to galvanize action within the digital realm.
Language. Linguistic theory. Comparative grammar, Computational linguistics. Natural language processing
Бесермянско-коми изоглоссы как свидетельство былых контактов между народами [Besermyan-Komi Isoglosses as Evidence for Past Contacts between Peoples]
Sergey Maksimov
This research is devoted to the study of linguistic elements present in Besermyan and Komi but absent from Udmurt or most of its dialects. To date, the Besermyan language has not been the object of areal research. The purpose of the study is to identify and describe Besermyan-Komi isoglosses. In the course of our research, we found that these isoglosses were formed as a result of late secondary contacts between the Besermyans and a Komi population, and that they are not archaic elements of the Proto-Permian period. We believe that such contacts took place not only at the level of superficial social and linguistic interaction, but also through interethnic marital ties. The study of these isoglosses suggests that not only a Turkic component (now generally recognized) but also a Komi component took part in the ethnogenesis of the Besermyans.
Is linguistically-motivated data augmentation worth it?
Ray Groshan, Michael Ginn, Alexis Palmer
Data augmentation, a widely employed technique for addressing data scarcity, involves generating synthetic data examples which are then used to augment the available training data. Researchers have seen surprising success from simple methods, such as random perturbations of natural examples, where models seem to benefit even from data with nonsense words, or data that doesn't conform to the rules of the language. A second line of research produces synthetic data that does in fact follow all linguistic constraints; these methods require some linguistic expertise and are generally more challenging to implement. No previous work has systematically and empirically compared linguistically-naive and linguistically-motivated data augmentation strategies, leaving uncertainty about whether the additional time and effort of linguistically-motivated augmentation in fact yields better downstream performance. In this work, we conduct a careful and comprehensive comparison of augmentation strategies (both linguistically-naive and linguistically-motivated) for two low-resource languages with different morphological properties, Uspanteko and Arapaho. We evaluate the effectiveness of many different strategies and their combinations across two important sequence-to-sequence tasks for low-resource languages: machine translation and interlinear glossing. We find that linguistically-motivated strategies can have benefits over naive approaches, but only when the new examples they produce do not diverge significantly from the training data distribution.
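The "linguistically naive" end of this spectrum can be illustrated with a minimal sketch (a toy of my own, not the paper's implementation): random word swaps that may well produce sentences violating the rules of the language:

```python
# Hedged illustration of a linguistically-naive augmentation strategy:
# randomly swapping tokens, with no guarantee the result is grammatical.
import random

def random_swap_augment(tokens, n_swaps=1, seed=None):
    """Return a perturbed copy of a token list with n_swaps random swaps."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out

source = "the dog chased the cat".split()
augmented = [random_swap_augment(source, n_swaps=1, seed=s) for s in range(3)]
for sent in augmented:
    print(" ".join(sent))
```

A linguistically-motivated strategy would instead constrain such perturbations (e.g. swapping only constituents, or inflecting forms via a morphological grammar), which is exactly the extra effort whose payoff the paper measures.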
From Isolates to Families: Using Neural Networks for Automated Language Affiliation
Frederic Blum, Steffen Herbold, Johann-Mattis List
In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow that relies on manually comparing individual languages. Large-scale standardized collections of multilingual wordlists and grammatical language structures might help to improve this and open new avenues for developing automated language affiliation workflows. Here, we present neural network models that use lexical and grammatical data from a worldwide sample of more than 1,000 languages with known affiliations to classify individual languages into families. In line with the traditional assumption of most linguists, our results show that models trained on lexical data alone outperform models solely based on grammatical data, whereas combining both types of data yields even better performance. In additional experiments, we show how our models can identify long-ranging relations between entire subgroups, how they can be employed to investigate potential relatives of linguistic isolates, and how they can help us to obtain first hints on the affiliation of so far unaffiliated languages. We conclude that models for automated language affiliation trained on lexical and grammatical data provide comparative linguists with a valuable tool for evaluating hypotheses about deep and unknown language relations.
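A heavily simplified sketch of the underlying task (synthetic features and a trivial nearest-neighbour rule standing in for the authors' neural networks; all names and numbers are invented):

```python
# Toy language-affiliation classifier: each language is a binary feature
# vector (think lexical cognate classes plus grammatical features), and
# each invented "family" has its own feature prototype.
import numpy as np

rng = np.random.default_rng(1)

prototypes = rng.random(size=(3, 40))  # 3 families x 40 features

def sample_language(family):
    """Draw a language's feature vector around its family prototype."""
    return (rng.random(40) < prototypes[family]).astype(float)

train_X = np.array([sample_language(f) for f in range(3) for _ in range(20)])
train_y = np.repeat(np.arange(3), 20)

def classify(x):
    """Assign a language to the family of its nearest training neighbour."""
    return train_y[np.argmin(np.linalg.norm(train_X - x, axis=1))]

# Accuracy on 30 freshly sampled "held-out" languages.
probes = [(f, sample_language(f)) for f in range(3) for _ in range(10)]
acc = float(np.mean([classify(x) == f for f, x in probes]))
print(f"held-out family assignment accuracy: {acc:.2f}")
```

The same framing extends naturally to isolates and unaffiliated languages: a candidate vector that lies far from every family's region is evidence against any known affiliation.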
Mind the Gap: Assessing Wiktionary's Crowd-Sourced Linguistic Knowledge on Morphological Gaps in Two Related Languages
Jonathan Sakunkoo, Annabella Sakunkoo
Morphological defectivity is an intriguing and understudied phenomenon in linguistics. Addressing defectivity, where expected inflectional forms are absent, is essential for improving the accuracy of NLP tools in morphologically rich languages. However, traditional linguistic resources often lack coverage of morphological gaps, as such knowledge requires significant human expertise and effort to document and verify. For scarce linguistic phenomena in under-explored languages, Wikipedia and Wiktionary often serve as among the few accessible resources; despite their extensive reach, their reliability has been a subject of controversy. This study customizes a novel neural morphological analyzer to annotate Latin and Italian corpora. Using this large body of annotated data, crowd-sourced lists of defective verbs compiled from Wiktionary are validated computationally. Our results indicate that while Wiktionary provides a highly reliable account of Italian morphological gaps, 7% of Latin lemmata listed as defective show strong corpus evidence of being non-defective. This discrepancy highlights the potential limitations of crowd-sourced wikis as definitive sources of linguistic knowledge, particularly for less-studied phenomena and languages, despite their value as resources for rare linguistic features. By providing scalable tools and methods for quality assurance of crowd-sourced data, this work advances computational morphology and expands linguistic knowledge of defectivity in non-English, morphologically rich languages.
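The validation logic can be caricatured in a few lines (lemmata, counts, and the evidence threshold are all invented for illustration; the study's actual criteria are not reproduced here):

```python
# Toy validation of a crowd-sourced defectivity list: a verb listed as
# defective is disputed if the corpus attests many tokens of the forms
# that should be missing.
wiktionary_defective = {"lemma_a", "lemma_b", "lemma_c"}

# lemma -> number of corpus tokens bearing the supposedly missing forms
attested_gap_forms = {"lemma_a": 0, "lemma_b": 27, "lemma_c": 1}

MIN_EVIDENCE = 5  # assumed cutoff for "strong corpus evidence"

confirmed = {l for l in wiktionary_defective
             if attested_gap_forms.get(l, 0) < MIN_EVIDENCE}
disputed = wiktionary_defective - confirmed

print(f"confirmed defective: {sorted(confirmed)}")
print(f"corpus evidence of non-defectivity: {sorted(disputed)}")
```

The 7% figure reported above corresponds to the share of Latin lemmata that end up in the "disputed" set under the authors' (real) evidence criteria.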
Géneros, cuerpos y colonialidad en la Capoeira en Barcelona: “La mujer tiene que hacerlo mejor que el hombre para llegar al mismo sitio” [Genders, Bodies and Coloniality in Capoeira in Barcelona: "A Woman Has to Do It Better Than a Man to Get to the Same Place"]
María Laura Schaufler, Evelyn Ríos
Capoeira is a manifestation of Afro-Brazilian popular sport and culture in which fighting movements and bodily expression or dance are interwoven over a rhythmic musical base. It formed part of the struggle for liberation and resistance against the commodification of racialized bodies in colonial Brazil, and today it is promoted as a practice symbolizing bodily freedom. Nevertheless, it is inhabited by discourses and practices that reproduce oppressive and patriarchal logics, as well as a hegemonic model of physical education and sport: elitist, sexist, racist, and classist. In this article we seek to reflect on the possibilities and limits for transforming the circulating discourses and practices around genders and corporealities.
For this study we selected the practice of Capoeira in the Barcelona region, in which 38 people participated: 34 practitioners responded to a qualitative questionnaire and four teachers were interviewed in depth. Data collection was carried out over two months at events and classes in Barcelona (June-July 2022) through semi-structured interviews and participant observation.
Among the emerging themes we found questionings of: a) gender binarism; b) discrimination around bodies; c) roles and functions in groups subject to gender discrimination; d) harassment; e) territorial violence. The daily practice of Capoeira perpetuates machista norms. From the fieldwork results, a first interpretation regarding genders in Capoeira can be ventured: achieving equality does not mean conforming to masculinist parameters, whether at the level of management and teaching or of play and everyday practice. Moreover, it is necessary to challenge both the myth of virility and the European androcentrism that claims that machista practices come exclusively from Brazilian teachers.
Identificador permanente (ARK): http://id.caicyt.gov.ar/ark:/s18535925/1ruhoetg1
Communication. Mass media
Development of graphomotor skills in senior preschoolers: a neuropsychological aspect
A. A. Vorobyova, L. N. Makarova
The aim of the study is to synthesize the available interdisciplinary data on the formation of graphomotor skills in older preschoolers and to use them as a basis for describing how these skills develop in left-handed children. The essence and structure of graphomotor skills as a component of written activity are examined: within the structure of writing, graphomotor skills constitute the final link of writing activity and are associated with the drawing of graphic symbols. Drawing on neuropsychological data on interhemispheric asymmetry in children, the study substantiates the need to develop graphomotor skills in older preschoolers with regard to the child's dominant lateralization profile. N.A. Bernstein's theory of the level-based organization of movements serves as the framework for skill development. Noting the multilevel organization of written activity, attention is focused on the role of gross motor skills, posture maintenance, and other components of the background levels, without which the execution of the graphic skill itself is difficult. A scheme is proposed that reflects indicators of graphomotor skill formation in older preschoolers, with the following leading components: general motor readiness, visual-object gnosis and visual-motor coordination, and elementary graphic skill. The features of pedagogical work on forming graphomotor skills in left-handed children are described. The novelty of the study lies in modeling a system of work on the formation of this component of written activity based on Bernstein's theory of the level-based organization of movements. The proposed system model can be used to develop graphomotor skills in left-handed children, which constitutes its practical significance.
Education (General), Philology. Linguistics
Analysis and Visualization of Linguistic Structures in Large Language Models: Neural Representations of Verb-Particle Constructions in BERT
Hassane Kissane, Achim Schilling, Patrick Krauss
This study investigates the internal representations of verb-particle combinations within transformer-based large language models (LLMs), specifically examining how these models capture lexical and syntactic nuances at different neural network layers. Employing the BERT architecture, we analyse the representational efficacy of its layers for various verb-particle constructions such as 'agree on', 'come back', and 'give up'. Our methodology includes a detailed dataset preparation from the British National Corpus, followed by extensive model training and output analysis through techniques like multi-dimensional scaling (MDS) and generalized discrimination value (GDV) calculations. Results show that BERT's middle layers most effectively capture syntactic structures, with significant variability in representational accuracy across different verb categories. These findings challenge the conventional uniformity assumed in neural network processing of linguistic elements and suggest a complex interplay between network architecture and linguistic representation. Our research contributes to a better understanding of how deep learning models comprehend and process language, offering insights into the potential and limitations of current neural approaches to linguistic analysis. This study not only advances our knowledge in computational linguistics but also prompts further research into optimizing neural architectures for enhanced linguistic precision.
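A commonly cited formulation of the generalized discrimination value (GDV) z-scores each dimension, applies a 0.5 scaling, and compares mean intra-class to mean inter-class distances; more negative values indicate better class separation. The sketch below uses synthetic "activations" and this assumed formulation, not data or code from the study:

```python
# Hedged GDV sketch on synthetic data (formulation assumed, see lead-in).
import numpy as np

def gdv(X, labels):
    """Generalized discrimination value: z-score each dimension (scaled by
    0.5), then subtract the mean inter-class distance from the mean
    intra-class distance, normalized by sqrt(dim)."""
    X = np.asarray(X, dtype=float)
    X = 0.5 * (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    labels = np.asarray(labels)
    classes = np.unique(labels)

    def mean_dist(A, B=None):
        if B is None:  # mean pairwise distance within one set
            d = np.linalg.norm(A[:, None] - A[None, :], axis=-1)
            return d[np.triu_indices(len(A), k=1)].mean()
        return np.linalg.norm(A[:, None] - B[None, :], axis=-1).mean()

    intra = np.mean([mean_dist(X[labels == c]) for c in classes])
    inter = np.mean([mean_dist(X[labels == a], X[labels == b])
                     for i, a in enumerate(classes) for b in classes[i + 1:]])
    return (intra - inter) / np.sqrt(X.shape[1])

rng = np.random.default_rng(0)
sep = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(5, 1, (50, 8))])
mixed = rng.normal(0, 1, (100, 8))
y = np.array([0] * 50 + [1] * 50)

g_sep = gdv(sep, y)    # well-separated classes -> clearly negative
g_mix = gdv(mixed, y)  # arbitrary labels on one cloud -> near zero
print(f"separated: {g_sep:.2f}, mixed: {g_mix:.2f}")
```

Layer-wise profiles of such a score are what allow statements like "BERT's middle layers most effectively capture syntactic structure."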
Language Models as Models of Language
Raphaël Millière
This chapter critically examines the potential contributions of modern language models to theoretical linguistics. Despite their focus on engineering goals, these models' ability to acquire sophisticated linguistic knowledge from mere exposure to data warrants a careful reassessment of their relevance to linguistic theory. I review a growing body of empirical evidence suggesting that language models can learn hierarchical syntactic structure and exhibit sensitivity to various linguistic phenomena, even when trained on developmentally plausible amounts of data. While the competence/performance distinction has been invoked to dismiss the relevance of such models to linguistic theory, I argue that this assessment may be premature. By carefully controlling learning conditions and making use of causal intervention methods, experiments with language models can potentially constrain hypotheses about language acquisition and competence. I conclude that closer collaboration between theoretical linguists and computational researchers could yield valuable insights, particularly in advancing debates about linguistic nativism.
Teaching Linguistic Justice through Augmented Reality
Ashvini Varatharaj, Abigail Welch, Mary Bucholtz
et al.
This position paper presents the AR Language Map, a speculative artifact designed to enhance understanding of linguistic justice among middle and high school students through augmented reality (AR) that allows students to map their linguistic experiences. Through a social justice-oriented academic outreach program aimed at linguistically, economically, and racially minoritized students, academic concepts on language, culture, race, and power are introduced to California middle school and high school students. The curriculum has activities for each lesson plan drawn from students' culturally relevant experiences. By enabling interactive exploration of linguistic justice, this tool aims to foster empathy, challenge linguistic racism, and valorize linguistic diversity. We discuss its conceptualization within the broader context of AR in social justice education. The AR Language Map not only deepens students' understanding of these critical issues but also enables them to become co-creators of their learning experiences.
modeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models
Nathan A. Chi, Teodor Malchev, Riley Kong
et al.
We introduce modeLing, a novel benchmark of Linguistics Olympiad-style puzzles which tests few-shot reasoning in AI systems. Solving these puzzles necessitates inferring aspects of a language's grammatical structure from a small number of examples. Such puzzles provide a natural testbed for language models, as they require compositional generalization and few-shot inductive reasoning. Consisting solely of new puzzles written specifically for this work, modeLing has no risk of appearing in the training data of existing AI systems: this ameliorates the risk of data leakage, a potential confounder for many prior evaluations of reasoning. Evaluating several large open source language models and GPT on our benchmark, we observe non-negligible accuracy, demonstrating few-shot emergent reasoning ability which cannot merely be attributed to shallow memorization. However, imperfect model performance suggests that modeLing can be used to measure further progress in linguistic reasoning.
Rivista Italiana di Onomastica 28. (2022)
Zsuzsanna Fábián
Large Linguistic Models: Investigating LLMs' metalinguistic abilities
Gašper Beguš, Maksymilian Dąbkowski, Ryan Rhodes
The performance of large language models (LLMs) has recently improved to the point where they perform well on many language tasks. We show here that, for the first time, these models can also generate valid metalinguistic analyses of language data. We outline a research program in which the behavioral interpretability of LLMs on these tasks is tested via prompting. LLMs are trained primarily on text; as such, evaluating their metalinguistic abilities improves our understanding of their general capabilities and sheds new light on theoretical models in linguistics. We show that OpenAI's (2024) o1 vastly outperforms other models on tasks involving drawing syntactic trees and phonological generalization. We speculate that o1's unique advantage over other models may result from its chain-of-thought mechanism, which mimics the structure of human reasoning used in complex cognitive tasks such as linguistic analysis.
Le sens de fête en polonais, en lituanien, en français et sa (non)coïncidence collocationnelle [The Meaning of fête in Polish, Lithuanian and French and Its Collocational (Non)coincidence]
Vitalija Kazlauskienė, Agnieszka Dryjańska
A linguistic overview of the word holiday in three languages (French, Lithuanian, and Polish) is promising for an intercultural approach to teaching French as a foreign language, with a view to going beyond the largely monocultural contexts of Poland and Lithuania. The research is based on text corpora in these three languages. Its objective is to analyse the linguistic images of the word holiday and its Lithuanian and Polish equivalents and to examine their collocational (non)coincidence, in order to systematize the teaching and learning of collocations for learners of French. The aim is to help students retain meaning and lexical association simultaneously, to consolidate the structures they already partially know, and to discover (inter)cultural aspects.
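Collocational (non)coincidence across corpora is typically quantified with association measures; the following is a minimal sketch using a toy English corpus and standard pointwise mutual information (PMI), not the study's actual data or tooling:

```python
# Toy collocation scoring: PMI of words co-occurring with a node word
# within a fixed window, computed from raw corpus counts.
import math
from collections import Counter

corpus = ("we celebrate the holiday . the family gathers for the holiday . "
          "a national holiday is observed . we celebrate together").split()

window = 2
unigrams = Counter(corpus)
pairs = Counter()
for i, w in enumerate(corpus):
    if w == "holiday":
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j != i:
                pairs[corpus[j]] += 1

N = len(corpus)

def pmi(collocate):
    """log2 of observed vs. expected co-occurrence probability."""
    p_joint = pairs[collocate] / N
    return math.log2(p_joint / ((unigrams["holiday"] / N)
                                * (unigrams[collocate] / N)))

scores = {w: round(pmi(w), 2) for w in pairs}
print(sorted(scores, key=scores.get, reverse=True)[:3])
```

Comparing such collocate rankings for holiday, święto, and šventė across comparable corpora is one concrete way to operationalize the "(non)coincidence" the study examines.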
Investigating the efficacy of the Egyptian Data Protection Law on Media Freedom: Journalists’ perceptions
Miral-Sabry AlAshry
The purpose of this study is to investigate the effectiveness of the Egyptian Personal Data Protection Law No. 151 of 2020 and its implications for journalistic practice. More specifically, the focal point of the study is how Egyptian journalists interpret the law and its implications for press freedom in Egypt. The underpinning theoretical framework was informed by the authoritarian school of thought. Questionnaires were distributed to 199 journalists from both independent and semi-governmental outlets representing thirteen official newspapers of Egypt, while in-depth interviews were conducted with three editors, four journalists, and three human rights lawyers. The findings indicate that the government places restrictions on journalists by applying the Data Protection Law to the media, and that the law negatively impacts journalists and media houses. It was clear from the findings that journalists see the law as an obstacle to media independence, as it allows the government to exercise greater information control through digital policy and imposes regulatory rules against journalists.
Communication. Mass media, Advertising
CLSE: Corpus of Linguistically Significant Entities
Aleksandr Chuklin, Justin Zhao, Mihir Kale
One of the biggest challenges of natural language generation (NLG) is the proper handling of named entities. Named entities are a common source of grammar mistakes such as wrong prepositions, wrong article handling, or incorrect entity inflection. Without factoring in linguistic representation, such errors are often underrepresented when evaluating on a small set of arbitrarily picked argument values, or when translating a dataset from a linguistically simpler language, like English, to a linguistically complex language, like Russian. However, for some applications, precise grammatical correctness is critical: native speakers may find entity-related grammar errors silly, jarring, or even offensive. To enable the creation of more linguistically diverse NLG datasets, we release a Corpus of Linguistically Significant Entities (CLSE) annotated by expert linguists. The corpus includes 34 languages and covers 74 different semantic types to support various applications from airline ticketing to video games. To demonstrate one possible use of CLSE, we produce an augmented version of the Schema-Guided Dialog Dataset, SGD-CLSE. Using the CLSE's entities and a small number of human translations, we create a linguistically representative NLG evaluation benchmark in three languages: French (high-resource), Marathi (low-resource), and Russian (highly inflected). We establish quality baselines for neural, template-based, and hybrid NLG systems and discuss the strengths and weaknesses of each approach.
Linguistically inspired roadmap for building biologically reliable protein language models
Mai Ha Vu, Rahmad Akbar, Philippe A. Robert
et al.
Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, being largely black-box models and thus challenging to interpret, current protein LM approaches do not contribute to a fundamental understanding of sequence-function mappings, hindering rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in analytical rule extraction from natural language data, can aid with building more interpretable protein LMs that are more likely to learn relevant domain-specific rules. Differences between protein sequence data and linguistic sequence data require the integration of more domain-specific knowledge in protein LMs compared to natural language LMs. Here, we provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation. Incorporating linguistic ideas into protein LMs enables the development of next-generation interpretable machine-learning models with the potential of uncovering the biological mechanisms underlying sequence-function relationships.