Enhancing Linguistic Generalization of VLA: Fine-Tuning OpenVLA via Synthetic Instruction Augmentation
Dongik Shin
Generalization remains a core challenge in embodied AI, as robots must adapt to diverse environments. While OpenVLA represents the state of the art (SOTA) in Vision-Language-Action models by leveraging large-scale pre-training, its zero-shot performance can be limited when it encounters completely new environments. This paper proposes a parameter-efficient fine-tuning strategy to enhance the linguistic generalization of OpenVLA by synthesizing a general instruction set for the Bridge Dataset V2. The paper leverages a Large Language Model (LLM) to generate a rich variety of semantically equivalent but structurally diverse commands for existing trajectories. In this experiment, Low-Rank Adaptation (LoRA) is used to fine-tune OpenVLA on the augmented pairs, allowing the model to bridge the gap between complex natural-language intent and robotic actions. Results demonstrate the LoRA-enhanced model's improved robustness, suggesting that enriching the linguistic space of specialized datasets is crucial for embodied agents.
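The augmentation step this abstract describes can be sketched as follows; `paraphrase` here is a placeholder for the LLM call that produces command variants (a hypothetical interface for illustration, not OpenVLA's or the paper's actual API):

```python
def augment_instructions(trajectories, paraphrase, n_variants=3):
    """Pair each trajectory with semantically equivalent instruction variants.

    trajectories: iterable of (trajectory_id, original_instruction) pairs.
    paraphrase:   stand-in for the LLM call returning alternative phrasings
                  of a command (hypothetical interface, not the paper's API).
    """
    pairs = []
    for traj_id, instruction in trajectories:
        pairs.append((traj_id, instruction))        # keep the original pair
        for variant in paraphrase(instruction)[:n_variants]:
            pairs.append((traj_id, variant))        # add synthetic variants
    return pairs
```

The resulting instruction-trajectory pairs would then feed a parameter-efficient fine-tuning run (e.g. LoRA adapters on the OpenVLA backbone); the pairing logic above is the part that is independent of any particular training library.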
Paradoxes in Generation Z’s Media Consumption and Fake News Perception: An International Comparative Study
Ana Pérez-Escoda, Cristina Ruiz-Poveda
Information consumption by young audiences in the internet of algorithms has become a timely issue in the public sphere. This comparative study examines media consumption among generation Z and their perceptions of news and misinformation from an international perspective. More precisely, this article explores the types of information young people consume, the media outlets they prefer for information, whether they prefer receiving information selected by algorithms, the credibility they grant different outlets, and their reception and perception of fake news. The methodological design was based on a quantitative approach focusing on descriptive and correlational analysis. Using a questionnaire, data were collected from a total sample of 405 university students from three countries: Spain, the United States, and Portugal. The results show that, across the three countries, the sample is predominantly informed by digital media and consumes mainly information on leisure and entertainment, humor and memes, music, and news. Reportedly, fake news does not appear to be a significant factor in their information consumption. This study reveals new, paradoxical findings about media credibility and the perception of fake news among generation Z. In the three countries, subjects report that their most used media outlets for information are also the least trusted: social media and websites. There are significant differences by country regarding the extent to which young people like to receive information chosen by algorithms. Overall, while the study detects significant differences across countries, it also hints at a degree of homogenization in media consumption habits worldwide.
Communication. Mass media
AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding
Yidan Wang, Chenyi Zhuang, Wutao Liu
et al.
Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.
Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning
Xueyi Ke, Satoshi Tsutsui, Yayun Zhang
et al.
Infants develop complex visual understanding rapidly, even preceding the acquisition of linguistic skills. As computer vision seeks to replicate the human vision system, understanding infant visual development may offer valuable insights. In this paper, we present an interdisciplinary study exploring this question: can a computational model that imitates the infant learning process develop broader visual concepts that extend beyond the vocabulary it has heard, similar to how infants naturally learn? To investigate this, we analyze a recently published model in Science by Vong et al., which is trained on longitudinal, egocentric images of a single child paired with transcribed parental speech. We perform neuron labeling to identify visual concept neurons hidden in the model's internal representations. We then demonstrate that these neurons can recognize objects beyond the model's original vocabulary. Furthermore, we compare the differences in representation between infant models and modern computer vision models, such as CLIP and ImageNet pre-trained models. Ultimately, our work bridges cognitive science and computer vision by analyzing the internal representations of a computational model trained on an infant's visual and linguistic inputs. The project page is available at https://kexueyi.github.io/webpage-discover-hidden-visual-concepts.
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Yana Wei, Liang Zhao, Jianjian Sun
et al.
The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.
taz2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades
Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher
et al.
Open-access corpora are essential for advancing natural language processing (NLP) and computational social science (CSS). However, large-scale resources for German remain limited, restricting research on linguistic trends and societal issues such as gender bias. We present taz2024full, the largest publicly available corpus of German newspaper articles to date, comprising over 1.8 million texts from taz, spanning 1980 to 2024. As a demonstration of the corpus's utility for bias and discrimination research, we analyse gender representation across four decades of reporting. We find a consistent overrepresentation of men, but also a gradual shift toward more balanced coverage in recent years. Using a scalable, structured analysis pipeline, we provide a foundation for studying actor mentions, sentiment, and linguistic framing in German journalistic texts. The corpus supports a wide range of applications, from diachronic language analysis to critical media studies, and is freely available to foster inclusive and reproducible research in German-language NLP.
NSL-MT: Linguistically Informed Negative Samples for Efficient Machine Translation in Low-Resource Languages
Mamadou K. Keita, Christopher Homan, Huy Le
We introduce Negative Space Learning MT (NSL-MT), a training method that teaches models what not to generate by encoding linguistic constraints as severity-weighted penalties in the loss function. NSL-MT augments limited parallel data with synthetically generated violations of target-language grammar, explicitly penalizing the model when it assigns high probability to these linguistically invalid outputs. We demonstrate that NSL-MT delivers improvements across all architectures: 3-12% BLEU gains for well-performing models and 56-89% gains for models lacking decent initial support. Furthermore, NSL-MT provides a 5x data-efficiency multiplier: training with 1,000 examples matches or exceeds normal training with 5,000 examples. Thus, NSL-MT offers a data-efficient alternative training method for settings with limited annotated parallel corpora.
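One illustrative reading of "severity-weighted penalties in the loss function" is sketched below; this is an interpretation under stated assumptions, not the paper's exact formulation. The model pays the usual negative log-likelihood on the gold translation, plus a penalty for any probability mass it places on the synthetic grammar violations, scaled by how severe each violation is:

```python
import math

def nsl_loss(logp_gold, logp_negatives, severities, alpha=1.0):
    """Sketch of an NSL-MT-style objective (illustrative, not the paper's
    exact loss): NLL on the reference translation plus severity-weighted
    penalties for probability assigned to ungrammatical negative samples.

    logp_gold:      model log-probability of the reference translation.
    logp_negatives: log-probabilities of the violation samples.
    severities:     weight per violation (graver errors weigh more).
    """
    nll = -logp_gold
    penalty = sum(s * math.exp(lp)              # probability of a violation
                  for lp, s in zip(logp_negatives, severities))
    return nll + alpha * penalty
```

With no negative samples the expression reduces to ordinary maximum-likelihood training, which matches the framing of NSL-MT as an augmentation of standard training rather than a replacement.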
Media Manipulations in the Coverage of Events of the Ukrainian Revolution of Dignity: Historical, Linguistic, and Psychological Approaches
Ivan Khoma, Solomia Fedushko, Zoryana Kunch
This article examines the use of manipulation in mass-media coverage of the events of the Ukrainian Revolution of Dignity, specifically in the content of the online newspapers Ukrainian Truth (Ukrainska pravda), High Castle (Vysokyi Zamok), and ZIK during the public protests. The contents of these online newspapers are studied using historical, linguistic, and psychological approaches, and media manipulations in the coverage of the events of the Revolution of Dignity are examined. Internet resources that cover news are analyzed, and the most current and popular resources are identified. The content of the online newspapers is analyzed and statistically processed, and their Internet content is classified by level of data significance (very significant, significant, and insignificant data). An algorithm for detecting media manipulations in coverage of the course of the Ukrainian revolutions, based on historical, linguistic, and psychological approaches, is designed, and methods of counteracting information attacks in online newspapers are developed.
Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling
Xu Wang, Yifan Li, Qiudan Zhang
et al.
Learning to build 3D scene graphs is essential for real-world perception in a structured and rich fashion. However, previous 3D scene graph generation methods utilize a fully supervised learning manner and require a large amount of entity-level annotation data of objects and relations, which is extremely resource-consuming and tedious to obtain. To tackle this problem, we propose 3D-VLAP, a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling. Specifically, our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds. First, we establish the positional correspondence from 3D point clouds to 2D images via camera intrinsic and extrinsic parameters, thereby achieving alignment of 3D point clouds and 2D images. Subsequently, a large-scale cross-modal visual-linguistic model is employed to indirectly align 3D instances with the textual category labels of objects by matching 2D images with object category labels. The pseudo labels for objects and relations are then produced for 3D-VLAP model training by calculating the similarity between visual embeddings and textual category embeddings of objects and relations encoded by the visual-linguistic model, respectively. Ultimately, we design an edge self-attention based graph neural network to generate scene graphs of 3D point cloud scenes. Extensive experiments demonstrate that our 3D-VLAP achieves comparable results with current advanced fully supervised methods, meanwhile significantly alleviating the pressure of data annotation.
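The pseudo-labeling step described above, producing labels by comparing visual and textual category embeddings, can be sketched in a simplified form; real 3D-VLAP embeddings would come from a large visual-linguistic model, whereas plain lists stand in for them here:

```python
def pseudo_labels(visual_embs, text_embs, labels):
    """Assign each instance the category whose text embedding is most
    similar (cosine similarity) to its visual embedding. A simplified
    sketch of similarity-based pseudo-labeling; the embeddings themselves
    would come from a pretrained visual-linguistic model.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)

    out = []
    for v in visual_embs:
        sims = [cos(v, t) for t in text_embs]
        # Pick the category label with the highest similarity score.
        out.append(labels[max(range(len(sims)), key=sims.__getitem__)])
    return out
```

The resulting labels are noisy by construction, which is why the paper treats them as weak supervision for training rather than as ground truth.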
Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency
Abeba Birhane, Marek McGann
In this paper we argue that key, often sensational and misleading, claims regarding linguistic capabilities of Large Language Models (LLMs) are based on at least two unfounded assumptions: the assumption of language completeness and the assumption of data completeness. Language completeness assumes that a distinct and complete thing such as `a natural language' exists, the essential characteristics of which can be effectively and comprehensively modelled by an LLM. The assumption of data completeness relies on the belief that a language can be quantified and wholly captured by data. Work within the enactive approach to cognitive science makes clear that, rather than a distinct and complete thing, language is a means or way of acting. Languaging is not the kind of thing that can admit of a complete or comprehensive modelling. From an enactive perspective we identify three key characteristics of enacted language: embodiment, participation, and precariousness. These are absent in LLMs, and likely incompatible in principle with current architectures. We argue that these absences imply that LLMs are not now and cannot in their present form be linguistic agents the way humans are. We illustrate the point in particular through the phenomenon of `algospeak', a recently described pattern of high stakes human language activity in heavily controlled online environments. On the basis of these points, we conclude that sensational and misleading claims about LLM agency and capabilities emerge from a deep misconception of both what human language is and what LLMs are.
Utopian semiospheres: Isolation and dialogue across borders
Artur Blaim, Ludmiła Gruszewska-Blaim
The paper applies Juri Lotman’s theory of the semiosphere to utopian studies with the aim of identifying the principal components and mode of functioning of classic utopian discourse. Without questioning the ultimate result of any interaction within a utopian dialogic situation, which necessarily demonstrates the superiority of the ideal world (no-place/ou-topos) over the external world of imperfection (e.g. Europe or America), it is argued that the encounter between the utopian and non-utopian semiospheres offers an interesting starting point for a discussion of intercultural translation and dialogicity involving two different mechanisms of sign production. Contrary to its ‘real-world’ counterpart, where the sign production is governed by asymmetry, binarism, replacement, and diversification, the utopian semiosphere relies on the truthfulness of signs, all-encompassing semioticity, unifying enhancement, and homogeneity. The hyperbolization of the opposition between the ideal state and the external world is metonymically reflected in the construction of the utopian state itself, with its centre and periphery radically polarized and separated by the impassable internal boundary. Although typical representations of the external utopian boundary foreground its distinctly separative function, multiple acts of the intercultural exchange between representatives of the two semiospheres expose the boundary’s translatory function.
Language. Linguistic theory. Comparative grammar
ECHo: A Visio-Linguistic Dataset for Event Causality Inference via Human-Centric Reasoning
Yuxi Xie, Guanzhen Li, Min-Yen Kan
We introduce ECHo (Event Causality Inference via Human-Centric Reasoning), a diagnostic dataset of event causality inference grounded in visio-linguistic social scenarios. ECHo employs real-world human-centric deductive information building on a television crime drama. ECHo requires the Theory-of-Mind (ToM) ability to understand and reason about social interactions based on multimodal information. Using ECHo, we propose a unified Chain-of-Thought (CoT) framework to assess the reasoning capability of current AI systems. Our ToM-enhanced CoT pipeline accommodates various large foundation models in both zero-shot and few-shot visio-linguistic reasoning. We use this framework to scrutinize recent large foundation models such as InstructGPT and MiniGPT-4 on three diagnostic human-centric tasks. Further analysis demonstrates ECHo as a challenging dataset to expose imperfections and inconsistencies in reasoning. Our data and code are publicly available at https://github.com/YuxiXie/ECHo.
Information-Theoretic Characterization of Vowel Harmony: A Cross-Linguistic Study on Word Lists
Julius Steuer, Badr Abdullah, Johann-Mattis List
et al.
We present a cross-linguistic study that aims to quantify vowel harmony using data-driven computational modeling. Concretely, we define an information-theoretic measure of harmonicity based on the predictability of vowels in a natural language lexicon, which we estimate using phoneme-level language models (PLMs). Prior quantitative studies have relied heavily on inflected word-forms in the analysis of vowel harmony. We instead train our models using cross-linguistically comparable lemma forms with little or no inflection, which enables us to cover more under-studied languages. Training data for our PLMs consists of word lists with a maximum of 1000 entries per language. Despite the fact that the data we employ are substantially smaller than previously used corpora, our experiments demonstrate that the neural PLMs capture vowel harmony patterns in a set of languages that exhibit this phenomenon. Our work also demonstrates that word lists are a valuable resource for typological research, and offers new possibilities for future studies on low-resource, under-studied languages.
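The core idea, harmonicity as vowel predictability in the lexicon, can be sketched with a smoothed bigram model over vowel tiers standing in for the paper's neural PLMs; the vowel inventory below is an illustrative assumption, not the paper's:

```python
import math
from collections import Counter

def vowel_surprisal(lexicon, vowels=frozenset("aeiou\u00e4\u00f6\u00fc")):
    """Average surprisal (bits) of each vowel given the preceding vowel,
    using add-one-smoothed bigram counts over a word list. Lower surprisal
    means vowels are more predictable, the property the paper's measure
    links to stronger harmony. (Bigram stand-in for neural PLMs; the vowel
    inventory is illustrative.)
    """
    tiers = [[c for c in word if c in vowels] for word in lexicon]
    bigrams, contexts = Counter(), Counter()
    for tier in tiers:
        for prev, cur in zip(tier, tier[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    size = len(vowels)
    total, n = 0.0, 0
    for tier in tiers:
        for prev, cur in zip(tier, tier[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + size)
            total += -math.log2(p)
            n += 1
    return total / n if n else float("nan")
```

On a toy word list where vowels within a word always agree, this score comes out lower (more predictable) than on an otherwise matched list with disagreeing vowels, which is the direction of effect the measure is designed to detect.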
Urhobo-English Loanwords Coda Adaptation: A Constraint-Based Account
Philip Oghenesuowho Ekiugbo, Cecilia Amaoge Eme
This study examines how the codas of Urhobo-English loanwords are adapted and shows that the strategies adopted in repairing loanword codas in Urhobo are driven by syllabification constraints and universal conventions. Syllabification conditions in languages that forbid filled codas will require that all the consonant sounds in a phonological word that are to be found in the phonetic string are parsed as onsets. Assuming this is true, it has implications for loanword adaptation. Urhobo exclusively permits the open syllable type. Implicitly, all the coda elements of loanwords are likely to be licensed as onsets, which may result in a possible ‘illicit’ onset cluster given the onset condition requirement of the language. Accordingly, this study examines the attested patterns of adaptation of English codas in Urhobo loanwords, along with their motivations and implications. The discussion is built around the theory of constraints and repair strategies.
Education (General), Language acquisition
The Congolese Magistrate Confronting the Laundering of Dirty Money (Le magistrat congolais face au blanchiment d’argent sale)
Carlos NGWAPITSHI NGWAMASHI
Abstract: Our study, the results of which are presented here in article form, concerns "the Congolese magistrate confronting the laundering of dirty money". It answers the question of whether the Congolese magistrate can be classed among those engaged in laundering dirty money. After analysis, an opinion emerged: a distinction must be drawn. Since corruption is prohibited under Congolese positive law, any judge or magistrate who resorts to this practice falls within the scope of the laundering of dirty money, which we call "laundering of opportunity". It should be noted, however, that observation of the situation shows the Congolese judicial apparatus to be riddled with an evil, namely corruption and even extortion, which is why, for some time now, from 2010 to the present, there has been an abundance of proceedings for judicial misconduct (prise à partie) before the Court of Cassation for fraud or denial of justice.
Keywords: Magistrate, laundering, dirty money
Arts in general, Computational linguistics. Natural language processing
Linguocultural Commentary as a Means of Immersing Foreigners in the Russian Mentality when Studying a Local Text (on the Example of Work with the Essay by E. I. Nosov «The Road Home»)
I. O. Amelina, T. Kovaleva
The article dwells on the experience of using an authentic text in the classroom with foreign students. The relevance of the study is due to the scientific, theoretical and practical significance of the problems associated with understanding and adequate interpretation of the new linguoculture, and is determined by the search for optimal tools for language teaching within Russian linguoculture, the study of which involves identifying the specifics of the worldview of Russian speakers and understanding the mentality of the inhabitants of Russia as a whole, and of the region where foreigners are trained in particular. The scientific value and novelty of the research lie in the development of tools – a linguocultural commentary to authentic texts of a regional theme – in order to develop communicative and linguocultural competencies. The study is carried out on the material of the essay by E. I. Nosov «The Road Home», which reflects Kursk regional realities, thus allowing to update linguocultural knowledge and develop communication skills. The object of commenting in a foreign audience is, first of all, vocabulary with a pronounced national and cultural component: archaisms, historicisms, neologisms, anthroponyms, toponyms, phraseological units, etc. The study is based on an interdisciplinary approach that integrates knowledge from various fields of science: linguoculturology, philology, cognitive linguistics, and linguodidactics. In the process of research, commenting, statistical, and comparative-analytical methods were used. To assess the effectiveness of the methodology of linguocultural commenting on the authentic text, a test was conducted in experimental and control groups of foreign students. The results of the test showed that the linguocultural commentary contributes to the understanding and adequate perception of the new linguoculture and to the development of communicative competence.
Polysemy
I. Falkum, Agustín Vicente
A word is said to be polysemous when it is associated with two or several related senses (e.g., a straight line/a washing on a line/a line of bad decisions; lose a wallet/lose a relative; a handsome man/a handsome gift). It is distinguished from monosemy, where a word form is associated with a single meaning, and homonymy, where a single word form is associated with two or several unrelated meanings (e.g., the ‘financial institution’ and ‘riverside’ meanings of bank). Although the distinctions between polysemy, monosemy, and homonymy may seem clear at an intuitive level, they have proven difficult to draw in practice. Some problems are how to count senses or meanings; how to decide whether two senses are related, and in which way they are related; and how to tell apart polysemy from pragmatic effects that affect lexical meanings. Some tests have been proposed in the literature, but such tests do not give uniform results, in part because there are different ways in which an expression can be polysemous. There is an emerging consensus concerning the following “minimal” taxonomy of polysemy, such that the polysemy a word displays can belong to at least one of the following patterns: (i) irregular or accidental polysemy (e.g., The Sun is a star/Mary is a star), (ii) regular polysemy (e.g., catch the rabbit/order the rabbit), and (iii) logical or inherent polysemy (e.g., The book is interesting/the book is heavy). The current literature approaches polysemy from different perspectives and research traditions, including lexicography, formal semantics, cognitive linguistics, distributional semantics, psycholinguistics, pragmatics, and computational linguistics.
Towards Greater Accuracy in Lexicostatistic Dating
M. Swadesh
The Linguistic Blind Spot of Value-Aligned Agency, Natural and Artificial
Travis LaCroix
The value-alignment problem for artificial intelligence (AI) asks how we can ensure that the 'values' (i.e., objective functions) of artificial systems are aligned with the values of humanity. In this paper, I argue that linguistic communication (natural language) is a necessary condition for robust value alignment. I discuss the consequences that the truth of this claim would have for research programmes that attempt to ensure value alignment for AI systems; or, more loftily, designing robustly beneficial or ethical artificial agents.
Zero-shot Cross-Linguistic Learning of Event Semantics
Malihe Alikhani, Thomas Kober, Bashar Alhafni
et al.
Typologically diverse languages offer systems of lexical and grammatical aspect that allow speakers to focus on facets of event structure in ways that comport with the specific communicative setting and discourse constraints they face. In this paper, we look specifically at captions of images across Arabic, Chinese, Farsi, German, Russian, and Turkish and describe a computational model for predicting lexical aspects. Despite the heterogeneity of these languages, and the salient invocation of distinctive linguistic resources across their caption corpora, speakers of these languages show surprising similarities in the ways they frame image content. We leverage this observation for zero-shot cross-lingual learning and show that lexical aspects can be predicted for a given language despite not having observed any annotated data for this language at all.