As Large Language Models (LLMs) expand across multilingual domains, evaluating their performance in under-represented languages becomes increasingly important. European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR). To address this, we introduce ALBA, a linguistically grounded benchmark designed from the ground up to assess LLM proficiency on language-related tasks in pt-PT across eight linguistic dimensions: Language Variety, Culture-bound Semantics, Discourse Analysis, Word Plays, Syntax, Morphology, Lexicology, and Phonetics and Phonology. ALBA is manually constructed by language experts and paired with an LLM-as-a-judge framework for scalable evaluation of generated pt-PT language. Experiments on a diverse set of models reveal performance variability across linguistic dimensions, highlighting the need for comprehensive, variety-sensitive benchmarks that support further development of tools for pt-PT.
Marianne de Heer Kloots, Martijn Bentum, Hosein Mohebbi
et al.
Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).
Oliver Cory, Ozge Mercanoglu Sincan, Richard Bowden
This paper introduces SignAgent, a novel agentic framework that utilises Large Language Models (LLMs) for scalable, linguistically-grounded Sign Language (SL) annotation and dataset curation. Traditional computational methods for SLs often operate at the gloss level, overlooking crucial linguistic nuances, while manual linguistic annotation remains a significant bottleneck, proving too slow and expensive for the creation of large-scale, phonologically-aware datasets. SignAgent addresses these challenges through SignAgent Orchestrator, a reasoning LLM that coordinates a suite of linguistic tools, and SignGraph, a knowledge-grounded LLM that provides lexical and linguistic grounding. We evaluate our framework on two downstream annotation tasks. First, on Pseudo-gloss Annotation, where the agent performs constrained assignment, using multi-modal evidence to extract and order suitable gloss labels for signed sequences. Second, on ID Glossing, where the agent detects and refines visual clusters by reasoning over both visual similarity and phonological overlap to correctly identify and group lexical sign variants. Our results demonstrate that our agentic approach achieves strong performance for large-scale, linguistically-aware data annotation and curation.
Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation.
This study explores the application of machine learning-based genetic linguistics for identifying heavy metal response genes in rice (Oryza sativa). By integrating convolutional neural networks and random forest algorithms, we developed a hybrid model capable of extracting and learning meaningful features from gene sequences, such as k-mer frequencies and physicochemical properties. The model was trained and tested on datasets of genes, achieving high predictive performance (precision: 0.89, F1-score: 0.82). RNA-seq and qRT-PCR experiments conducted on rice leaves exposed to Hg0 revealed differential expression of genes associated with heavy metal responses, which validated the model's predictions. Co-expression network analysis identified 103 related genes, and a literature review indicated that these genes are highly likely to be involved in heavy metal-related biological processes. By integrating and comparing the analysis results with those of differentially expressed genes (DEGs), the validity of the new machine learning method was further demonstrated. This study highlights the efficacy of combining machine learning with genetic linguistics for large-scale gene prediction. It demonstrates a cost-effective and efficient approach for uncovering molecular mechanisms underlying heavy metal responses, with potential applications in developing stress-tolerant crop varieties.
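One feature type named above, k-mer frequencies over a gene sequence, can be sketched directly; the sequence and the choice of k below are arbitrary illustrative examples, not data from the study:

```python
from collections import Counter

def kmer_frequencies(seq: str, k: int = 3) -> dict:
    """Return the relative frequency of each length-k substring of seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: counts[kmer] / total for kmer in counts}

# Toy DNA fragment; real inputs would be full gene sequences.
freqs = kmer_frequencies("ATGCGATGAC", k=2)
```

Vectors of such frequencies (over all 4^k possible k-mers) are the kind of fixed-length representation that convolutional or random-forest models can consume.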
This study presents a fascinating linguistic property related to the number of letters in words and their corresponding numerical values. By selecting any arbitrary word, counting its constituent letters, and subsequently spelling out the resulting count and tallying the letters anew, an unanticipated pattern is observed. Remarkably, this iterative sequence, conducted on a dataset of 100,000 random words, invariably converges to the numeral four (4), termed the Linguistic Loop (LL) constant. Examining 73 languages utilizing the Latin alphabet, this research reveals distinctive patterns. Among them, 28 languages exhibit LL-positive behavior adhering to the established property, while 31 languages deviate as LL-negative. Additionally, 13 languages display nuanced tendencies: eight feature two LL constants (bi-positivity), and five feature three constants (tri-positivity). This discovery highlights a linguistic quirk within Latin alphabet-based language number-word representations, uncovering an intriguing facet across diverse alphabetic systems. It also raises questions about the underlying linguistic and cognitive mechanisms responsible for this phenomenon.
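The iterative procedure described above can be sketched in a few lines for English; the `WORDS` lookup is a minimal hypothetical helper covering the letter counts that arise for words of ordinary length:

```python
# Spelled-out forms of the counts reached during iteration (English only);
# a full implementation would use a number-to-words library.
WORDS = {
    1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
    6: "six", 7: "seven", 8: "eight", 9: "nine", 10: "ten",
    11: "eleven", 12: "twelve", 13: "thirteen", 14: "fourteen", 15: "fifteen",
}

def linguistic_loop(word: str) -> int:
    """Repeatedly replace a word by the spelled-out count of its letters
    until the count repeats; in English the fixed point is 'four' (4 letters)."""
    seen = set()
    n = len(word)
    while n not in seen:
        seen.add(n)
        n = len(WORDS[n])
    return n

print(linguistic_loop("linguistics"))
# "linguistics" (11) -> "eleven" (6) -> "six" (3) -> "three" (5) -> "five" (4) -> "four" (4)
```

English is LL-positive in the paper's terminology: every starting word lands on 4, since "four" is the only English number word whose length equals its value.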
Large Language Models (LLMs) are increasingly used in clinical settings, where sensitivity to linguistic uncertainty can influence diagnostic interpretation and decision-making. Yet little is known about where such epistemic cues are internally represented within these models. Distinct from uncertainty quantification, which measures output confidence, this work examines input-side representational sensitivity to linguistic uncertainty in medical text. We curate a contrastive dataset of clinical statements varying in epistemic modality (e.g., 'is consistent with' vs. 'may be consistent with') and propose Model Sensitivity to Uncertainty (MSU), a layerwise probing metric that quantifies activation-level shifts induced by uncertainty cues. Our results show that LLMs exhibit structured, depth-dependent sensitivity to clinical uncertainty, suggesting that epistemic information is progressively encoded in deeper layers. These findings reveal how linguistic uncertainty is internally represented in LLMs, offering insight into their interpretability and epistemic reliability.
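As an illustration only (the abstract does not give MSU's exact definition), a layerwise sensitivity score of this general kind could be sketched as follows; the use of cosine distance as the shift measure and the toy activations are assumptions:

```python
import math
import random

def cosine_distance(a, b):
    """1 - cosine similarity between two activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def layerwise_sensitivity(acts_certain, acts_hedged):
    """Per-layer shift between activations of a confident clinical statement
    ('is consistent with') and its hedged counterpart ('may be consistent with')."""
    return [cosine_distance(a, b) for a, b in zip(acts_certain, acts_hedged)]

# Toy stand-ins for hidden states of a 4-layer model on a paired input:
random.seed(0)
certain = [[random.gauss(0, 1) for _ in range(16)] for _ in range(4)]
hedged = [[x + 0.1 * random.gauss(0, 1) for x in layer] for layer in certain]
print(layerwise_sensitivity(certain, hedged))
```

Plotting such scores against layer depth is the kind of probe that would reveal the depth-dependent encoding of epistemic cues the abstract reports.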
Humans have clear cross-modal preferences when matching certain novel words to visual shapes. Evidence suggests that these preferences play a prominent role in our linguistic processing, language learning, and the origins of signal-meaning mappings. With the rise of multimodal models in AI, such as vision-and-language models (VLMs), it becomes increasingly important to uncover the kinds of visio-linguistic associations these models encode and whether they align with human representations. Informed by experiments with humans, we probe and compare four VLMs for a well-known human cross-modal preference, the bouba-kiki effect. We do not find conclusive evidence for this effect but suggest that results may depend on features of the models, such as architecture design, model size, and training details. Our findings inform discussions on the origins of the bouba-kiki effect in human cognition and future developments of VLMs that align well with human cross-modal associations.
Conspiracy theories present a relevant phenomenon in society and are studied within different disciplines. One aspect of analyzing conspiracy theories is understanding the pattern whereby social movements utilize the media to construct and spread those narratives. In that regard, social movements frame certain events as part of a conspiracy. Applying an international comparative approach, this study examines the media framing of events as a conspiracy theory in the USA and Egypt. The study focuses on political conspiracies targeting the local elite or political establishment as serving foreign enemies. Qualitative framing analysis is applied to news segments (n = 140; 80 articles and 60 videos) from the Egyptian Islamist media and the Make America Great Again (MAGA) conservative media in the USA. The results show a striking similarity in the frames utilized by the media in both countries in constructing conspiracy theories. The Egyptian and American media constructed conspiracy narratives about claimed secret collusion between elite leaders and foreign enemies. Whereas the Egyptian media morally framed the conspiracy as treason, the American media framed it as part of corruption. Moreover, historical continuity was more evident in the Egyptian conspiracy theories than in those in the USA. The empirical analysis shows that conspiracy theories are strategically used in the media to achieve certain political goals.
The article discusses the activities of Khusain Faizkhanov and the results of his work in areas related to linguistics. Drawing on available source materials and indirect evidence, it characterizes Kh. Faizkhanov's proficiency in various languages, demonstrated both in his educational and research activities, and draws attention to the factors that shaped the development of his linguistic skills. The range of languages that Kh. Faizkhanov knew and used in his work was wide. While he mastered some languages owing to his origin and the profile of his education (Turkic, Arabic, Persian), he learned others on his own, reaching a level that enabled him to use them functionally for specific tasks (Russian, Chuvash). Kh. Faizkhanov excelled equally as a linguistic theorist, as clearly evidenced by his own textbook on the Tatar language; as a philologist, as reflected in his work with written monuments; and as a linguist, as shown by his methodologically important conclusions, from the standpoint of historical linguistics, on the interpretation of the Volga-Bulgarian epitaphs, which drew on information from the Chuvash language.
The linguistic landscape is an important field of sociolinguistic research, and eye tracking is a well-established technique in psychological research, yet few studies have used eye movement data to study the linguistic landscape. This paper uses eye tracking to study actual fixation on the linguistic landscape and finds that, on the two dimensions of fixation duration and fixation count, native Chinese speakers attend more to the linguistic landscape than to the general landscape. We argue that this phenomenon is due to the higher information density of linguistic landscapes, and we also discuss other possible explanations.
Natural language artefact descriptions are primary carriers of engineering design knowledge, whose retrieval, representation, and reuse are fundamental to supporting knowledge-intensive tasks in the design process. In this paper, we explicate design knowledge from patented artefact descriptions as knowledge graphs and examine these to understand the linguistic and structural basis. The purpose of our work is to advance the traditional and ontological perspectives of design knowledge and to guide Large-Language Models (LLMs) on how to articulate natural language responses that reflect knowledge that is valuable in a design environment. We populate 33,881 knowledge graphs from a sample of patents stratified according to technology classes. For linguistic basis, we conduct Zipf distribution analyses on the frequencies of unique entities and relationships to identify 64 and 37 generalisable linguistic syntaxes respectively. The relationships largely represent attributes ('of'), structure ('in', 'with'), purpose ('to', 'for'), hierarchy ('include'), exemplification ('such as'), and behaviour ('to', 'from'). For structural basis, we draw inspiration from various studies on biological/ecological networks and discover motifs from patent knowledge graphs. We identify four 3-node and four 4-node subgraph patterns that could be converged and simplified into sequence [->...->], aggregation [->...<-], and hierarchy [<-...->]. Based on these results, we suggest concretisation strategies for entities and relationships and explicating hierarchical structures, potentially aiding the construction and modularisation of design knowledge.
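A Zipf-style rank-frequency tabulation of relationship labels, of the general kind the linguistic analysis relies on, can be sketched as follows; the token list here is purely illustrative, not the patent data:

```python
from collections import Counter

def zipf_table(tokens):
    """Return (rank, token, frequency) triples sorted by descending frequency."""
    counts = Counter(tokens)
    return [(rank + 1, tok, freq)
            for rank, (tok, freq) in enumerate(counts.most_common())]

# Toy sample of relationship labels like those reported above ('of', 'in', ...):
relations = ["of", "in", "to", "of", "for", "of", "in", "with", "to", "of"]
for rank, tok, freq in zipf_table(relations):
    print(rank, tok, freq)
```

Under a Zipfian distribution, frequency falls roughly as the inverse of rank, which is what makes a small set of high-rank syntaxes generalisable across the corpus.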
Emanuele La Malfa, Matthew Wicker, Marta Kwiatkowska
Large Language Models (LLMs) have been reported to have strong performance on natural language processing tasks. However, performance metrics such as accuracy do not measure the quality of the model in terms of its ability to robustly represent complex linguistic structures. In this paper, focusing on the ability of language models to represent syntax, we propose a framework to assess the consistency and robustness of linguistic representations. To this end, we introduce measures of robustness of neural network models that leverage recent advances in extracting linguistic constructs from LLMs via probing tasks, i.e., simple tasks used to extract meaningful information about a single facet of a language model, such as syntax reconstruction and root identification. Empirically, we study the performance of four LLMs across six different corpora on the proposed robustness measures by analysing their performance and robustness with respect to syntax-preserving perturbations. We provide evidence that context-free representations (e.g., GloVe) are in some cases competitive with context-dependent representations from modern LLMs (e.g., BERT), yet equally brittle to syntax-preserving perturbations. Our key observation is that emergent syntactic representations in neural networks are brittle. We make the code, trained models and logs available to the community as a contribution to the debate about the capabilities of LLMs.
In order to better support researchers, journalists, and practitioners in their use of the MeLa-BitChute dataset for exploration and investigative reporting, we provide new psycho-linguistic metadata for the videos, comments, and channels in the dataset using LIWC22. This paper describes that metadata and methods to filter the data using the metadata. In addition, we provide a basic analysis and comparison of the language on BitChute to that on other social media platforms. The MeLa-BitChute dataset and LIWC metadata described in this paper can be found at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KRD1VS.
Probing is widely used to analyze whether linguistic information is captured by a well-trained deep neural model, but it is hard to determine how changes in the encoded linguistic information affect task performance. To this end, we study the dynamic relationship between the encoded linguistic information and task performance from the viewpoint of Pareto optimality. The key idea is to obtain a set of models that are Pareto-optimal with respect to both objectives. From this viewpoint, we propose a method to obtain such Pareto-optimal models by formalizing the problem as multi-objective optimization. We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and investigate the relationship between several kinds of linguistic information and task performance. Experimental results demonstrate that the proposed method outperforms a baseline method. Our empirical findings suggest that some syntactic information is helpful for NLP tasks, whereas encoding more syntactic information does not necessarily lead to better performance, because model architecture is also an important factor.
The communication scenarios linked to social networks and internet-based media have been studied from different perspectives, and have been characterized as digital, reticular, and interactive. This work analyzes expressions linked to Catholicism in these circuits, focusing on the activity of Pope Francis' Spanish-language Twitter account (@Pontifex_es) and on the repercussions among the users of that platform that are reflected online. It seeks to account for the production of meaning, negotiations, disputes, and emerging significations around the publications of the highest authority of the Catholic Church. To this end, digital formats are analyzed, particularly tweets, images, and memes, considering the representations they express of the relationships between religion and new technologies, as well as representations of the Pope's image and of religious beliefs and practices.
Brazilian Portuguese displays a cluster of apparently unrelated properties that set it apart within Romance. On the one hand, it has lost its third person possessive pronouns (cf. Oliveira e Silva 1985, Perini 1985, Cerqueira 1996, and Müller 1996), its third person accusative and dative clitics (cf. Omena 1978, Duarte 1986, Galves 1987, Kato 1993, Nunes 1993, Cyrino 1997, and Berlinck 2006), and its null subjects and null possessors have become severely restricted (cf. Duarte 1995, Figueiredo Silva 1996, Kato 1999, Ferreira 2000, Modesto 2000, Galves 2001, Floripi 2003, and Rodrigues 2004). On the other hand, it came to allow hyper-raising constructions (cf. Ferreira 2000 and Nunes 2020b), as well as the so-called ‘topic subject’ constructions, where a putative topic controls verbal agreement (cf. Pontes 1987, Galves 1987, Nunes 2017, and Kato and Ordóñez 2019). Moreover, it makes pervasive use of preposition deletion in relative clauses (cf. Tarallo 1983) and its directional verbs came to select the preposition em ‘in’ instead of a ‘to’ (cf. Wiedemer 2013). In this paper, I argue that these and other seemingly independent changes can be accounted for if there is a general process of underspecification affecting phases in Brazilian Portuguese.
This article is a review of Sebastian Jagielski's book Przerwane emancypacje. Polityka ekscesu w polskim kinie 1968-1982 [Interrupted Emancipations: The Politics of Excess in Polish Cinema 1968-1982] (2022). The book deals with selected films made between March '68 and the imposition of martial law, marked by formal and thematic excess, which the author situates in a precisely rendered institutional and socio-political context. The category of excess, read among other things through the lens of queer theory, is treated by the author as a symptom of the untapped potential of progressive countercultural movements stifled by the conservative revolution of the early 1980s, a symptom that nonetheless makes it possible to construct a minoritarian narrative of Polish culture, distinct from the hegemonic model of national-Catholic memory.
This essay analyzes the current linguistic policy of the Real Academia Española (RAE) and the consortium of Spanish-American academies, which is based on a standardized and unified conception of language underpinned by the ideological notions of a (Pan-Hispanic) nation and language (of the old empire). The analysis compares the Ortografía de la lengua española [Spelling of the Spanish Language], published in 1999, whose motto “unifica, limpia y fija” [“unify, clean, and fix”] sums up the linguistic policy practiced by the RAE, with texts on linguistic and glottopolitical policies.