Results for "Philology. Linguistics"

arXiv Open Access 2026

Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models

Paola Merlo, Chunyang Jiang, Giuseppe Samo et al.

This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and targeted experiments on chunking and systematicity. BLMs are multiple-choice problems, structured at multiple levels: within each sentence, across the input sequence, within each candidate answer. Because of their rich structure, these curated but naturalistic datasets are key to answering some core questions about current large language models' abilities: do LLMs detect linguistic objects and their properties? Do they detect and use systematic patterns across sentences? Are they more prone to linguistic or reasoning errors, and how do these interact? We show that BLMs, while challenging, can be solved at good levels of performance, in more than one language, with simple baseline models or, at better performance levels, with more tailored models. We show that their representations contain the grammatical objects and attributes relevant to solving a linguistic task. We also show that these solutions are reached by detecting systematic patterns across sentences. The paper supports the point of view that curated, structured datasets support multi-faceted investigations of properties of language and large language models. Because they present a curated, articulated structure, because they comprise both learning contexts and expected answers, and because they are partly built by hand, BLMs fall in the category of datasets that can support explainability investigations, and can be useful for asking why large language models behave the way they do.

en cs.CL

arXiv Open Access 2026

What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance

William Watson, Nicole Cho, Sumitra Ganesh et al.

Large Language Model (LLM) hallucinations are usually treated as defects of the model or its decoding strategy. Drawing on classical linguistics, we argue that a query's form can also shape a listener's (and model's) response. We operationalize this insight by constructing a 22-dimension query feature vector covering clause complexity, lexical rarity, anaphora, negation, answerability, and intention grounding, all known to affect human comprehension. Using 369,837 real-world queries, we ask: Are there certain types of queries that make hallucination more likely? A large-scale analysis reveals a consistent "risk landscape": certain features such as deep clause nesting and underspecification align with higher hallucination propensity. In contrast, clear intention grounding and answerability align with lower hallucination rates. Others, including domain specificity, show mixed, dataset- and model-dependent effects. Thus, these findings establish an empirically observable query-feature representation correlated with hallucination risk, paving the way for guided query rewriting and future intervention studies.

en cs.CL, cs.AI

arXiv Open Access 2026

EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

Zhenghao Chen, Huiqun Wang, Di Huang

Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by introducing 3D priors or geometric supervision, which enhances performance but incurs substantial data preparation and alignment costs. In contrast, purely 2D approaches often struggle with multi-frame spatial reasoning due to their limited ability to capture cross-frame spatial relationships. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which progressively reasons toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, demonstrating its effectiveness in strengthening the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition. Code and data are released at https://github.com/Hyggge/EgoMind.

en cs.CV

arXiv Open Access 2025

NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities

Abdellah El Mekki, Houdaifa Atou, Omer Nacar et al.

Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora, which, while demonstrating promising linguistic understanding and translation abilities, often results in models aligned with source language culture. These models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof-of-concept, we develop NileChat, a 3B parameter Egyptian and Moroccan Arabic LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. This work addresses Arabic dialect in LLMs with a focus on cultural and values alignment via controlled synthetic data generation and retrieval-augmented pre-training for Moroccan Darija and Egyptian Arabic, including Arabizi variants, advancing Arabic NLP for low-resource communities. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in cultural LLM development: https://github.com/UBC-NLP/nilechat .

en cs.CL

arXiv Open Access 2025

Do language models accommodate their users? A study of linguistic convergence

Terra Blevins, Susanne Schmalwieser, Benjamin Roth

While large language models (LLMs) are generally considered proficient in generating language, how similar their language usage is to that of humans remains understudied. In this paper, we test whether models exhibit linguistic convergence, a core pragmatic element of human language communication: do models adapt, or converge, to the linguistic patterns of their user? To answer this, we systematically compare model completions of existing dialogues to original human responses across sixteen language models, three dialogue corpora, and various stylometric features. We find that models strongly converge to the conversation's style, often significantly overfitting relative to the human baseline. While convergence patterns are often feature-specific, we observe consistent shifts in convergence across modeling settings, with instruction-tuned and larger models converging less than their pretrained and smaller counterparts. Given the differences in human and model convergence patterns, we hypothesize that the underlying mechanisms driving these behaviors are very different.

en cs.CL

arXiv Open Access 2025

Adaptive Linguistic Prompting (ALP) Enhances Phishing Webpage Detection in Multimodal Large Language Models

Atharva Bhargude, Ishan Gonehal, Dave Yoon et al.

Phishing attacks represent a significant cybersecurity threat, necessitating adaptive detection techniques. This study explores few-shot Adaptive Linguistic Prompting (ALP) in detecting phishing webpages through the multimodal capabilities of state-of-the-art large language models (LLMs) such as GPT-4o and Gemini 1.5 Pro. ALP is a structured semantic reasoning method that guides LLMs to analyze textual deception by breaking down linguistic patterns, detecting urgency cues, and identifying manipulative diction commonly found in phishing content. By integrating textual, visual, and URL-based analysis, we propose a unified model capable of identifying sophisticated phishing attempts. Our experiments demonstrate that ALP significantly enhances phishing detection accuracy by guiding LLMs through structured reasoning and contextual analysis. The findings highlight the potential of ALP-integrated multimodal LLMs to advance phishing detection frameworks, achieving an F1-score of 0.93, surpassing traditional approaches. These results establish a foundation for more robust, interpretable, and adaptive linguistic-based phishing detection systems using LLMs.

en cs.CL

arXiv Open Access 2024

Linguistically Differentiating Acts and Recalls of Racial Microaggressions on Social Media

Uma Sushmitha Gunturi, Anisha Kumar, Xiaohan Ding et al.

In this work, we examine the linguistic signature of online racial microaggressions (acts) and how it differs from that of personal narratives recalling experiences of such aggressions (recalls) by Black social media users. We manually curate and annotate a corpus of acts and recalls from in-the-wild social media discussions, and verify labels with Black workshop participants. We leverage Natural Language Processing (NLP) and qualitative analysis on this data to classify (RQ1), interpret (RQ2), and characterize (RQ3) the language underlying acts and recalls of racial microaggressions in the context of racism in the U.S. Our findings show that neural language models (LMs) can classify acts and recalls with high accuracy (RQ1) with contextual words revealing themes that associate Blacks with objects that reify negative stereotypes (RQ2). Furthermore, overlapping linguistic signatures between acts and recalls serve functionally different purposes (RQ3), providing broader implications to the current challenges in content moderation systems on social media.

en cs.HC, cs.SI

arXiv Open Access 2024

Testing MediaPipe Holistic for Linguistic Analysis of Nonmanual Markers in Sign Languages

Anna Kuznetsova, Vadim Kimmelman

Advances in Deep Learning have made possible reliable landmark tracking of human bodies and faces that can be used for a variety of tasks. We test a recent Computer Vision solution, MediaPipe Holistic (MPH), to find out if its tracking of the facial features is reliable enough for a linguistic analysis of data from sign languages, and compare it to an older solution (OpenFace, OF). We use an existing data set of sentences in Kazakh-Russian Sign Language and a newly created small data set of videos with head tilts and eyebrow movements. We find that MPH does not perform well enough for linguistic analysis of eyebrow movement, although it fails in a different way from OF, which also performs poorly without correction. We reiterate a previous proposal to train additional correction models to overcome these limitations.

en cs.CV

arXiv Open Access 2024

Context is Important in Depressive Language: A Study of the Interaction Between the Sentiments and Linguistic Markers in Reddit Discussions

Neha Sharma, Kairit Sirts

Research exploring linguistic markers in individuals with depression has demonstrated that language usage can serve as an indicator of mental health. This study investigates the impact of discussion topic as context on linguistic markers and emotional expression in depression, using a Reddit dataset to explore interaction effects. Contrary to common findings, our sentiment analysis revealed a broader range of emotional intensity in depressed individuals, with both higher negative and positive sentiments than controls. This pattern was driven by posts containing no emotion words, revealing the limitations of lexicon-based approaches in capturing the full emotional context. We observed several interesting results demonstrating the importance of contextual analyses. For instance, the use of first-person singular pronouns and words related to anger and sadness correlated with increased positive sentiments, whereas a higher rate of present-focused words was associated with more negative sentiments. Our findings highlight the importance of discussion contexts when interpreting the language used in depression, revealing that the emotional intensity and meaning of linguistic markers can vary based on the topic of discussion.

en cs.CL

DOAJ Open Access 2024

Students’ Perceptions of An Exemplary Online ESP Teacher: A Mixed-Methods Study

Putri Gayatri, Hafida Ruminar, Alies Poetri Lintangsari

The shift from traditional to online teaching and learning during the COVID-19 pandemic created various hurdles. As many reports indicate, both students and teachers experienced dissatisfaction and faced challenges during the teaching and learning process. Amid several students' dissatisfaction with the online delivery of various subjects, a report on teaching and learning experience revealed the exceptional performance of an English for Specific Purposes (ESP) teacher named Ms. Sarah (a pseudonym), as recognised by her students. The researchers then conducted a mixed-methods study to identify the specific features of Ms. Sarah's teaching that students found engaging and enjoyable. Quantitative and qualitative data were gathered by administering closed and open-ended questionnaires to 38 students enrolled in Ms. Sarah's class. The findings indicated that Ms. Sarah's online ESP lessons were well received. All students (100%) expressed confidence in Ms. Sarah's competence to deliver online ESP instruction. More than 80% of the participants endorsed other teachers adopting Ms. Sarah's instructional approach, while 97.37% stated their intention to join Ms. Sarah's other courses in the upcoming semester. This paper also explains in more detail why students appreciated Ms. Sarah's online class. In addition, attention to SLA theory and cultural dimensions was found to contribute to effective online instruction. Ms. Sarah's approach to conducting an online ESP course may be a valuable resource for ESP teachers looking to improve their online teaching strategies. Keywords: Online instructions, ESP, SLA, Cultural dimensions

Philology. Linguistics

DOAJ Open Access 2024

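El tratamiento de las unidades discursivas superiores de lo escrito: desde el párrafo y más allá del párrafo. Retos, propuestas y materiales [The treatment of higher-level discourse units in written language: from the paragraph and beyond. Challenges, proposals, and materials]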

Shima Salameh Jiménez, Adrià Pardo Llibrer

Interest in the segmentation of written discourse has grown considerably in recent years, a development that can be linked to several factors. These range from the theoretical (e.g., the need to categorize and clearly delimit the units used to segment written text without falling back on traditional syntax, or the lack of agreement about the properties of units as essential as the paragraph) to the pedagogical (e.g., the growing difficulty of writing formal texts, visible in inadequate organization and distribution of information, misuse of punctuation, or misuse of discourse markers and connectives). They also include the emergence of new hybrid genres in which units such as the utterance or the paragraph may not fully fit the material analyzed (e.g., texts written on social media in which there is no direct correspondence between paragraph and unit of meaning, or in which even the classic initial, medial, or final positions may be altered). Compared with models for segmenting spoken discourse, which since 2010 have seen notable momentum and development both nationally and internationally (Pons, 2014), models of written discourse, despite their theoretical soundness, seem to have been relegated to the background, eclipsed by the interest in analyzing genres such as colloquial conversation or the interview. This special issue, made up of five papers that start from different theoretical frameworks, seeks to shed light on an essential question: how to organize, characterize, and delimit everything that lies above the paragraph and below the text as the highest-level unit, a space that poses a challenge for researchers.

Philology. Linguistics

arXiv Open Access 2023

Personalized State Anxiety Detection: An Empirical Study with Linguistic Biomarkers and A Machine Learning Pipeline

Zhiyuan Wang, Mingyue Tang, Maria A. Larrazabal et al.

Individuals high in social anxiety symptoms often exhibit elevated state anxiety in social situations. Research has shown it is possible to detect state anxiety by leveraging digital biomarkers and machine learning techniques. However, most existing work trains models on an entire group of participants, failing to capture individual differences in their psychological and behavioral responses to social contexts. To address this concern, in Study 1, we collected linguistic data from N=35 high socially anxious participants in a variety of social contexts, finding that digital linguistic biomarkers significantly differ between evaluative vs. non-evaluative social contexts and between individuals having different trait psychological symptoms, suggesting the likely importance of personalized approaches to detect state anxiety. In Study 2, we used the same data and results from Study 1 to model a multilayer personalized machine learning pipeline to detect state anxiety that considers contextual and individual differences. This personalized model outperformed the baseline F1-score by 28.0%. Results suggest that state anxiety can be more accurately detected with personalized machine learning approaches, and that linguistic biomarkers hold promise for identifying periods of state anxiety in an unobtrusive way.

en cs.HC, cs.CY

arXiv Open Access 2023

Jambu: A historical linguistic database for South Asian languages

Aryaman Arora, Adam Farris, Samopriya Basu et al.

We introduce Jambu, a cognate database of South Asian languages which unifies dozens of previous sources in a structured and accessible format. The database includes 287k lemmata from 602 lects, grouped together in 23k sets of cognates. We outline the data wrangling necessary to compile the dataset and train neural models for reflex prediction on the Indo-Aryan subset of the data. We hope that Jambu is an invaluable resource for all historical linguists and Indologists, and look towards further improvement and expansion of the database.

en cs.CL

DOAJ Open Access 2023

Translators who own it: A case study on how doxa and psychological ownership impact translators’ engagement and job satisfaction

Esther Monzó-Nebot

This paper explores the job satisfaction of translators working for an international intergovernmental organization. The extant literature on translators' job satisfaction has explored a number of constructs. Based on developments in the field of organizational theory and the complexity of translation as a job, it is argued that psychological ownership may prove an adequate framework to explain translators' job satisfaction and instrumental in establishing a dialogue between the various analyses of different workplaces in the field of translation and interpreting studies. The study focuses on a specific multilingual intergovernmental organization and draws on interviews with 17 Spanish-native translators of different nationalities. Their feelings of ownership are analyzed, and variations in how they relate to the constructs of psychological ownership (feelings of control, intimacy of knowledge, and self-investment) become apparent. Exploring patterns shows those variations to be related to translators' differing translation doxas, that is, their divergent, competing, and sometimes conflicting understandings of what translation is and should be. Furthermore, relationships between psychological ownership, translation doxa, and translators' efforts to advance their own doxas in the organization are examined with a view towards creating means to engage professional translators in advancing a doxa shaped by and for translators across workplaces.

Translating and interpreting

DOAJ Open Access 2023

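Experiências em divulgação científica e sensibilização da população: importância do controle mecânico do vetor Aedes aegypti [Experiences in science communication and public awareness: the importance of mechanical control of the vector Aedes aegypti]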

Karine Pedreira Padilha, Rafaela Vieira Bruno, Luana Cristina Farnesi

Aedes aegypti is the main vector of dengue, Zika, and chikungunya, diseases for which no fully effective vaccines exist. Control alternatives aimed at mitigating these arboviral diseases are essential. Among them, mechanical control covers practices for eliminating and/or cleaning the vector's breeding sites. In this report, we present and critically evaluate activities carried out by our group between 2016 and 2019, in which we shared clear scientific information through dialogue with the public. The methods used were: 1) talks in schools (children and adolescents) using slides, photos, and videos; 2) workshops (mixed audiences), with a stand displaying live material from the Aedes life cycle, games, and drawings. We analyzed ten talks in primary and secondary schools and 20 workshops held in different regions of Brazil. We conclude that such activities and their critical analysis must be carried out continuously in order to succeed.

Communication. Mass media, Public aspects of medicine

arXiv Open Access 2022

Extracting linguistic speech patterns of Japanese fictional characters using subword units

Mika Kishino, Kanako Komiya

This study extracted and analyzed the linguistic speech patterns that characterize Japanese anime or game characters. Conventional morphological analyzers, such as MeCab, segment words with high performance, but they are unable to segment broken expressions or utterance endings that are not listed in the dictionary, which often appear in the lines of anime or game characters. To overcome this challenge, we propose segmenting lines of Japanese anime or game characters using subword units that were proposed mainly for deep learning, and extracting frequently occurring strings to obtain expressions that characterize their utterances. We analyzed the subword units weighted by TF-IDF according to gender, age, and each anime character and show that they are linguistic speech patterns that are specific to each feature. Additionally, a classification experiment shows that the model with subword units outperformed that with the conventional method.

en cs.CL

arXiv Open Access 2021

Lexically Aware Semi-Supervised Learning for OCR Post-Correction

Shruti Rijhwani, Daisy Rosenblum, Antonios Anastasopoulos et al.

Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically-aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15-29%, where we find the combination of self-training and lexically-aware decoding essential for achieving consistent improvements. Data and code are available at https://shrutirij.github.io/ocr-el/.

en cs.CL

DOAJ Open Access 2021

LA FASE DI PASSAGGIO DALLO SPAZIO FISICO DELL’AULA ALLA DIDATTICA VIRTUALE. UNA PROPOSTA TEORICO-PRATICA IN CORSI DI ITALIANO L2 FORMALI [The transition from the physical space of the classroom to virtual teaching. A theoretical-practical proposal in formal Italian L2 courses]

Damiano Longo

Learners' behavior in machine-mediated distance learning involves different factors from the dynamics of face-to-face teaching. The COVID-19 emergency pushed teachers to improvise virtual teaching methods. This paper focuses on distance-learning (DAD, didattica a distanza) teaching interventions aimed at non-Italian-speaking students in basic-level Italian language courses at the Provincial Centers for Adult Education (CPIA). To this end, the roles of the Italian L2 teacher, the learner, and the setting in distance learning were analyzed. Drawing on the author's personal experience, the paper also presents a theoretical-practical proposal based on the theory of modeling and mirror neurons for the transition from the physical space of the classroom to virtual teaching.

Language and Literature, Philology. Linguistics

DOAJ Open Access 2021

Aesthetic aspects of Mark Zakharov’s direction

S. A. Chebotarev, V. A. Sazonova

We present an analysis of the basic aesthetic principles of Mark Zakharov's directing. We consider the director's creative path and the factors that influenced the development of his directorial views and worldview. Throughout his career, Zakharov kept developing and was constantly searching. The director formulated some principles of working with actors. Among them is the need to help the actor, steering him toward the intended concept without turning him into a pawn for directorial experiments. A modern actor must constantly take into account three important theatrical elements: the material environment, partners, and spectators. Mark Zakharov's productions did not confine the actor to a rigid framework; on the contrary, they gave him room for improvisation, active life, and constant creative search. We also focus on the director's ability to influence the viewer. He was able to convey to the viewer, unobtrusively and in a comedy-musical form, the most terrible events and the wisest thoughts. This form is always better received, and the viewer is more willing to let himself be taught. We conclude that Zakharov built his own special directing system based on his worldview.

Philology. Linguistics

DOAJ Open Access 2020

“Fue sin querer queriendo”: o teatro como estratégia para o desenvolvimento oral em língua espanhola. [“Fue sin querer queriendo”: theater as a strategy for oral development in Spanish]

Adriana Teixeira Pereira

Learning an additional language means learning a new culture, a new way of seeing and interacting with others. In this sense, language learning requires the student to develop different comprehension and production skills, in both writing and speaking. In this paper, I describe how a Hispanic theater showcase was organized at the Instituto Federal de Educação, Ciência e Tecnologia, Juazeiro do Norte campus, with the aim of developing students' oral practice in Spanish through theater and, beyond that, offering a journey through the cultures of Spanish speakers. As its theoretical basis, the paper draws on discussions of language teaching in basic education (BRASIL, 2006; 2002), oral skills (GÓMEZ, 2004), and theater in foreign language teaching (GRANERO, 2011; LOPES JÚNIOR, 2006). The success of this experience is reflected in the students' involvement and participation in the activity, which supported the development of their Spanish in a setting different from the classroom. With this report, I hope to contribute to reflection on the importance of working on oral skills in the teaching of Spanish in basic education.

Special aspects of education, Philology. Linguistics
