Large Language Model (LLM)-powered Automatic Speech Recognition (ASR) systems achieve strong performance with limited resources by linking a frozen speech encoder to a pretrained LLM via a lightweight connector. Prior work trains a separate connector per language, overlooking linguistic relatedness. We propose an efficient and novel connector-sharing strategy based on linguistic family membership, enabling one connector per family, and empirically validate its effectiveness across two multilingual LLMs and two real-world corpora spanning curated and crowd-sourced speech. Our results show that family-based connectors reduce parameter count while improving generalization across domains, offering a practical and scalable strategy for multilingual ASR deployment.
Luis Javier Pentón Herrera, Ana Maria Ferreira Barcelos, Yasir Hussain
Abstract: This article introduces the concept of affective multiliteracies as a new framework for English language education. While current approaches to multiliteracies have expanded the scope of literacy to include multimodal, cultural, and digital dimensions, they continue to emphasize what we define as outer literacies: visible, assessable, and performative skills used to engage with texts, tools, and external contexts. This emphasis often overshadows inner literacies, or the social, emotional, and relational capacities that shape how learners interpret experiences and connect with others. In response, we propose affective multiliteracies as a way to harmonize these inner and outer dimensions, positioning literacy as both internal meaning-making and external participation. Drawing on research in emotional intelligence, pedagogical love, and harmony, this paper examines how educators can cultivate learning environments that support students as human beings. We provide theoretical grounding, practical illustrations, and pedagogical implications to support this holistic vision. By rethinking literacy as a human practice shaped by both cognitive and affective processes, this approach aims to guide educators in preparing students to communicate thoughtfully, relate ethically, and participate meaningfully in their communities worldwide.
Despite the growing progress in Natural Language Inference (NLI) research, resources for the Bengali language remain extremely limited. Existing Bengali NLI datasets exhibit several inconsistencies, including annotation errors, ambiguous sentence pairs, and inadequate linguistic diversity, which hinder effective model training and evaluation. To address these limitations, we introduce BNLI, a refined and linguistically curated Bengali NLI dataset designed to support robust language understanding and inference modeling. The dataset was constructed through a rigorous annotation pipeline emphasizing semantic clarity and balance across entailment, contradiction, and neutrality classes. We benchmarked BNLI using a suite of state-of-the-art transformer-based architectures, including multilingual and Bengali-specific models, to assess their ability to capture complex semantic relations in Bengali text. The experimental findings highlight the improved reliability and interpretability achieved with BNLI, establishing it as a strong foundation for advancing research in Bengali and other low-resource language inference tasks.
With the rapid development of large language models, the generation of fake news has become increasingly effortless, posing a growing societal threat and underscoring the urgent need for reliable detection methods. Early efforts to identify LLM-generated fake news have predominantly focused on the textual content itself; however, because much of that content may appear coherent and factually consistent, the subtle traces of falsification are often difficult to uncover. Through distributional divergence analysis, we uncover prompt-induced linguistic fingerprints: statistically distinct probability shifts between LLM-generated real and fake news when maliciously prompted. Based on this insight, we propose a novel method named Linguistic Fingerprints Extraction (LIFE). By reconstructing word-level probability distributions, LIFE can find discriminative patterns that facilitate the detection of LLM-generated fake news. To further amplify these fingerprint patterns, we also leverage key-fragment techniques that accentuate subtle linguistic differences, thereby improving detection reliability. Our experiments show that LIFE achieves state-of-the-art performance on LLM-generated fake news and maintains high performance on human-written fake news. The code and data are available at https://anonymous.4open.science/r/LIFE-E86A.
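The distributional-divergence idea behind LIFE can be illustrated with a deliberately simplified sketch: instead of an LLM's reconstructed token probabilities, it compares smoothed word-frequency distributions of two texts with KL divergence. The function names and the frequency-based setup are illustrative, not from the paper.

```python
from collections import Counter
import math

def word_distribution(text, vocab):
    """Normalized word-frequency distribution over a fixed vocabulary,
    with add-one smoothing so the KL divergence stays finite."""
    counts = Counter(text.lower().split())
    total = sum(counts.get(w, 0) for w in vocab) + len(vocab)
    return {w: (counts.get(w, 0) + 1) / total for w in vocab}

def kl_divergence(p, q):
    """D_KL(P || Q) in nats between two distributions over the same vocab;
    large values signal a distributional 'fingerprint' difference."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)
```

A real pipeline would build `p` and `q` from model-assigned word-level probabilities rather than raw counts; the divergence test itself is unchanged.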
This research presents a computational social science approach to preserving Telugu Chandassu, the metrical poetry tradition representing centuries of collective cultural intelligence. We develop the first comprehensive digital framework for analyzing Telugu prosodic patterns, bridging traditional community knowledge with modern computational methods. Our social computing approach involves collaborative dataset creation of 4,651 annotated padyams, expert-validated linguistic patterns, and culturally-informed algorithmic design. The framework includes AksharamTokenizer for prosody-aware tokenization, LaghuvuGuruvu Generator for classifying light and heavy syllables, and PadyaBhedam Checker for automated pattern recognition. Our algorithm achieves 91.73% accuracy on the proposed Chandassu Score, with evaluation metrics reflecting traditional literary standards. This work demonstrates how computational social science can preserve endangered cultural knowledge systems while enabling new forms of collective intelligence around literary heritage. The methodology offers insights for community-centered approaches to cultural preservation, supporting broader initiatives in digital humanities and socially-aware computing systems.
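The laghuvu/guruvu distinction that the LaghuvuGuruvu Generator automates follows a classical prosodic rule that can be sketched in simplified form: a syllable is heavy if its vowel is long or if a consonant cluster follows. The transliterated input and function names are illustrative; the real system must also handle anusvara, visarga, and Telugu script directly.

```python
# Long vowels and diphthongs in a simple Latin transliteration (assumed scheme).
LONG_VOWELS = {"aa", "ii", "uu", "ee", "oo", "ai", "au"}

def syllable_weight(vowel, followed_by_cluster):
    """Classify one syllable as laghuvu ('L', light) or guruvu ('G', heavy):
    heavy if its vowel is long, or if the next syllable begins with a
    consonant cluster (simplified classical rule)."""
    return "G" if vowel in LONG_VOWELS or followed_by_cluster else "L"

def scan(syllables):
    """Map a sequence of (vowel, followed_by_cluster) pairs to an L/G string,
    the pattern a PadyaBhedam-style checker would match against a meter."""
    return "".join(syllable_weight(v, c) for v, c in syllables)
```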
David Beauchemin, Pier-Luc Veilleux, Johanna-Pascale Roy
et al.
In this paper, we introduce the Quebec-French Benchmark of Linguistic Minimal Pairs (QFrBLiMP), a corpus designed to evaluate LLMs' linguistic knowledge of prominent grammatical phenomena in Quebec-French. QFrBLiMP comprises 1,761 minimal pairs annotated with 20 linguistic phenomena (LPs). Specifically, these minimal pairs have been created by manually modifying sentences extracted from an official online resource maintained by a Québec government institution. Each pair is annotated by 12 Quebec-French native speakers, who select the sentence they consider grammatical from the two. These annotations are used to compare the competency of LLMs with that of humans. We evaluate different LLMs on QFrBLiMP and MultiBLiMP-Fr by observing the rate of higher probabilities assigned to the sentences of each minimal pair for each category. We find that while grammatical competence scales with model size, a clear hierarchy of difficulty emerges. All benchmarked models consistently fail on phenomena requiring deep semantic understanding, revealing a critical limitation. Finally, our statistical analysis comparing QFrBLiMP and MultiBLiMP reveals a significant performance degradation for most models on Quebec-French; however, the most capable models remain within the statistical significance interval, demonstrating cross-dialectal robustness.
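The evaluation protocol, counting how often a model assigns higher probability to the grammatical member of each pair, reduces to a few lines. The toy unigram scorer below stands in for a real LLM's summed token log-probabilities; all names and frequencies are illustrative.

```python
import math

def minimal_pair_accuracy(pairs, logprob):
    """Fraction of (grammatical, ungrammatical) pairs where the model
    assigns a higher log-probability to the grammatical sentence."""
    hits = sum(1 for good, bad in pairs if logprob(good) > logprob(bad))
    return hits / len(pairs)

# Illustrative unigram frequencies; a real run scores with the LLM instead.
TOY_FREQ = {"le": 0.3, "chat": 0.2, "dort": 0.2, "dorment": 0.01}

def toy_logprob(sentence):
    """Stand-in scorer: sum of unigram log-probabilities, with a small
    floor for out-of-vocabulary words."""
    return sum(math.log(TOY_FREQ.get(w, 1e-6)) for w in sentence.split())
```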
Olga Nikolaevna Prokhorova, Olga Nikolaevna Polshchykova
The research object is the conceptual and semantic organization of the terminological apparatus; the research subject is the logical and conceptual structure of the notions denoted by terms of computational linguistics. The study aims to identify the features of the systemic organization of the notions of computational linguistics through the use of a terminological frame-network model. The paper briefly analyzes how linguists have used the frame approach to model the structure of terminological systems. A terminological frame-network model of the subject area “Computational linguistics” is proposed. The study is illustrated with fragments of the terminological frame network of the subject area “Synonymy of terms of computational linguistics”, presented in the form of diagrams. The work is novel in that it provides an original definition of a terminological frame network and is the first to formulate the rules for its construction. The research findings show that a terminological frame-network model can be used to analyze the logical and conceptual structure of the subject area “Computational linguistics” and the organization of semantic term-formation processes and lexico-semantic relations in the terminology under study.
Recent research in provably secure neural linguistic steganography has overlooked a crucial aspect: the sender must detokenize stegotexts to avoid raising suspicion from the eavesdropper. The segmentation ambiguity problem, which arises when using language models based on subwords, leads to occasional decoding failures in all neural language steganography implementations based on these models. Current solutions to this issue involve altering the probability distribution of candidate words, rendering them incompatible with provably secure steganography. We propose a novel secure disambiguation method named SyncPool, which effectively addresses the segmentation ambiguity problem. We group all tokens with prefix relationships in the candidate pool before the steganographic embedding algorithm runs to eliminate uncertainty among ambiguous tokens. To enable the receiver to synchronize the sampling process of the sender, a shared cryptographically-secure pseudorandom number generator (CSPRNG) is deployed to select a token from the ambiguity pool. SyncPool does not change the size of the candidate pool or the distribution of tokens and thus is applicable to provably secure language steganography methods. We provide theoretical proofs and experimentally demonstrate the applicability of our solution to various languages and models, showing its potential to significantly improve the reliability and security of neural linguistic steganography systems.
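The core of SyncPool, pooling prefix-related candidates and letting sender and receiver derive the same choice from shared randomness, can be sketched as follows. HMAC-SHA256 as the shared keyed PRNG and the simple pooling rule are illustrative choices, not the paper's exact construction.

```python
import hmac, hashlib

def group_by_prefix(tokens):
    """Pool candidate tokens that stand in a prefix relationship, e.g.
    'in', 'inter', 'internal' fall into one ambiguity pool (simplified:
    tokens are pooled under the shortest shared prefix after sorting)."""
    pools = []
    for tok in sorted(tokens):
        if pools and tok.startswith(pools[-1][0]):
            pools[-1].append(tok)
        else:
            pools.append([tok])
    return pools

def sync_select(pool, shared_key, step):
    """Sender and receiver hold the same key and step counter, so both
    derive the same index into the pool; because the choice is keyed
    randomness, it leaks no extra information about the message."""
    digest = hmac.new(shared_key, str(step).encode(), hashlib.sha256).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]
```

Because the pool as a whole keeps its original total probability mass, the candidate distribution seen by the embedding algorithm is unchanged, which is what preserves the provable-security argument.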
In this paper, we consider pluractional markers in Kaqchikel, Karuk, and Yurok. Like Balinese, each of these languages marks one type of pluractionality via reduplication, and a different type of pluractionality via non-reduplicative affixation. This paper serves as a proof-of-concept for applying model-theoretic approaches to language as a lens that can help us to recognize linguistic organization that is not apparent on the surface.
At present, the evaluation information used in interactive group decision methods is based on one-dimensional linguistic variables, while existing multi-attribute group decision making methods based on two-dimensional linguistic information use only single-stage, static evaluation. In this paper, we propose a dynamic group decision making method based on two-dimensional linguistic information, combining dynamic interactive group decision making methods with two-dimensional linguistic evaluation information. The method first uses the Two-Dimensional Uncertain Linguistic Generalized Weighted Aggregation (DULGWA) operators to aggregate the preference information of each decision maker, and then adopts a dynamic information entropy method to obtain the attribute weights at each stage. Finally, we propose a group consistency index to quantify the termination conditions of group interaction. An example is given to verify the developed approach and demonstrate its effectiveness.
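The kind of generalized weighted aggregation the DULGWA operators build on can be sketched over scalar linguistic term indices. This minimal form and the example weights are illustrative; the actual operators work on two-dimensional uncertain linguistic variables, aggregating both the rating dimension and the reliability dimension.

```python
def generalized_weighted_average(values, weights, lam=1.0):
    """Generalized weighted averaging over linguistic term indices:
    (sum_i w_i * a_i**lam)**(1/lam). lam=1 recovers the weighted
    arithmetic mean; larger lam emphasizes larger assessments."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * v ** lam for w, v in zip(weights, values)) ** (1.0 / lam)
```

In the dynamic setting, the same operator would be applied per stage, with the stage weights supplied by the entropy method described in the abstract.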
Chinese Grammatical Error Correction (CGEC) has been attracting growing attention from researchers recently. Although multiple CGEC datasets have been developed to support the research, these datasets lack a deep linguistic typology of grammar errors, which is critical for interpreting and diagnosing CGEC approaches. To address this limitation, we introduce FlaCGEC, a new CGEC dataset featuring fine-grained linguistic annotation. Specifically, we collect a raw corpus following the linguistic schema defined by Chinese language experts, conduct edits on sentences via rules, and refine the generated samples manually, which results in 10k sentences with 78 instantiated grammar points and 3 types of edits. We evaluate various cutting-edge CGEC methods on the proposed FlaCGEC dataset; their unremarkable results indicate that this dataset is challenging in its coverage of a large range of grammatical errors. In addition, we treat FlaCGEC as a diagnostic dataset for testing generalization skills and conduct a thorough evaluation of existing CGEC models.
The neural architectures of language models are becoming increasingly complex, especially that of the Transformer, based on the attention mechanism. Although their application to numerous natural language processing tasks has proven very fruitful, they remain models with little or no interpretability and explainability. One of the tasks for which they are best suited is encoding the contextual sense of words using contextualized embeddings. In this paper we propose a transparent, interpretable, and linguistically motivated strategy for encoding the contextual sense of words by modeling semantic compositionality. Particular attention is given to dependency relations and semantic notions such as selection preferences and paradigmatic classes. A partial implementation of the proposed model is carried out and compared with Transformer-based architectures on a given semantic task, namely the similarity calculation of word senses in context. The results show that linguistically motivated models can be competitive with the black boxes underlying complex neural architectures.
Pretrained Vision-Language Models (VLMs) have achieved remarkable performance in image retrieval from text. However, their performance drops drastically when confronted with linguistically complex texts that they struggle to comprehend. Inspired by the Divide-and-Conquer algorithm and dual-process theory, in this paper we regard linguistically complex texts as compound proposition texts composed of multiple simple proposition sentences and propose an end-to-end Neural Divide-and-Conquer Reasoning framework, dubbed NDCR. It contains three main components: 1) Divide: a proposition generator divides the compound proposition text into simple proposition sentences and produces their corresponding representations; 2) Conquer: a pretrained VLM-based visual-linguistic interactor achieves the interaction between decomposed proposition sentences and images; 3) Combine: a neural-symbolic reasoner combines the above reasoning states to obtain the final solution via a neural logic reasoning approach. According to dual-process theory, the visual-linguistic interactor and the neural-symbolic reasoner can be regarded as the analogical reasoning System 1 and the logical reasoning System 2, respectively. We conduct extensive experiments on a challenging image retrieval from contextual descriptions dataset. Experimental results and analyses indicate that NDCR significantly improves performance on the complex image-text reasoning problem. Code link: https://github.com/YunxinLi/NDCR.
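The Combine step, fusing per-proposition match scores through a soft logical AND, can be illustrated with a product t-norm. The choice of t-norm and the function name are illustrative; NDCR's neural-symbolic reasoner learns this combination rather than hard-coding it.

```python
def soft_and(prop_scores):
    """Soft logical AND over per-proposition image-match scores in [0, 1]:
    the compound text matches an image only if every simple proposition
    does, so any near-zero score drags the combined score down."""
    result = 1.0
    for s in prop_scores:
        result *= s
    return result
```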
This study aims to examine the academic research conducted in the last decade on the future benefits, challenges, and impact of Artificial Intelligence and its adoption in the communication profession. The study is grounded in reviews of relevant academic articles and emphasizes the technological innovations related to Artificial Intelligence applications within the public relations, advertising, and journalism professions. The following are some of the key questions asked in the current study: 1) How will the implementation of Artificial Intelligence affect jobs in the communication profession? 2) Can robots replace public relations, advertising, and journalism professionals? and 3) What kinds of opportunities and challenges will AI-powered systems bring to these three professions? The findings suggest that Artificial Intelligence would benefit the communication profession and that those who adopt Artificial Intelligence technologies would gain a competitive advantage. It appears that the future of the communication profession will be a blend of both Artificial Intelligence technologies and human insight. It is also emphasized that it would be a rather narrow view to expect Artificial Intelligence to create robots that replace humans in performing various tasks or to outperform human intelligence in most of its dimensions.
In the translation into sign language, where does the «sense» reside and how can it be constructed in the target language? To what extent does the orality of sign languages, understood as the absence of a writing system, affect the translation process? What role do the characteristics of sign languages, first and foremost iconicity, play? The issues we address in this study lie at the crossroads between sign language linguistics and translation studies, thanks to the maturity both disciplines have reached in recent decades. As regards the linguistics of sign languages, we refer to the semiological model proposed by Cuxac and colleagues. On the subject of translation studies, our main reference is Meschonnic, according to whom the sense is found in the «rhythm» (understood as form). Analyzing the translation process, and more specifically poetic translation, allows us to observe the centrality of the body. We take into account the perspective of embodied cognition, based on the link between language and the sensorimotor system. We therefore question the role of the body in the construction of the sense: the body is considered above all in its sensorial dimension, in its being an entity that perceives and enters into a relationship with the world. This leads us to hypothesize a synesthetic construction of the sense. In order to follow in practice what is stated theoretically, we present one of our translations: the translation into LIS of a poem in Italian, L'Infinito by Giacomo Leopardi. The translation into sign language makes it possible to observe the role of corporeality in the process of re-enunciation of sense.
Currently, the significance of social media in disseminating noteworthy information on topics such as health, politics, and the economy is indisputable. During the COVID-19 pandemic, anti-vaxxers used social media to distribute fake news and anxiety-provoking information about the vaccine, which may harm the public. Here, we characterize the psycho-linguistic features of anti-vaxxers on the online social network Twitter. To this end, we collected COVID-19-related tweets from February 2020 to June 2021 to analyse vaccination stance, linguistic features, and social network characteristics. Our results demonstrate that, compared to pro-vaxxers, anti-vaxxers tend to show more negative emotions, more narrative thinking, and worse moral tendencies. This study can advance our understanding of the online anti-vaccination movement and is critical for social media management and policy action during and after the pandemic.
Jordi Armengol-Estapé, Marta R. Costa-jussà, Carlos Escolano
Introducing factors, that is to say, word features such as linguistic information referring to the source tokens, is known to improve the results of neural machine translation systems in certain settings, typically in recurrent architectures. This study proposes enhancing the current state-of-the-art neural machine translation architecture, the Transformer, so that it can incorporate external knowledge. In particular, our proposed modification, the Factored Transformer, uses linguistic factors that insert additional knowledge into the machine translation system. Apart from using different kinds of features, we study the effect of different architectural configurations. Specifically, we analyze the performance of combining words and features at the embedding level or at the encoder level, and we experiment with two different combination strategies. With the best-found configuration, we show improvements of 0.8 BLEU over the baseline Transformer in the IWSLT German-to-English task. Moreover, we experiment with the more challenging FLoRes English-to-Nepali benchmark, which involves extremely low-resourced and very distant languages, and obtain an improvement of 1.2 BLEU.
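The two embedding-level combination strategies, concatenating word and factor vectors versus summing them, can be sketched as below. Pure-Python lists stand in for learned embedding tensors, and the names are illustrative.

```python
def combine_embeddings(word_vec, factor_vec, strategy="concat"):
    """Embedding-level fusion of a word vector with a linguistic-factor
    vector: 'concat' widens the embedding, while 'sum' keeps its size
    (and therefore requires matching dimensions)."""
    if strategy == "concat":
        return word_vec + factor_vec
    if strategy == "sum":
        if len(word_vec) != len(factor_vec):
            raise ValueError("sum requires equal dimensions")
        return [w + f for w, f in zip(word_vec, factor_vec)]
    raise ValueError(f"unknown strategy: {strategy}")
```

The trade-off mirrors the one studied in the paper: concatenation preserves the factor as a separate subspace at the cost of a wider model, while summation keeps the architecture unchanged but forces words and factors to share dimensions.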
Nemania Borovits, Indika Kumara, Parvathy Krishnan
et al.
Linguistic anti-patterns are recurring poor practices concerning inconsistencies among the naming, documentation, and implementation of an entity. They impede readability, understandability, and maintainability of source code. This paper attempts to detect linguistic anti-patterns in infrastructure as code (IaC) scripts used to provision and manage computing environments. In particular, we consider inconsistencies between the logic/body of IaC code units and their names. To this end, we propose a novel automated approach that employs word embeddings and deep learning techniques. We build and use the abstract syntax tree of IaC code units to create their code embeddings. Our experiments with a dataset systematically extracted from open source repositories show that our approach yields an accuracy between 0.785 and 0.915 in detecting inconsistencies.
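The detection idea, flagging an IaC unit whose name embedding diverges from its body embedding, reduces to a similarity test. The cosine measure and the 0.5 threshold below are illustrative; the actual approach trains a deep model over AST-derived embeddings instead of thresholding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def flag_inconsistent(name_vec, body_vec, threshold=0.5):
    """Flag a name/body pair whose embeddings disagree; the cutoff is an
    illustrative value, not one reported in the paper."""
    return cosine(name_vec, body_vec) < threshold
```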
This translation is part of a first revision of the translation work presented in a master's thesis (SOUZA, 2016). The proposal was to translate Ovid's elegiac couplets in a form that echoed the original rhythm of the Latin poem. To this end, the starting point was Carlos Alberto Nunes's proposal of replacing long syllables in princeps position with stressed syllables. In this translation, however, unlike Nunes's, the stressed syllable was also allowed to be followed by a single unstressed syllable, forming a trochee that can be performed as a spondee. In this way, the possibility of metrical variation was preserved, but the dactyls were kept fixed in the fifth foot of the hexameter and in the second hemistich of the pentameter. The obligatory caesura of the dactylic pentameter was rendered by bringing two stressed syllables together and emphasized with obligatory spacing, which leads the reader to restart the rhythm. To maintain the pace of the meter, devices such as stress displacement, elision between the end of the hexameter and the beginning of the pentameter, and elisions in general were used. The present revision introduces changes to the hexameter, which seeks to realize its caesuras, though with the result of several feminine caesuras. In addition, the tone of the poem sought greater lightness and fluidity, through the elimination of radical hyperbata and the rearrangement of information within the couplet, which is the unit within the poem. The translated poem, the eighth of Book 1, presents the figure of the procuress, a stock character of Latin new comedy, here presented as a witch who influences the beloved to extort her lovers and to scorn the poetic persona, who, being a poet, has no valuable gifts to give other than the poems themselves.
Language and Literature, Translating and interpreting
The expression artes liberales, found first in Cicero, has had a lasting and multiform impact through Seneca and Martianus Capella to our days. In its essence, the expression calls for a complete education (over and above strictly technical training) – a project which still has a bearing today with reference to Cicero. This is a credit to Cicero, but the project itself is not without issues. Liberalis does not mean “free”, as it is often translated today, but “worthy of a free man”, making it distinct from “dirty” money-making occupations. Such a limitation imposed upon an elite upper class is alien to us. In other instances, Cicero describes the same arts without reservation as “human” (humanae). They are the object of study for their own sake, insofar as they have an intrinsic value, and not to gain an advantage or benefit (such as to earn money). In this respect, the attributes liberalis and humanus are often interchangeable. Cicero includes in the artes liberales geometry, music, literature (and poetry), natural sciences, ethics, and politics. This programme is still valid today, though it presents some gaps such as the visual arts and foreign languages. The programme ought to, therefore, be completed and modified. Cicero regularly highlights that all the artes liberales share a common bond. This picture allows for different interpretations. It comes close to the allegory of a circle of noble arts, which is not mentioned in Cicero. However, even if there exist variations and modifications in form, image and content of the artes liberales, Cicero’s reflections on the ideal education remain their original source and essence; they should therefore also be taken into account in current debates on education.
Philology. Linguistics, Greek language and literature. Latin language and literature