Results for "Philology. Linguistics"

Showing 20 of ~449,059 results · from arXiv, DOAJ, Semantic Scholar

arXiv Open Access 2026
X-Blocks: Linguistic Building Blocks of Natural Language Explanations for Automated Vehicles

Ashkan Y. Zadeh, Xiaomeng Li, Andry Rakotonirainy et al.

Natural language explanations play a critical role in establishing trust and acceptance of automated vehicles (AVs), yet existing approaches lack systematic frameworks for analysing how humans linguistically construct driving rationales across diverse scenarios. This paper introduces X-Blocks (eXplanation Blocks), a hierarchical analytical framework that identifies the linguistic building blocks of natural language explanations for AVs at three levels: context, syntax, and lexicon. At the context level, we propose RACE (Reasoning-Aligned Classification of Explanations), a multi-LLM ensemble framework that combines Chain-of-Thought reasoning with self-consistency mechanisms to robustly classify explanations into 32 scenario-aware categories. Applied to human-authored explanations from the Berkeley DeepDrive-X dataset, RACE achieves 91.45 percent accuracy and a Cohen's kappa of 0.91 on cases with human annotator agreement, indicating near-human reliability for context classification. At the lexical level, log-odds analysis with informative Dirichlet priors reveals context-specific vocabulary patterns that distinguish driving scenarios. At the syntactic level, dependency parsing and template extraction show that explanations draw from a limited repertoire of reusable grammar families, with systematic variation in predicate types and causal constructions across contexts. The X-Blocks framework is dataset-agnostic and task-independent, offering broad applicability to other automated driving datasets and safety-critical domains. Overall, our findings provide evidence-based linguistic design principles for generating scenario-aware explanations that support transparency, user trust, and cognitive accessibility in automated driving systems.

en cs.AI, cs.CL
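The "log-odds analysis with informative Dirichlet priors" named in the abstract follows a standard formulation (Monroe et al.'s "Fightin' Words"); a minimal Python sketch of the z-scored variant, with toy word counts standing in for the BDD-X vocabulary (the paper's exact prior choice is not given in the abstract):

```python
import math
from collections import Counter

def log_odds_dirichlet(corpus_a, corpus_b, prior):
    """Log-odds ratio with informative Dirichlet prior.

    corpus_a / corpus_b: word -> count in each corpus.
    prior: word -> pseudo-count (e.g. scaled counts from a background corpus).
    Returns word -> z-score; positive means the word skews toward corpus_a,
    negative toward corpus_b.
    """
    n_a, n_b = sum(corpus_a.values()), sum(corpus_b.values())
    a0 = sum(prior.values())
    z = {}
    for w, aw in prior.items():
        ya, yb = corpus_a.get(w, 0), corpus_b.get(w, 0)
        # smoothed log-odds of w in each corpus
        la = math.log((ya + aw) / (n_a + a0 - ya - aw))
        lb = math.log((yb + aw) / (n_b + a0 - yb - aw))
        delta = la - lb
        var = 1.0 / (ya + aw) + 1.0 / (yb + aw)  # approximate variance
        z[w] = delta / math.sqrt(var)
    return z

# toy contexts: "stop" skews to scenario A, "merge" to scenario B
a = Counter({"stop": 30, "merge": 2, "lane": 10})
b = Counter({"stop": 3, "merge": 25, "lane": 12})
prior = {w: 1.0 for w in set(a) | set(b)}  # weak uniform prior
scores = log_odds_dirichlet(a, b, prior)
```

Words with large positive or negative z-scores are the context-specific vocabulary the analysis surfaces.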
arXiv Open Access 2026
Multilingual Language Models Encode Script Over Linguistic Structure

Aastha A K Verma, Anwoy Chatterjee, Mehak Gupta et al.

Multilingual language models (LMs) organize representations for typologically and orthographically diverse languages into a shared parameter space, yet the nature of this internal organization remains elusive. In this work, we investigate which linguistic properties - abstract language identity or surface-form cues - shape multilingual representations. Focusing on compact, distilled models where representational trade-offs are explicit, we analyze language-associated units in Llama-3.2-1B and Gemma-2-2B using the Language Activation Probability Entropy (LAPE) metric, and further decompose activations with Sparse Autoencoders. We find that these units are strongly conditioned on orthography: romanization induces near-disjoint representations that align with neither native-script inputs nor English, while word-order shuffling has limited effect on unit identity. Probing shows that typological structure becomes increasingly accessible in deeper layers, while causal interventions indicate that generation is most sensitive to units that are invariant to surface-form perturbations rather than to units identified by typological alignment alone. Overall, our results suggest that multilingual LMs organize representations around surface form, with linguistic abstraction emerging gradually without collapsing into a unified interlingua.

en cs.CL, cs.LG
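The LAPE metric mentioned above scores how language-specific a unit's activations are. As a rough sketch of the idea (the paper's exact normalization and thresholding may differ): normalize the unit's per-language activation probabilities into a distribution and take its entropy, so low entropy marks a language-specific unit:

```python
import math

def lape(act_probs):
    """Language Activation Probability Entropy for one unit.

    act_probs: per-language activation probabilities, i.e. the fraction of
    tokens in each language on which the unit fires. The probabilities are
    normalized into a distribution over languages and its Shannon entropy is
    returned: low entropy = fires for few languages (language-specific),
    high entropy = fires indiscriminately.
    """
    total = sum(act_probs)
    if total == 0:
        return 0.0
    dist = [p / total for p in act_probs]
    return -sum(p * math.log(p) for p in dist if p > 0)

# a unit firing almost only on one language vs. one firing on all four
specific = lape([0.9, 0.01, 0.01, 0.01])
generic = lape([0.5, 0.5, 0.5, 0.5])
```

A uniform unit attains the maximum entropy log(#languages), so `specific < generic` here.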
arXiv Open Access 2026
STEAD: Robust Provably Secure Linguistic Steganography with Diffusion Language Model

Yuang Qi, Na Zhao, Qiyi Yao et al.

Recent provably secure linguistic steganography (PSLS) methods rely on mainstream autoregressive language models (ARMs) to address a historically challenging task: disguising covert communication as "innocuous" natural language communication. However, because ARMs generate text sequentially, stegotext produced by ARM-based PSLS methods suffers serious error propagation once any part of it is altered, making existing methods unusable under an active tampering attack. To address this, we propose a robust, provably secure linguistic steganography with diffusion language models (DLMs). Unlike ARMs, DLMs can generate text in a partially parallel manner, allowing us to find robust positions for steganographic embedding that can be combined with error-correcting codes. Furthermore, we introduce error correction strategies, including pseudo-random error correction and neighborhood search correction, during steganographic extraction. Theoretical proof and experimental results demonstrate that our method is secure and robust. It can resist token ambiguity in stegotext segmentation and, to some extent, withstand token-level attacks of insertion, deletion, and substitution.

en cs.CR
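The abstract does not specify which error-correcting codes are combined with the robust embedding positions; as an illustration of the general idea of protecting embedded bits against token-level tampering, here is a hypothetical (3,1) repetition code with majority-vote extraction (a deliberately minimal stand-in for the paper's actual correction strategies):

```python
def encode_bits(bits, r=3):
    """(r,1) repetition code: repeat each payload bit r times, so up to
    (r - 1) // 2 corrupted copies per bit can still be corrected."""
    return [b for b in bits for _ in range(r)]

def decode_bits(coded, r=3):
    """Majority-vote decoding, mirroring the error-correction step a robust
    extractor would run on bits recovered from tampered stegotext."""
    out = []
    for i in range(0, len(coded), r):
        chunk = coded[i:i + r]
        out.append(1 if sum(chunk) * 2 > len(chunk) else 0)
    return out

payload = [1, 0, 1, 1]
coded = encode_bits(payload)
coded[1] ^= 1  # simulate a substitution attack flipping one embedded bit
recovered = decode_bits(coded)
```

Despite the flipped bit, majority vote restores the original payload; real systems would use stronger codes to also handle insertions and deletions.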
S2 Open Access 2018
Posthumanist Applied Linguistics

Jiayu Wang, Xinfu Fang

impossible dream’ (p. 275), but apparently believes that it ‘may come true one day’ (p. 275). However, it would be safer not to hope for something rather uncertain, but to work on something more likely to happen. Translation technology has made great advances, as is shown in this book, but it is not Frankenstein’s monster; it is the light that will illuminate the road towards the future of translation, and a light that will benefit mankind and the whole world.

243 citations en Philosophy
arXiv Open Access 2025
Can Vision-Language Models Infer Speaker's Ignorance? The Role of Visual and Linguistic Cues

Ye-eun Cho, Yunho Maeng

This study investigates whether vision-language models (VLMs) can perform pragmatic inference, focusing on ignorance implicatures: utterances that imply the speaker's lack of precise knowledge. To test this, we systematically manipulated contextual cues: the visually depicted situation (visual cue) and QUD-based linguistic prompts (linguistic cue). When only visual cues were provided, three state-of-the-art VLMs (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet) produced interpretations largely based on the lexical meaning of the modified numerals. When linguistic cues were added to enhance contextual informativeness, Claude exhibited more human-like inference by integrating both types of contextual cues. In contrast, GPT and Gemini favored precise, literal interpretations. Although the influence of contextual cues increased, they treated each contextual cue independently and aligned them with semantic features rather than engaging in context-driven reasoning. These findings suggest that although the models differ in how they handle contextual cues, Claude's ability to combine multiple cues may signal emerging pragmatic competence in multimodal models.

en cs.CL
arXiv Open Access 2025
PragWorld: A Benchmark Evaluating LLMs' Local World Model under Minimal Linguistic Alterations and Conversational Dynamics

Sachin Vashistha, Aryan Bibhuti, Atharva Naik et al.

Real-world conversations are rich with pragmatic elements, such as entity mentions, references, and implicatures. Understanding such nuances is a requirement for successful natural communication, and often requires building a local world model which encodes such elements and captures the dynamics of their evolving states. However, it is not well-understood whether language models (LMs) construct or maintain a robust implicit representation of conversations. In this work, we evaluate the ability of LMs to encode and update their internal world model in dyadic conversations and test their malleability under linguistic alterations. To facilitate this, we apply seven minimal linguistic alterations to conversations sourced from popular datasets and construct two benchmarks comprising yes-no questions. We evaluate a wide range of open and closed source LMs and observe that they struggle to maintain robust accuracy. Our analysis unveils that LMs struggle to memorize crucial details, such as tracking entities under linguistic alterations to conversations. We then propose a dual-perspective interpretability framework which identifies transformer layers that are useful or harmful and highlights linguistic alterations most influenced by harmful layers, typically due to encoding spurious signals or relying on shortcuts. Inspired by these insights, we propose two layer-regularization based fine-tuning strategies that suppress the effect of the harmful layers.

en cs.AI, cs.CL
arXiv Open Access 2025
Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models

Sergio E. Zanotto, Segun Aroyehun

The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. While recent research has primarily focused on using LLMs to classify text as either human-written or machine-generated, our study focuses on characterizing these texts using a set of linguistic features across different linguistic levels such as morphology, syntax, and semantics. We select a dataset of human-written and machine-generated texts spanning 8 domains and produced by 11 different LLMs. We calculate different linguistic features such as dependency length and emotionality, and we use them to characterize human-written and machine-generated texts along with different sampling strategies, repetition controls, and model release dates. Our statistical analysis reveals that human-written texts tend to exhibit simpler syntactic structures and more diverse semantic content. Furthermore, we calculate the variability of our set of features across models and domains. Both human- and machine-generated texts show stylistic diversity across domains, with human-written texts displaying greater variation in our features. Finally, we apply style embeddings to further test variability among human-written and machine-generated texts. Notably, newer models output text that is similarly variable, pointing to a homogenization of machine-generated texts.

en cs.CL, cs.AI
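Dependency length, one of the features named in the abstract, is typically the distance between a token's position and its syntactic head's position. A small sketch, assuming CoNLL-U-style 1-based head indices with 0 for the root (the paper's exact operationalization may differ):

```python
def mean_dependency_length(heads):
    """Mean dependency length of a sentence.

    heads[i] is the 1-based index of token (i + 1)'s head, or 0 for the root.
    Dependency length is |token position - head position|; the root arc is
    skipped since it has no lexical head.
    """
    lengths = [abs((i + 1) - h) for i, h in enumerate(heads) if h != 0]
    return sum(lengths) / len(lengths) if lengths else 0.0

# "She quickly read books": all three tokens attach to "read" (position 3)
heads = [3, 3, 0, 3]
mdl = mean_dependency_length(heads)
```

Longer average dependencies are commonly taken as a proxy for syntactic complexity, which is how such a feature separates simpler from more elaborate structures.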
arXiv Open Access 2024
Understanding the effects of word-level linguistic annotations in under-resourced neural machine translation

Víctor M. Sánchez-Cartagena, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez

This paper studies the effects of word-level linguistic annotations in under-resourced neural machine translation, for which there is incomplete evidence in the literature. The study covers eight language pairs, different training corpus sizes, two architectures, and three types of annotation: dummy tags (with no linguistic information at all), part-of-speech tags, and morpho-syntactic description tags, which consist of part of speech and morphological features. These linguistic annotations are interleaved in the input or output streams as a single tag placed before each word. In order to measure the performance under each scenario, we use automatic evaluation metrics and perform automatic error classification. Our experiments show that, in general, source-language annotations are helpful and morpho-syntactic descriptions outperform part of speech for some language pairs. In contrast, when words are annotated in the target language, part-of-speech tags systematically outperform morpho-syntactic description tags in terms of automatic evaluation metrics, even though the use of morpho-syntactic description tags improves the grammaticality of the output. We provide a detailed analysis of the reasons behind this result.
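The interleaving scheme described above (a single tag placed before each word in the token stream) can be sketched as follows; the tag names are illustrative, not the paper's actual tag set:

```python
def interleave_tags(words, tags):
    """Interleave word-level annotations into a token stream by placing each
    tag immediately before its word, as in inline tagging for NMT input or
    output streams."""
    assert len(words) == len(tags), "one tag per word"
    out = []
    for tag, word in zip(tags, words):
        out.extend([tag, word])
    return " ".join(out)

src = interleave_tags(["the", "cat", "sleeps"], ["<DET>", "<NOUN>", "<VERB>"])
# yields "<DET> the <NOUN> cat <VERB> sleeps"
```

The same routine serves both settings in the study: run on the source side it annotates the encoder input, and run on the target side it forces the decoder to emit a tag before each word.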

arXiv Open Access 2024
Investigating Causal Cues: Strengthening Spoofed Audio Detection with Human-Discernible Linguistic Features

Zahra Khanjani, Tolulope Ale, Jianwu Wang et al.

Several types of spoofed audio, such as mimicry, replay attacks, and deepfakes, have created societal challenges to information integrity. Recently, researchers have worked with sociolinguistics experts to label spoofed audio samples with Expert Defined Linguistic Features (EDLFs) that can be discerned by the human ear: pitch, pause, word-initial and word-final release bursts of consonant stops, audible intake or outtake of breath, and overall audio quality. It has been established that several deepfake detection algorithms improve when the traditional and common features of audio data are augmented with these EDLFs. In this paper, using a hybrid dataset of multiple types of spoofed audio augmented with sociolinguistic annotations, we investigate causal discovery and inference between the discernible linguistic features and the labels of the audio clips, comparing the findings of the causal models with the expert ground-truth validation labeling process. Our findings suggest that the causal models indicate the utility of incorporating linguistic features to help discern spoofed audio, as well as the overall need and opportunity to incorporate human knowledge into models and techniques for strengthening AI models. Causal discovery and inference can serve as a foundation both for training humans to discern spoofed audio and for automating EDLF labeling to improve the performance of common AI-based spoofed-audio detectors.

en cs.SD, cs.CL
arXiv Open Access 2024
Teranga Go!: Carpooling Collaborative Consumption Community with multi-criteria hesitant fuzzy linguistic term set opinions to build confidence and trust

Rosana Montes, Ana M. Sanchez, Pedro Villar et al.

Classic Delphi and Fuzzy Delphi methods are used to test the content validity of data collection tools such as questionnaires. Fuzzy Delphi takes the opinions issued by judges from a linguistic perspective, reducing ambiguity by using fuzzy numbers. We propose an extension, the 2-Tuple Fuzzy Linguistic Delphi method, to deal with scenarios in which judges show different degrees of expertise, using fuzzy multigranular semantics of the linguistic terms and expressing intermediate and final results as 2-tuple linguistic values. The key idea of our proposal is to validate the full questionnaire by evaluating its parts, defining the validity of each item as a decision-making problem. Taking the opinions of experts, we measure the degree of consensus, the degree of consistency, and the linguistic score of each item, in order to detect the items that affect the quality of the instrument, positively or negatively. Considering the real need to evaluate a b-learning educational experience with a consensual questionnaire, we present a decision-making model for questionnaire validation that solves it. Additionally, we contribute to this consensus-reaching problem by developing an online tool under the GPL v3 license. The software visualizes the collective valuations for each iteration and helps determine which parts of the questionnaire should be modified to reach a consensual solution.

en cs.CY, cs.AI
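The 2-tuple linguistic representation underlying the method (Herrera and Martínez) maps an aggregated value β in [0, g] to the closest term s_i, where i = round(β), plus a symbolic translation α = β − i in [−0.5, 0.5) that preserves the lost precision. A minimal sketch, with an illustrative five-term scale in place of whatever term set the questionnaire actually uses:

```python
import math

def to_two_tuple(beta, terms):
    """Herrera-Martinez 2-tuple linguistic representation.

    beta: aggregated value in [0, g], where g = len(terms) - 1.
    Returns (s_i, alpha): the closest linguistic term and a symbolic
    translation alpha in [-0.5, 0.5) capturing the remainder.
    """
    g = len(terms) - 1
    beta = max(0.0, min(float(beta), g))
    i = int(math.floor(beta + 0.5))   # round half up, per the definition
    alpha = round(beta - i, 10)
    return terms[i], alpha

# hypothetical five-term validity scale (g = 4)
scale = ["none", "low", "medium", "high", "perfect"]
term, alpha = to_two_tuple(2.6, scale)  # an aggregated item score of 2.6
```

Here 2.6 becomes ("high", −0.4): the item reads as slightly below "high", and the α value can be carried through further aggregation without information loss.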
DOAJ Open Access 2024
Disney characters as moral role models: a discourse analysis of Coco and Encanto

Ruta Vaidya, Yousra Osman

Popular culture characters and narratives can influence people’s understanding of moral actions, and thus guide their behaviors accordingly. Films have proven to be strong anchors in people’s understanding of morality. We examine the perception of resilience as a moral characteristic, looking not just at how people become resilient (thereby focusing on impact and action) but also at “resilience” as a discourse and discussion in society carrying a moral message. We argue that popular culture narratives, particularly Disney films, help reinforce moral meanings of resilience. By conducting a discourse analysis of two films, Encanto and Coco, focusing particularly on resilience, we aim to contribute to a deepened understanding of Disney’s role in producing morality through diverse narratives and characters who serve as role models.

Communication. Mass media
DOAJ Open Access 2024
Language Game in Advertising and Its Impact on Consumers

Anna E. Bazanova, Mohamed Alsadig Hamid Musa

The concept and phenomenon of the language game, with its main functions and types, is applied in commercial advertising as a way to attract the attention of consumers and promote a product. Examples of phonetic, morphological, and syntactic wordplay in English-language commercial advertising texts are analyzed. The purpose of the article is to analyze the techniques of the language game and identify their functional features at various levels in English-language advertising text. The following methods were used: the descriptive-analytical method, the interpretation method, and the search method. When selecting material for analysis, continuous sampling was used. The research material consists of English-language advertisements from various resources, such as magazines, newspapers, and videos, in which a language game was identified. The language game implemented in advertising texts is thus an important phenomenon: it maximizes the impact on the consumer, since its techniques and functions draw the recipient's attention to the advertisement. In addition, an advertisement in which a language game is present is an indicator of a high level of the consumer’s language competence.

Language. Linguistic theory. Comparative grammar, Semantics
arXiv Open Access 2023
A Ship of Theseus: Curious Cases of Paraphrasing in LLM-Generated Texts

Nafis Irtiza Tripto, Saranya Venkatraman, Dominik Macko et al.

In the realm of text manipulation and linguistic transformation, the question of authorship has been a subject of fascination and philosophical inquiry. Much like the Ship of Theseus paradox, which ponders whether a ship remains the same when each of its original planks is replaced, our research delves into an intriguing question: Does a text retain its original authorship when it undergoes numerous paraphrasing iterations? Specifically, since Large Language Models (LLMs) have demonstrated remarkable proficiency in both the generation of original content and the modification of human-authored texts, a pivotal question emerges concerning the determination of authorship in instances where LLMs or similar paraphrasing tools are employed to rephrase the text, i.e., whether authorship should be attributed to the original human author or the AI-powered tool. Therefore, we embark on a philosophical voyage through the seas of language and authorship to unravel this intricate puzzle. Using a computational approach, we discover that the diminishing performance in text classification models, with each successive paraphrasing iteration, is closely associated with the extent of deviation from the original author's style, thus provoking a reconsideration of the current notion of authorship.

en cs.CL
arXiv Open Access 2023
Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition

Taehee Jeon, Bongseok Yang, Changhwan Kim et al.

We introduce a morpheme-aware subword tokenization method that utilizes sub-character decomposition to address the challenges of applying Byte Pair Encoding (BPE) to Korean, a language characterized by its rich morphology and unique writing system. Our approach balances linguistic accuracy with computational efficiency in Pre-trained Language Models (PLMs). Our evaluations show that this technique achieves good performances overall, notably improving results in the syntactic task of NIKL-CoLA. This suggests that integrating morpheme type information can enhance language models' syntactic and semantic capabilities, indicating that adopting more linguistic insights can further improve performance beyond standard morphological analysis.

en cs.CL
DOAJ Open Access 2023
“Please. Do. Not. Share. Videos. Share. Cats.”: Counteracting Terrorist and Violent Extremist Content on Twitter during Terrorist Attacks

Moa Eriksson Krutrök

Obtaining accurate information from social media during a crisis can be difficult, but should all information really be disseminated? Social media platforms actively filter out terrorist and violent extremist content (TVEC), but how are users themselves counteracting its spread? This paper aims to connect the research on media events with studies currently being conducted in information science and digital media research through a case study of tweets during the Vienna terror attack in late 2020. These tweets were manually coded in accordance with Braun and Clarke’s reflexive thematic analysis. This study shows that during the 2020 Vienna attack, GIFs shared on Twitter served three functions: amplification, personalisation and ethical practice. The paper ends with a discussion on the ways cats may function as a countermeasure against the prevalence of TVEC on social media during terrorist attacks and the implications of such countermeasures.

Journalism. The periodical press, etc., Communication. Mass media
S2 Open Access 2020
Is an Antiracist and Decolonizing Applied Linguistics Possible?

Suhanthie Motha

Abstract This article argues for an uncovering of the multitude of ways in which applied linguistics has functioned as an important and effective vehicle for White supremacy and empire, with its disciplinary roots embedded in assumptions about racial inequalities and racial hierarchies and, equally importantly, the concealment of these forms of racial discrimination which often manifest as innocuous language practices. In particular, the notion of objectivity has played a guiding role in reinscribing Whiteness in much applied linguistics theorizing and research within a global context of inequitable racial power and forms of knowledge production and transmission that are steeped in colonial reasoning. In this piece, the author considers what antiracism and decolonization mean within applied linguistics and asks: Is the discipline of applied linguistics irretrievably rooted in an ontology of race and empire? Or is an antiracist and decolonizing applied linguistics possible?

60 citations en Sociology
S2 Open Access 2019
On classification trees and random forests in corpus linguistics: Some words of caution and suggestions for improvement

Stefan Th Gries

Abstract This paper is a discussion of methodological problems that (can) arise in the analysis of multifactorial data analyzed with tree-based or forest-based classifiers in (corpus) linguistics. I showcase a data set that highlights where such methods can fail at providing optimal results and then discuss solutions to this problem as well as the interpretation of random forests more generally.

86 citations en Computer Science
S2 Open Access 2020
Indigenous Language Revitalization and Applied Linguistics: Parallel Histories, Shared Futures?

Onowa McIvor

Abstract Damages done to Indigenous languages occurred due to colonial forces, some of which continue to this day, and many believe efforts to revive them should involve more than Indigenous peoples alone. Therefore, the need for learning Indigenous languages as “additional” languages is a relatively new societal phenomenon and Indigenous language revitalization (ILR) an emerging academic field of study. As the ILR body of literature has developed, it has become clear that this work does not fit neatly into any single academic discipline. While there have been substantial contributions from linguistics and education, the study and recovery of Indigenous languages are necessarily self-determined and self-governing. Also, due to the unique set of circumstances, contexts, and, therefore, solutions needed, it is argued that this discipline is separate from, yet connected to, others. Applied linguists hold specific knowledge and skills that could be extended to ILR toward great gains. This paper explores current foci within ILR, especially concepts, theories, and areas of study that connect applied linguistics and Indigenous language learning. The intention of this paper is to consider commonalities, differences, current and future interests for shared consideration of the potential of collaborations, and partnerships between applied linguistics and ILR scholars.

44 citations en Sociology

Page 40 of 22453