Hasil "Philology. Linguistics"

arXiv Open Access 2026

Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction

Tri Nguyen, Huy Hoang Bao Le, Lohith Srikanth Pentapalli et al.

Detecting jailbreak attempts in clinical training large language models (LLMs) requires accurate modeling of linguistic deviations that signal unsafe or off-task user behavior. Prior work on the 2-Sigma clinical simulation platform showed that manually annotated linguistic features could support jailbreak detection. However, reliance on manual annotation limited both scalability and expressiveness. In this study, we extend this framework by using experts' annotations of four core linguistic features (Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction) and training multiple general-domain and medical-domain BERT-based LLM models to predict these features directly from text. The most reliable feature regressor for each dimension was selected and used as the feature extractor in a second layer of classifiers. We evaluate a suite of predictive models, including tree-based, linear, probabilistic, and ensemble methods, to determine jailbreak likelihood from the extracted features. Across cross-validation and held-out evaluations, the system achieves strong overall performance, indicating that LLM-derived linguistic features provide an effective basis for automated jailbreak detection. Error analysis further highlights key limitations in current annotations and feature representations, pointing toward future improvements such as richer annotation schemes, finer-grained feature extraction, and methods that capture the evolving risk of jailbreak behavior over the course of a dialogue. This work demonstrates a scalable and interpretable approach for detecting jailbreak behavior in safety-critical clinical dialogue systems.

en cs.AI, cs.LG

Detail Sumber

DOAJ Open Access 2025

Linguoesthetic and idiodiscursive dimensions of artistic modelling

Alexander V. Ignatenko

The study addresses the issue of the linguistic analysis of artistic discourse in the aesthetic projection of idiodiscursive language modelling, where artistic discourse is understood as a particular aesthetic and ethno-cultural worldview formed through the intellectual and creative interaction between the artist (author) and the recipient (reader). The purpose of this work is to clarify the dimensions of the linguistic aesthetic model of the author's set of techniques and tools based on the linguistic discursive practices of the contemporary Chinese literature (the turn of the XX-XXI centuries). The study analyses six popular works of prose in the form of a lengthy novels by Mo Yan, Liu Zhenyun, Yu Hua, Jia Pingwa, Wang Anyi and Moxiang Tongxiu, comprising a total of about three million hieroglyphs. The study employs a comprehensive methodology underpinned by the concepts of integral linguistics, text linguistics, pragmatics, cognitive and suggestive linguistics, which consequently engenders an interdisciplinary approach. The paper sets out to analyse the influence of ethnocultural features on the formation of a nationally oriented artistic discourse. The concept of national culture is identified through the realms of the linguistic picture of the world, which act as a cultural code or cultural marker. The author’s linguistic personality is regarded as a modelling projection of the national language and ethnocultural features. The results of the study demonstrate that the parameters of the linguoesthetic model of the author’s linguistic consciousness include: the individual author’s linguistic features, which are subjective and syntagmatic in nature; the ethnonational and sociocultural-cognitive author’s attitudes; the use of intermediate tools of cognitive-pragmatic psychology to attract multidimensional matrix attention, thereby helping to obtain aesthetic perception from fictitious interaction with the artistic world; the emotive density at points of structural tension; the linguoesthetic discursive markers; the integration of cultural, historical and ethnonational realities. The proposed dimensions require further expansion and detail, but in this form, they allow us to apply them in the methodology of analyzing artistic discourse on the issue of identifying ideodiscursive and linguoesthetic features in the use of writing practices.

Philology. Linguistics

Detail DOI Sumber

arXiv Open Access 2025

AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

Mohammad Zbeeb, Hasan Abed Al Kader Hammoud, Sina Mukalled et al.

We present AraLingBench: a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.

en cs.CL, cs.AI

Detail Sumber

arXiv Open Access 2025

LUCY: Linguistic Understanding and Control Yielding Early Stage of Her

Heting Gao, Hang Shao, Xiong Wang et al.

The film Her features Samantha, a sophisticated AI audio agent who is capable of understanding both linguistic and paralinguistic information in human speech and delivering real-time responses that are natural, informative and sensitive to emotional subtleties. Moving one step toward more sophisticated audio agent from recent advancement in end-to-end (E2E) speech systems, we propose LUCY, a E2E speech model that (1) senses and responds to user's emotion, (2) deliver responses in a succinct and natural style, and (3) use external tool to answer real-time inquiries. Experiment results show that LUCY is better at emotion control than peer models, generating emotional responses based on linguistic emotional instructions and responding to paralinguistic emotional cues. Lucy is also able to generate responses in a more natural style, as judged by external language models, without sacrificing much performance on general question answering. Finally, LUCY can leverage function calls to answer questions that are out of its knowledge scope.

en cs.CL, cs.SD

Detail Sumber

arXiv Open Access 2025

ViSoLex: An Open-Source Repository for Vietnamese Social Media Lexical Normalization

Anh Thi-Hoang Nguyen, Dung Ha Nguyen, Kiet Van Nguyen

ViSoLex is an open-source system designed to address the unique challenges of lexical normalization for Vietnamese social media text. The platform provides two core services: Non-Standard Word (NSW) Lookup and Lexical Normalization, enabling users to retrieve standard forms of informal language and standardize text containing NSWs. ViSoLex's architecture integrates pre-trained language models and weakly supervised learning techniques to ensure accurate and efficient normalization, overcoming the scarcity of labeled data in Vietnamese. This paper details the system's design, functionality, and its applications for researchers and non-technical users. Additionally, ViSoLex offers a flexible, customizable framework that can be adapted to various datasets and research requirements. By publishing the source code, ViSoLex aims to contribute to the development of more robust Vietnamese natural language processing tools and encourage further research in lexical normalization. Future directions include expanding the system's capabilities for additional languages and improving the handling of more complex non-standard linguistic patterns.

en cs.CL, cs.AI

Detail Sumber

arXiv Open Access 2025

LLMs Struggle with NLI for Perfect Aspect: A Cross-Linguistic Study in Chinese and Japanese

Jie Lu, Du Jin, Hitomi Yanaka

Unlike English, which uses distinct forms (e.g., had, has, will have) to mark the perfect aspect across tenses, Chinese and Japanese lack separate grammatical forms for tense within the perfect aspect, which complicates Natural Language Inference (NLI). Focusing on the perfect aspect in these languages, we construct a linguistically motivated, template-based NLI dataset (1,350 pairs per language). Experiments reveal that even advanced LLMs struggle with temporal inference, particularly in detecting subtle tense and reference-time shifts. These findings highlight model limitations and underscore the need for cross-linguistic evaluation in temporal semantics. Our dataset is available at https://github.com/Lujie2001/CrossNLI.

en cs.CL

Detail Sumber

arXiv Open Access 2025

The Sound of Populism: Distinct Linguistic Features Across Populist Variants

Yu Wang, Runxi Yu, Zhongyuan Wang et al.

This study explores the sound of populism by integrating the classic Linguistic Inquiry and Word Count (LIWC) features, which capture the emotional and stylistic tones of language, with a fine-tuned RoBERTa model, a state-of-the-art context-aware language model trained to detect nuanced expressions of populism. This approach allows us to uncover the auditory dimensions of political rhetoric in U.S. presidential inaugural and State of the Union addresses. We examine how four key populist dimensions (i.e., left-wing, right-wing, anti-elitism, and people-centrism) manifest in the linguistic markers of speech, drawing attention to both commonalities and distinct tonal shifts across these variants. Our findings reveal that populist rhetoric consistently features a direct, assertive ``sound" that forges a connection with ``the people'' and constructs a charismatic leadership persona. However, this sound is not simply informal but strategically calibrated. Notably, right-wing populism and people-centrism exhibit a more emotionally charged discourse, resonating with themes of identity, grievance, and crisis, in contrast to the relatively restrained emotional tones of left-wing and anti-elitist expressions.

en cs.CL

Detail Sumber

arXiv Open Access 2024

Efficiently Leveraging Linguistic Priors for Scene Text Spotting

Nguyen Nguyen, Yapeng Tian, Chenliang Xu

Incorporating linguistic knowledge can improve scene text recognition, but it is questionable whether the same holds for scene text spotting, which typically involves text detection and recognition. This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models. This allows the model to capture the relationship between characters in the same word. Additionally, we introduce a technique to generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning. As a result, the newly created text distributions are more informative than pure one-hot encoding, leading to improved spotting and recognition performance. Our method is simple and efficient, and it can easily be integrated into existing auto-regressive-based approaches. Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words. It significantly improves both state-of-the-art scene text spotting and recognition pipelines, achieving state-of-the-art results on several benchmarks.

en cs.CV

Detail Sumber

arXiv Open Access 2024

Decentralised Emergence of Robust and Adaptive Linguistic Conventions in Populations of Autonomous Agents Grounded in Continuous Worlds

Jérôme Botoko Ekila, Jens Nevens, Lara Verheyen et al.

This paper introduces a methodology through which a population of autonomous agents can establish a linguistic convention that enables them to refer to arbitrary entities that they observe in their environment. The linguistic convention emerges in a decentralised manner through local communicative interactions between pairs of agents drawn from the population. The convention consists of symbolic labels (word forms) associated to concept representations (word meanings) that are grounded in a continuous feature space. The concept representations of each agent are individually constructed yet compatible on a communicative level. Through a range of experiments, we show (i) that the methodology enables a population to converge on a communicatively effective, coherent and human-interpretable linguistic convention, (ii) that it is naturally robust against sensor defects in individual agents, (iii) that it can effectively deal with noisy observations, uncalibrated sensors and heteromorphic populations, (iv) that the method is adequate for continual learning, and (v) that the convention self-adapts to changes in the environment and communicative needs of the agents.

en cs.AI, cs.CL

Detail Sumber

DOAJ Open Access 2023

Minderheitensprache im didaktischen Kontext am Beispiel des Deutschunterrichts in Polen

Ryszard Ziaja

In the following article, an attempt is made to define basic approaches for a future successful design of German minority teaching in Poland. On the one hand, reference is made to the thirty-year history and current situation of teaching in Poland and, on the other hand, to the already researched fields of activity of German minority teaching in Eastern Europe. Linguistic as well as cultural and methodological aspects are explored. The contribution should be understood as a scientific suggestion to redefine the term minority language in a didactic context.

Language. Linguistic theory. Comparative grammar

Detail DOI Sumber

DOAJ Open Access 2023

Knut Hamsuns Pan og Joseph Conrads Heart of Darkness

Martin Humpál

The article examines similarities between ”Glahns død. Et papir fra 1861”, i.e., the second part of Knut Hamsun’s novel Pan (1894), and Joseph Conrad’s novel Heart of Darkness (1899). The comparative analysis demonstrates that the two texts have several common features both in terms of setting, thematic aspects and narrative technique. Both ”Glahns død” and Heart of Darkness take as a point of departure a riverboat journey into the jungle. Both texts thematize the differences between the European civilization and the “uncivilized” exotic world and focus on the contrasts between the civilized and the primitive life, the rational and the irrational behavior. Both stories are narrated by a male narrator personality that is strongly fascinated by another man, and in each case the text involves a partially unreliable witness type of narrator. The article describes some of these common features as modernist and confirms the position of both novels in early modernism.

Philology. Linguistics

Detail DOI Sumber

arXiv Open Access 2023

On the Amplification of Linguistic Bias through Unintentional Self-reinforcement Learning by Generative Language Models -- A Perspective

Minhyeok Lee

Generative Language Models (GLMs) have the potential to significantly shape our linguistic landscape due to their expansive use in various digital applications. However, this widespread adoption might inadvertently trigger a self-reinforcement learning cycle that can amplify existing linguistic biases. This paper explores the possibility of such a phenomenon, where the initial biases in GLMs, reflected in their generated text, can feed into the learning material of subsequent models, thereby reinforcing and amplifying these biases. Moreover, the paper highlights how the pervasive nature of GLMs might influence the linguistic and cognitive development of future generations, as they may unconsciously learn and reproduce these biases. The implications of this potential self-reinforcement cycle extend beyond the models themselves, impacting human language and discourse. The advantages and disadvantages of this bias amplification are weighed, considering educational benefits and ease of future GLM learning against threats to linguistic diversity and dependence on initial GLMs. This paper underscores the need for rigorous research to understand and address these issues. It advocates for improved model transparency, bias-aware training techniques, development of methods to distinguish between human and GLM-generated text, and robust measures for fairness and bias evaluation in GLMs. The aim is to ensure the effective, safe, and equitable use of these powerful technologies, while preserving the richness and diversity of human language.

en cs.CL, cs.AI

Detail Sumber

arXiv Open Access 2023

Multi-round Dynamic Group Decision Making Method On 2-Dimension Uncertain Linguistic Variables

Yukun Zhang

The language evaluation information of the interactive group decision method at present is based on the one-dimension language variable. At the same time, multi-attribute group decision making method based on two-dimension linguistic information only use single-stage and static evaluation method. In this paper, we propose a dynamic group decision making method based on two-dimension linguistic information, combining dynamic interactive group decision making methods with two-dimensional language evaluation information The method first use Two-Dimensional Uncertain Linguistic Generalized Weighted Aggregation (DULGWA) Operators to aggregate the preference information of each decision maker, then adopting dynamic information entropy method to obtain weights of attributes at each stage. Finally we propose the group consistency index to quantify the termination conditions of group interaction. One example is given to verify the developed approach and to demonstrate its effectiveness.

en cs.SI

Detail Sumber

DOAJ Open Access 2022

Implementing media literacy in the classroom

Biljana Temelkova, Julijana Zlatevska, Saska Delova

Social networks use a variety of models from the spectrum of web technologies. These tools were designed for entertainment and communication purposes, and the most commonly used among students are the social media tools. Knowing this, teachers have incorporated traditional teaching strategies as part of increasing the ability of students to grasp the objectives. Considering that most students find social media tools interesting, most researchers have supported this strategy as one of the most innovative methods of improving the learning outcomes. Students today can learn almost everything on their own or with their teachers using the Internet, talk to people from all over the world, share experiences with their peers, get acquainted with the world and other people, customs and cultures through their own personal experience. Researching a particular topic to find information, literature by authors from around the world and literary criticism, there are things that can open their view of the world, as well as enable them to think critically and creatively. These are some of the most significant benefits of a global network. This study wants to indicate the level of awareness in teachers and students about the proper use of social media and media literacy as well as to address the necessity to overcome the lack of media literacy knowledge within youth as well as the wider population. It also focuses on pointing out the dangers that social media causes with illiterate youth. Students who daily take in a huge amount of information from a wide array of sources, are going to become better critical thinkers and smart consumers that recognize the point of view and recognize the author’s goal. Hopefully in future, teachers will pay more attention to what extent and in what manner their students use social media and pay particular attention to teach them how to be more alert when using social media.

Education, Communication. Mass media

Detail Sumber

arXiv Open Access 2022

Addressing Segmentation Ambiguity in Neural Linguistic Steganography

Jumon Nozaki, Yugo Murawaki

Previous studies on neural linguistic steganography, except Ueoka et al. (2021), overlook the fact that the sender must detokenize cover texts to avoid arousing the eavesdropper's suspicion. In this paper, we demonstrate that segmentation ambiguity indeed causes occasional decoding failures at the receiver's side. With the near-ubiquity of subwords, this problem now affects any language. We propose simple tricks to overcome this problem, which are even applicable to languages without explicit word boundaries.

en cs.CL

Detail Sumber

CrossRef Open Access 2021

Article Title: “Linguistics, Philology, and the Role of Theory”

Willem S. Boshoff

en

Detail DOI Sumber

DOAJ Open Access 2021

Indira Šabić Prezimena u Bosni i Hercegovini

Tanja Kuštović

Indira Šabić Prezimena u Bosni i Hercegovini Tuzla: Off-set Tuzla, 2020., 541 str.

Philology. Linguistics

Detail Sumber

DOAJ Open Access 2020

Феномен «слов года» в немецком языке

Elena A. Smolonogina

Настоящее исследование посвящено словам года в немецком языке как сравнительно мало изученному лингвистическому феномену. В статье предпринимается попытка описания понятия «слово года» в сопоставлении с близким ему понятием «неологизм». Помимо определения статуса слов года в современной лингвистике, особое внимание уделяется дальнейшей судьбе слов года после их опубликования, их роли в пополнении словарного состава немецкого языка и лексикографическому статусу. Слова года во многом сопоставимы с неологизмами: большинство слов года представляет собой результат словообразования, заимствования или расширения значения слова. Однако среди слов года есть группа лексических единиц, давно существующих в немецком языке. Именно она наглядно демонстрирует природу слов года. Слова года — лексические единицы, значение которых имеет особую актуальность. Под актуальностью понимается значимость лексической единицы для носителей языка и культуры в целом в конкретный временной период, а также связанный с этим возможный рост частотности ее употребления. Актуальность является следствием возникновения определенного экстралингвистического контекста использования слова и контекстуального значения, обусловленного дискурсом данного временного периода. Непродолжительное фиксированное время существования в статусе слова года и соотнесенность с конкретным временным периодом объединяет слова года и неологизмы. Совпадает и их дальнейший жизненный путь, который может вести как к закреплению слов или их значений в языке, так и к исчезновению. Результаты исследования показывают, что включение или невключение слов года в словарь не определяется такими факторами, как большая частотность употребления, наличие деривационной, парадигматической и синтагматической активности. Данные факторы важны при закреплении в языке неологизмов, но малозначимы для слов года. Определяющим критерием при фиксации слов года общими толковыми словарями является семантический. При лексикографировании слов года их особый статус в языке, как правило, не указывается.

Philology. Linguistics

Detail DOI Sumber

DOAJ Open Access 2020

La autonomía informativa en Comunicación Nuclear: análisis de la legislación vigente

Enrique Cobos Urbina

La opinión pública en España se ha posicionado históricamente en contra de la energía nuclear. El sector atómico señala la falta de información y el desconocimiento de la población como las causas principales de ese rechazo. El puente que une a la opinión pública y a la industria atómica debe trazarse con información transparente por parte de la autoridad responsable de informar (ya sea el Gobierno, el organismo regulador, o una central nuclear) para que los ciudadanos estén protegidos en caso de sucesos imprevistos, y para que puedan comprender las ventajas e inconvenientes de esta fuente de energía. En este sentido, la legislación en materia de Comunicación Nuclear contempla la transparencia como la base fundamental para esa interacción con la sociedad. En este artículo se revisa la normativa internacional, europea y nacional para conocer si la población está protegida por la ley en materia informativa en asuntos nucleares y para conocer la proyección e influencia que esa regulación tiene en la gestión de la comunicación de estas industrias.

Communication. Mass media, Journalism. The periodical press, etc.

Detail Sumber

arXiv Open Access 2020

Detecting Asks in SE attacks: Impact of Linguistic and Structural Knowledge

Bonnie J. Dorr, Archna Bhatia, Adam Dalton et al.

Social engineers attempt to manipulate users into undertaking actions such as downloading malware by clicking links or providing access to money or sensitive information. Natural language processing, computational sociolinguistics, and media-specific structural clues provide a means for detecting both the ask (e.g., buy gift card) and the risk/reward implied by the ask, which we call framing (e.g., lose your job, get a raise). We apply linguistic resources such as Lexical Conceptual Structure to tackle ask detection and also leverage structural clues such as links and their proximity to identified asks to improve confidence in our results. Our experiments indicate that the performance of ask detection, framing detection, and identification of the top ask is improved by linguistically motivated classes coupled with structural clues such as links. Our approach is implemented in a system that informs users about social engineering risk situations.

en cs.CL

Detail Sumber

Hasil untuk "Philology. Linguistics"