Recent provably secure linguistic steganography (PSLS) methods rely on mainstream autoregressive language models (ARMs) to address a historically challenging task: disguising covert communication as ``innocuous'' natural language communication. However, because ARMs generate text sequentially, any change to the stegotext produced by ARM-based PSLS methods triggers serious error propagation, making existing methods unusable under active tampering attacks. To address this, we propose a robust, provably secure linguistic steganography method based on diffusion language models (DLMs). Unlike ARMs, DLMs can generate text in a partially parallel manner, allowing us to find robust positions for steganographic embedding that can be combined with error-correcting codes. Furthermore, we introduce error-correction strategies, including pseudo-random error correction and neighborhood search correction, during steganographic extraction. Theoretical proofs and experimental results demonstrate that our method is secure and robust. It can resist token ambiguity in stegotext segmentation and, to some extent, withstand token-level insertion, deletion, and substitution attacks.
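As a toy illustration of the error-correcting idea mentioned above (our own sketch, not the paper's embedding scheme): a simple repetition code lets extraction survive isolated bit flips caused by tampering. The functions and parameters below are hypothetical.

```python
# Toy sketch: protect secret bits with a repetition-3 error-correcting code
# so that extraction survives a single flipped bit per group. The DLM and
# the choice of robust embedding positions are abstracted away entirely.

def encode_repetition(bits, r=3):
    """Expand each secret bit into r copies before embedding."""
    return [b for b in bits for _ in range(r)]

def decode_repetition(noisy_bits, r=3):
    """Majority-vote each group of r received bits."""
    out = []
    for i in range(0, len(noisy_bits), r):
        group = noisy_bits[i:i + r]
        out.append(1 if sum(group) * 2 > len(group) else 0)
    return out

secret = [1, 0, 1, 1]
channel = encode_repetition(secret)
channel[2] ^= 1          # simulate one tampered token flipping a bit
assert decode_repetition(channel) == secret
```

A repetition code is the simplest possible choice; any standard code (e.g. Hamming or BCH) could play the same role in this sketch.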
Online marketplaces, while revolutionizing global commerce, have inadvertently facilitated the proliferation of illicit activities, including drug trafficking, counterfeit sales, and cybercrimes. Traditional content moderation methods such as manual reviews and rule-based automated systems struggle with scalability, dynamic obfuscation techniques, and multilingual content. Conventional machine learning models, though effective in simpler contexts, often falter when confronting the semantic complexities and linguistic nuances characteristic of illicit marketplace communications. This research investigates the efficacy of Large Language Models (LLMs), specifically Meta's Llama 3.2 and Google's Gemma 3, in detecting and classifying illicit online marketplace content using the multilingual DUTA10K dataset. Employing fine-tuning techniques such as Parameter-Efficient Fine-Tuning (PEFT) and quantization, these models were systematically benchmarked against a foundational transformer-based model (BERT) and traditional machine learning baselines (Support Vector Machines and Naive Bayes). Experimental results reveal a task-dependent advantage for LLMs. In binary classification (illicit vs. non-illicit), Llama 3.2 demonstrated performance comparable to traditional methods. However, for complex, imbalanced multi-class classification involving 40 specific illicit categories, Llama 3.2 significantly surpassed all baseline models. These findings offer substantial practical implications for enhancing online safety, equipping law enforcement agencies, e-commerce platforms, and cybersecurity specialists with more effective, scalable, and adaptive tools for illicit content detection and moderation.
Aishwarya Ramasethu, Niyathi Allu, Rohin Garg
et al.
Large Language Models (LLMs) have achieved strong performance across many downstream tasks, yet their effectiveness in extremely low-resource machine translation remains limited. Standard adaptation techniques typically rely on large-scale parallel data or extensive fine-tuning, which are infeasible for the long tail of underrepresented languages. In this work, we investigate a more constrained question: in data-scarce settings, to what extent can linguistically similar pivot languages and few-shot demonstrations provide useful guidance for on-the-fly adaptation in LLMs? We study a data-efficient experimental setup that combines linguistically related pivot languages with few-shot in-context examples, without any parameter updates, and evaluate translation behavior under controlled conditions. Our analysis shows that while pivot-based prompting can yield improvements in certain configurations, particularly in settings where the target language is less well represented in the model's vocabulary, the gains are often modest and sensitive to few-shot example construction. For closely related or better represented varieties, we observe diminishing or inconsistent gains. Our findings provide empirical guidance on how and when inference-time prompting and pivot-based examples can be used as a lightweight alternative to fine-tuning in low-resource translation settings.
Large language models (LLMs) offer a new empirical setting in which long-standing theories of linguistic meaning can be examined. This paper contrasts two broad approaches: social constructivist accounts associated with language games, and a mathematically oriented framework we call Semantic Field Theory. Building on earlier work by the author, we formalize the notions of lexical fields (Lexfelder) and linguistic fields (Lingofelder) as interacting structures in a continuous semantic space. We then analyze how core properties of transformer architectures, such as distributed representations, attention mechanisms, and geometric regularities in embedding spaces, relate to these concepts. We argue that the success of LLMs in capturing semantic regularities supports the view that language exhibits an underlying mathematical structure, while their persistent limitations in pragmatic reasoning and context sensitivity are consistent with the importance of social grounding emphasized in philosophical accounts of language use. On this basis, we suggest that mathematical structure and language games can be understood as complementary rather than competing perspectives. The resulting framework clarifies the scope and limits of purely statistical models of language and motivates new directions for theoretically informed AI architectures.
Samuel Ebimobowei Johnny, Blessed Guda, Andrew Blayama Stephen
et al.
Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between the hearing and hearing-impaired communities. This requires recognizing and interpreting the hand, face, and body gestures of the signer, which poses a challenge because all of these features must be combined. Continuous Sign Language Recognition (CSLR) methods rely on multi-stage pipelines that first extract visual features and then align variable-length sequences with target glosses using CTC- or HMM-based approaches. However, these alignment-based methods suffer from error propagation across stages and overfitting, and they struggle with vocabulary scalability due to the intermediate gloss representation bottleneck. To address these limitations, we propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text, bypassing traditional alignment mechanisms entirely. This decoder-only approach allows the model to map directly between the features and the glosses without the need for a CTC loss, while also learning the textual dependencies in the glosses. Our approach incorporates a temporal compression module using 1D CNNs to efficiently process pose sequences, followed by AraGPT2, a pre-trained Arabic decoder, to generate text (glosses). Through comprehensive ablation studies, we demonstrate that hand and body gestures provide the most discriminative features for signer-independent CSLR. By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset, improving WER by up to 6.1\% compared to the best existing method.
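The temporal compression step described above can be sketched as a strided 1D convolution over pose features; the shapes, kernel size, and stride below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Illustrative sketch: a strided 1D convolution that compresses a pose
# sequence of length T into fewer frames before it reaches a text decoder.

def temporal_compress(x, kernel, stride=2):
    """x: (T, D) pose features; kernel: (K, D, D_out). Returns (T', D_out)."""
    T, D = x.shape
    K, _, D_out = kernel.shape
    T_out = (T - K) // stride + 1
    out = np.empty((T_out, D_out))
    for t in range(T_out):
        window = x[t * stride: t * stride + K]       # (K, D) slice in time
        out[t] = np.einsum('kd,kdo->o', window, kernel)
    return out

x = np.random.randn(100, 8)        # 100 pose frames, 8-dim features
w = np.random.randn(4, 8, 16)      # kernel size 4, 16 output channels
y = temporal_compress(x, w, stride=2)
assert y.shape == (49, 16)         # (100 - 4) // 2 + 1 = 49 compressed steps
```

In practice such a module would be a learned `Conv1d` layer inside the network; the numpy loop only makes the length arithmetic explicit.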
Large Reasoning Language Models (LRLMs or LRMs) demonstrate remarkable capabilities in complex reasoning tasks but suffer from significant computational inefficiency due to overthinking. Existing efficient-reasoning methods face the challenge of balancing reasoning quality with inference cost reduction. We propose \textbf{Adaptive Reasoning Suppression (ARS)}, a novel training-free approach that dynamically suppresses redundant reasoning steps while preserving accuracy through adaptive certainty monitoring. ARS introduces a multi-checkpoint certainty estimation mechanism with progressive suppression thresholds, achieving superior efficiency compared to static suppression methods. Our extensive evaluation across mathematical reasoning benchmarks using multiple model architectures demonstrates that ARS achieves token, latency, and energy reductions of up to 53%, 46.1%, and 57.9%, respectively, while maintaining or improving accuracy.
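A minimal sketch of the multi-checkpoint idea (our reconstruction; the checkpoint interval, starting threshold, and decay are assumed values, not ARS's actual configuration):

```python
# Hypothetical sketch: check the model's answer certainty at fixed
# checkpoints and stop reasoning once certainty clears a threshold that
# progressively relaxes as more reasoning steps are spent.

def adaptive_stop(certainties, start=0.95, decay=0.05, every=4):
    """certainties: per-step confidence in the current answer.
    Returns the step index at which reasoning is suppressed, or None."""
    threshold = start
    for step, c in enumerate(certainties, start=1):
        if step % every == 0:                        # checkpoint reached
            if c >= threshold:
                return step                          # suppress the rest
            threshold = max(0.5, threshold - decay)  # progressive relaxation
    return None

trace = [0.6, 0.7, 0.72, 0.8, 0.85, 0.9, 0.91, 0.93]
assert adaptive_stop(trace) == 8   # clears the relaxed second checkpoint
```

The progressive threshold is what distinguishes this from a static cutoff: early exits demand high certainty, while later exits are granted more readily.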
Pre-trained language models have achieved remarkable success across diverse applications but remain susceptible to spurious, concept-driven correlations that impair robustness and fairness. In this work, we introduce CURE, a novel and lightweight framework that systematically disentangles and suppresses conceptual shortcuts while preserving essential content information. Our method first extracts concept-irrelevant representations via a dedicated content extractor reinforced by a reversal network, ensuring minimal loss of task-relevant information. A subsequent controllable debiasing module employs contrastive learning to finely adjust the influence of residual conceptual cues, enabling the model to either diminish harmful biases or harness beneficial correlations as appropriate for the target task. Evaluated on the IMDB and Yelp datasets using three pre-trained architectures, CURE achieves an absolute improvement of +10 points in F1 score on IMDB and +2 points on Yelp, while introducing minimal computational overhead. Our approach establishes a flexible, unsupervised blueprint for combating conceptual biases, paving the way for more reliable and fair language understanding systems.
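The contrastive component can be illustrated with a standard InfoNCE-style loss; this is a generic sketch under assumed inputs, not CURE's implementation.

```python
import numpy as np

# Generic contrastive sketch: pull an anchor embedding toward a positive
# and away from negatives, the kind of objective used to control how
# strongly residual concept cues influence a representation.

def info_nce(anchor, positive, negatives, tau=0.1):
    """All inputs are unit-normalized 1-D embeddings; tau is temperature."""
    pos = np.exp(anchor @ positive / tau)
    neg = sum(np.exp(anchor @ n / tau) for n in negatives)
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)
unit = lambda v: v / np.linalg.norm(v)
a = unit(rng.standard_normal(16))
negs = [unit(rng.standard_normal(16)) for _ in range(8)]
loss_close = info_nce(a, a, negs)   # positive identical to anchor
assert 0.0 < loss_close < 1.0       # near-perfect alignment gives low loss
```

Adjusting the temperature or the weighting of this term is one way a debiasing module can finely tune how much residual conceptual signal survives.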
A prominent achievement of natural language processing (NLP) is its ability to understand and generate meaningful human language. This capability relies on the complex feedforward transformer-block architectures of pre-trained large language models (LLMs). However, LLM pre-training is currently feasible only for a few dominant companies due to the immense computational resources required, limiting broader research participation. This creates a critical need for more accessible alternatives. In this study, we explore whether tiny language models (TLMs) exhibit the same key qualitative features as LLMs. We demonstrate that TLMs exhibit a clear performance gap between pre-trained and non-pre-trained models across classification tasks, indicating the effectiveness of pre-training even at a tiny scale. The performance gap increases with the size of the pre-training dataset and with greater overlap between tokens in the pre-training and classification datasets. Furthermore, the classification accuracy achieved by a pre-trained deep TLM architecture can be replicated through a soft committee of multiple, independently pre-trained shallow architectures, enabling low-latency TLMs without affecting classification accuracy. Our results are based on pre-training BERT-6 and variants of BERT-1 on subsets of the Wikipedia dataset and evaluating their performance on the FewRel, AGNews, and DBPedia classification tasks. Future research on TLMs is expected to further illuminate the mechanisms underlying NLP, especially given that biologically inspired models suggest TLMs may be sufficient for children or adolescents to develop language. The data and code that support the findings of this study are openly available at https://github.com/Rg32601/Tiny-Language-Models.
Urjitkumar Patel, Fang-Chun Yeh, Chinmay Gondhalekar
et al.
In the rapidly evolving financial sector, the accurate and timely interpretation of market news is essential for stakeholders needing to navigate unpredictable events. This paper introduces FANAL (Financial Activity News Alerting Language Modeling Framework), a specialized BERT-based framework engineered for real-time financial event detection and analysis, categorizing news into twelve distinct financial categories. FANAL leverages silver-labeled data processed through XGBoost and employs advanced fine-tuning techniques, alongside ORBERT (Odds Ratio BERT), a novel variant of BERT fine-tuned with ORPO (Odds Ratio Preference Optimization) for superior class-wise probability calibration and alignment with financial event relevance. We evaluate FANAL's performance against leading large language models, including GPT-4o, Llama-3.1 8B, and Phi-3, demonstrating its superior accuracy and cost efficiency. This framework sets a new standard for financial intelligence and responsiveness, significantly outstripping existing models in both performance and affordability.
This article analyzes the representation of labor in light of its deep relationship with environmental issues and ecological thought. Through an ecopoetic analysis of Vinci's Strada Provinciale Tre (2007), this study aims to contribute to the critical-literary debate on labor by focusing on several aspects that are usually overlooked: on the one hand, the interrelation between the act of working and environmental change; on the other, the potential power of ecological images linked to labor. The first aspect is shown by observing the effects that capitalist production has on the laboring body and on the body of the land. The second is examined through three narrative processes that distinguish Vinci's novel: immersive walking, natural and anthropic rhythms, and the social and spatial aspects of mobile practices. The theoretical framework combines the thematic criticism of labor with Mobility Studies and Ecopoetics.
Geography. Anthropology. Recreation. Language. Linguistic theory. Comparative grammar
Self-access language centres (SALCs) utilise self-directed learning, which allows learners to set their own learning goals and make their own learning choices. While this approach can benefit language learners, it can also be intimidating for those not used to autonomous learning. To help learners transition from teacher-directed to self-directed learning, various interventions such as coaching, counselling, mentoring, and advising can be used. These interventions can be effective in helping learners become more self-directed, but it is important to review and consider the learner’s expectations when implementing them. This paper analyses data from a one-on-one coaching intervention between an English-speaking coach and five Japanese participants using English as a second language to explore the factors contributing to the intervention session’s perceived success.
Fatima Ambreen, Burhan Ozdemir, Dina Abdel Salam El-Dakhs
Textism, which refers to the special language used in text messaging, is widely employed in several forms of communication, such as texting, online chatting, and email. This increased use of textism has alarmed the media and the public because this special language can affect the academic writing of the younger generation. This paper aimed to explore the attitudes of undergraduate Saudi learners of English as a foreign language (EFL) toward using textism in their academic writing. A total of 115 students of different levels of language proficiency completed a survey about the use of textism across several modalities and recipients. Additionally, 240 student essays were analyzed to identify the potential use of textism in student writing. The results showed that students are generally conscious of the appropriate contexts for the use of textism. However, when it comes to the actual use of textism in writing, low-proficiency students tend to produce more textism than those at medium and high levels of writing competence.
Special aspects of education, Language acquisition
Interest in Large Language Models (LLMs) has increased drastically since the emergence of ChatGPT and the outstanding positive societal response to the ease with which it performs tasks in Natural Language Processing (NLP). The triumph of ChatGPT, however, is how it seamlessly bridges the divide between language generation and knowledge models. In some cases, it provides anecdotal evidence of a framework for replicating human intuition over a knowledge domain. This paper highlights the prevailing ideas in NLP, including machine translation, machine summarization, question-answering, and language generation, and compares the performance of ChatGPT with the major algorithms in each of these categories using the Spontaneous Quality (SQ) score. A strategy for validating the arguments and results of ChatGPT is presented summarily as an example of safe, large-scale adoption of LLMs.
This work addresses the cross-corpora generalization issue for the low-resourced spoken language identification (LID) problem. We conduct experiments in the context of Indian LID and identify strikingly poor cross-corpora generalization due to corpora-dependent non-lingual biases. Our contributions are twofold. First, we propose domain diversification, which diversifies the limited training data using different audio data augmentation methods. We then propose the concept of maximally diversity-aware cascaded augmentations and optimize the augmentation fold-factor for effective diversification of the training data. Second, we introduce the idea of domain generalization, treating the augmentation methods as pseudo-domains. Towards this, we investigate both domain-invariant and domain-aware approaches. Our LID system is based on the state-of-the-art emphasized channel attention, propagation, and aggregation based time delay neural network (ECAPA-TDNN) architecture. We conduct extensive experiments with three widely used corpora for Indian LID research. In addition, we perform a final blind evaluation of our proposed methods on the Indian subset of the VoxLingua107 corpus collected in the wild. Our experiments demonstrate that the proposed domain diversification is more promising than commonly used simple augmentation methods. The study also reveals that domain generalization is a more effective solution than domain diversification. We further observe that domain-aware learning performs better for same-corpora LID, whereas domain-invariant learning is more suitable for cross-corpora generalization. Compared to basic ECAPA-TDNN, the proposed domain-invariant extensions improve the cross-corpora EER by up to 5.23%, while the proposed domain-aware extensions also improve performance in same-corpora test scenarios.
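The diversification idea, with augmentations treated as pseudo-domains, can be sketched as follows; the specific augmentations and fold-factor here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Sketch: each augmentation (or cascade of augmentations) acts as a
# pseudo-domain, and applying several of them multiplies the effective
# diversity of a limited training set.

def add_noise(x, snr_db=20):
    """Mix in white noise at a target signal-to-noise ratio (in dB)."""
    noise = np.random.randn(len(x))
    scale = np.sqrt(np.mean(x**2) / (10 ** (snr_db / 10)) / np.mean(noise**2))
    return x + scale * noise

def gain(x, db=6):
    """Apply a fixed volume change (in dB)."""
    return x * (10 ** (db / 20))

def cascade(x, augs):
    """Apply a list of augmentations in sequence: one cascaded pseudo-domain."""
    for f in augs:
        x = f(x)
    return x

wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s of audio
pseudo_domains = [[add_noise], [gain], [add_noise, gain]]   # fold-factor = 3
augmented = [cascade(wave.copy(), augs) for augs in pseudo_domains]
assert len(augmented) == 3 and all(a.shape == wave.shape for a in augmented)
```

A domain-aware model would additionally receive the pseudo-domain index as a label, whereas a domain-invariant model would be trained to ignore it.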
This article examines the history of the Ukrainian language from the perspective of language planning, drawing on the concepts of H. Kloss, J. Fishman, and W. Croft. The processes of Ausbau and Einbau, as well as the regional particulars of the entire dialect continuum, are described with attention to the influence of Russian, Polish, Hungarian, and Romanian.
The Gersten conjecture is still an open problem of algebraic $K$-theory for mixed characteristic discrete valuation rings. In this paper, we establish non-unital algebraic $K$-theory, modified to become an exact functor from the category of non-unital algebras to the stable $\infty$-category of spectra. We prove that for any almost unital algebra, the non-unital $K$-theory homotopically decomposes into the non-unital $K$-theory of the corresponding ideal and that of the residue algebra, implying the Gersten property of the non-unital $K$-theory of the corresponding ideal.
Maren Pielka, Svetlana Schmidt, Lisa Pucknat
et al.
We introduce a linguistically enhanced combination of pre-training methods for transformers. The pre-training objectives include POS tagging, synset prediction based on semantic knowledge graphs, and parent prediction based on dependency parse trees. Our approach achieves competitive results on the Natural Language Inference task compared to the state of the art. For smaller models in particular, the method yields a significant performance boost, emphasizing that intelligent pre-training can compensate for fewer parameters and help build more efficient models. Combining POS tagging and synset prediction yields the overall best results.
French and Spanish are Romance languages. Both descend from Vulgar Latin, making them sister languages. Because French and Spanish are sister languages, they share many similarities, including in grammar, although each naturally has its own distinctive features. The aim of this article is to identify, analyze, and explain the grammatical similarities found in bilingual French-Spanish songs. These shared grammatical features further support the claim that the two languages are sisters. The research method used in this article is qualitative. Data were collected through non-participatory observation and note-taking. Comparative analysis, literature review, and narrative elaboration were the techniques used in data analysis. Keywords: French, Spanish, Grammar, Similarities
In November 2019, Simon Jolin-Barrette, the Quebec minister responsible for the French language, revisited a controversy that had taken place two years earlier concerning the bilingual greeting bonjour/hi used in Montreal businesses, considered by some to reflect the decline of the French language in Quebec. The minister stated that he intended to ban the greeting in favor of the French ritual bonjour. In a conflicted sociolinguistic context such as Quebec's, such assertions cannot go unnoticed. Jolin-Barrette's announcement resounded in the media, where it provoked such strong reactions that a new polemic broke out. The aim of this article is to show how the reactions provoked, especially online, contributed to the deconstruction of the minister's discourse and to its inability to impose itself as a discourse of authority. Our analysis focuses on the (techno)discursive modalities through which Jolin-Barrette's authority is diminished or denied.