Results for "Philology. Linguistics"

Showing 20 of ~794,310 results · from arXiv, Semantic Scholar, DOAJ, CrossRef

arXiv Open Access 2026
Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR

Yunhao Liang, Ruixuan Ying, Bo Li et al.

DeepSeek-OCR utilizes an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming to decode text tokens exceeding ten times the input visual tokens. While this suggests a promising solution for the LLM long-context bottleneck, we investigate a critical question: "Visual merit or linguistic crutch - which drives DeepSeek-OCR's performance?" By employing sentence-level and word-level semantic corruption, we isolate the model's intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek-OCR's performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on priors, exacerbating hallucination risks. Context stress testing also reveals a total model collapse around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically defines DeepSeek-OCR's capability boundaries and offers essential insights for future optimizations of the vision-text compression paradigm. We release all data, results and scripts used in this study at https://github.com/dududuck00/DeepSeekOCR.

en cs.CL, cs.CV
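The word-level semantic corruption used above to isolate OCR ability from language priors can be approximated by shuffling the interior characters of each word: the text becomes linguistically unpredictable while word lengths and page layout are preserved. A minimal sketch (the function name and corruption scheme are illustrative, not the authors' released scripts):

```python
import random

def corrupt_words(text: str, seed: int = 0) -> str:
    """Shuffle the interior characters of each word so the result is no
    longer recoverable from language priors, while word lengths and
    overall visual layout stay intact."""
    rng = random.Random(seed)
    out = []
    for word in text.split(" "):
        if len(word) > 3:
            inner = list(word[1:-1])
            rng.shuffle(inner)
            word = word[0] + "".join(inner) + word[-1]
        out.append(word)
    return " ".join(out)

print(corrupt_words("The quick brown fox jumps over the lazy dog"))
```

An OCR model that truly reads pixels should transcribe such corrupted text as accurately as normal text; a model leaning on priors will "correct" it back toward real words.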
arXiv Open Access 2025
NSL-MT: Linguistically Informed Negative Samples for Efficient Machine Translation in Low-Resource Languages

Mamadou K. Keita, Christopher Homan, Huy Le

We introduce Negative Space Learning MT (NSL-MT), a training method that teaches models what not to generate by encoding linguistic constraints as severity-weighted penalties in the loss function. NSL-MT augments limited parallel data with synthetically generated violations of target-language grammar, explicitly penalizing the model when it assigns high probability to these linguistically invalid outputs. We demonstrate that NSL-MT delivers improvements across all architectures: 3-12% BLEU gains for well-performing models and 56-89% gains for models lacking decent initial support. Furthermore, NSL-MT provides a 5x data-efficiency multiplier: training with 1,000 examples matches or exceeds normal training with 5,000 examples. NSL-MT thus offers a data-efficient alternative training method for settings with limited annotated parallel corpora.

en cs.LG
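The severity-weighted penalty described above can be sketched as a standard negative log-likelihood term plus the probability mass the model assigns to synthetic grammar violations, scaled by each violation's severity. A minimal numpy sketch (function names, the additive combination, and the severity scheme are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def nsl_penalty(log_probs, violation_ids, severities):
    """Severity-weighted penalty: the probability mass the model assigns
    to linguistically invalid tokens, scaled by violation severity."""
    probs = np.exp(log_probs)  # (seq_len, vocab)
    viol_p = probs[np.arange(len(violation_ids)), violation_ids]
    return float(np.mean(severities * viol_p))

def nsl_mt_loss(log_probs, target_ids, violation_ids, severities, lam=1.0):
    """Usual per-token NLL on the gold target, plus the weighted penalty."""
    nll = -np.mean(log_probs[np.arange(len(target_ids)), target_ids])
    return nll + lam * nsl_penalty(log_probs, violation_ids, severities)
```

Severe violations (e.g. broken agreement) would carry larger weights than mild ones, so the model is pushed hardest away from the worst outputs.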
arXiv Open Access 2025
A High-Capacity and Secure Disambiguation Algorithm for Neural Linguistic Steganography

Yapei Feng, Feng Jiang, Shanhao Wu et al.

Neural linguistic steganography aims to embed information into natural text while preserving statistical undetectability. A fundamental challenge in this field stems from tokenization ambiguity in modern tokenizers, which can lead to catastrophic decoding failures. The recent method, SyncPool, addresses this ambiguity by employing a coarse-grained synchronization mechanism over groups of ambiguous candidates. However, SyncPool sacrifices embedding capacity, as it utilizes the entire Shannon entropy of an ambiguous group solely for synchronization rather than for payload embedding. We propose a method named look-ahead Sync, which overcomes the capacity limitation of SyncPool while retaining its provable security guarantees. Our approach performs minimal synchronized sampling only on truly indistinguishable token sequences, while strategically preserving all other discernible paths to maximize embedding capacity. We provide theoretical proofs for the security of our method and analyze the gap between its achievable embedding capacity and the theoretical upper bound. Experiments on English (using Llama 3) and Chinese (using Qwen 2.5) benchmarks show that our method consistently approaches the theoretical capacity upper bound and significantly outperforms SyncPool. The improvement in embedding rate exceeds 160% in English and 25% in Chinese, particularly in settings with larger candidate pools. This work represents a significant step toward practical high-capacity provably secure linguistic steganography.

en cs.CL, cs.AI
arXiv Open Access 2025
Linguistic Complexity and Socio-cultural Patterns in Hip-Hop Lyrics

Aayam Bansal, Raghav Agarwal, Kaashvi Jain

This paper presents a comprehensive computational framework for analyzing linguistic complexity and socio-cultural trends in hip-hop lyrics. Using a dataset of 3,814 songs from 146 influential artists spanning four decades (1980-2020), we employ natural language processing techniques to quantify multiple dimensions of lyrical complexity. Our analysis reveals a 23.7% increase in vocabulary diversity over the study period, with East Coast artists demonstrating 17.3% higher lexical variation than other regions. Rhyme density increased by 34.2% across all regions, with Midwest artists exhibiting the highest technical complexity (3.04 rhymes per line). Topic modeling identified significant shifts in thematic content, with social justice themes decreasing from 28.5% to 13.8% of content while introspective themes increased from 7.6% to 26.3%. Sentiment analysis demonstrated that lyrics became significantly more negative during sociopolitical crises, with polarity decreasing by 0.31 following major social unrest. Multi-dimensional analysis revealed four distinct stylistic approaches that correlate strongly with geographic origin (r=0.68, p<0.001) and time period (r=0.59, p<0.001). These findings establish quantitative evidence for the evolution of hip-hop as both an art form and a reflection of societal dynamics, providing insights into the interplay between linguistic innovation and cultural context in popular music.

en cs.CL, cs.AI
arXiv Open Access 2025
Linguistic Entity Masking to Improve Cross-Lingual Representation of Multilingual Language Models for Low-Resource Languages

Aloka Fernando, Surangika Ranathunga

Multilingual Pre-trained Language Models (multiPLMs), trained with the Masked Language Modelling (MLM) objective, are commonly used for cross-lingual tasks such as bitext mining. However, the performance of these models is still suboptimal for low-resource languages (LRLs). To improve the language representation of a given multiPLM, it is possible to further pre-train it; this is known as continual pre-training. Previous research has shown that continual pre-training with MLM and subsequently with Translation Language Modelling (TLM) improves the cross-lingual representation of multiPLMs. However, during masking, both MLM and TLM give equal weight to all tokens in the input sequence, irrespective of the linguistic properties of the tokens. In this paper, we introduce a novel masking strategy, Linguistic Entity Masking (LEM), to be used in the continual pre-training step to further improve the cross-lingual representations of existing multiPLMs. In contrast to MLM and TLM, LEM limits masking to the linguistic entity types nouns, verbs and named entities, which hold a higher prominence in a sentence. Secondly, we limit masking to a single token within the linguistic entity span, thus keeping more context, whereas in MLM and TLM tokens are masked randomly. We evaluate the effectiveness of LEM using three downstream tasks, namely bitext mining, parallel data curation and code-mixed sentiment analysis, on three low-resource language pairs: English-Sinhala, English-Tamil, and Sinhala-Tamil. Experimental results show that continually pre-training a multiPLM with LEM outperforms a multiPLM continually pre-trained with MLM+TLM on all three tasks.

en cs.CL
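The LEM strategy above, masking exactly one token inside each linguistic-entity span, can be sketched over POS-tagged tokens as follows (the coarse tag set, `[MASK]` convention, and span grouping are illustrative assumptions; the paper operates at the subword-tokenizer level):

```python
import random

def lem_mask(tagged_tokens, mask_token="[MASK]", seed=0):
    """Mask one token per span of consecutive nouns/verbs/named entities,
    leaving all other tokens intact as context."""
    rng = random.Random(seed)
    maskable = {"NOUN", "VERB", "PROPN"}
    out = [tok for tok, _ in tagged_tokens]
    # group consecutive maskable tokens into entity spans
    spans, cur = [], []
    for i, (_, tag) in enumerate(tagged_tokens):
        if tag in maskable:
            cur.append(i)
        elif cur:
            spans.append(cur); cur = []
    if cur:
        spans.append(cur)
    for span in spans:                       # one mask per span keeps
        out[rng.choice(span)] = mask_token   # most of the entity visible
    return out

sent = [("The", "DET"), ("linguist", "NOUN"), ("annotated", "VERB"),
        ("the", "DET"), ("corpus", "NOUN")]
print(lem_mask(sent))
```

Compared with random MLM masking, only prominent content words are ever hidden, and never a whole entity at once.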
arXiv Open Access 2025
A Type 2 Fuzzy Set Approach for Building Linear Linguistic Regression Analysis under Multi Uncertainty

Junzo Watada, Pei-Chun Lin, Bo Wang et al.

In this paper, we propose a novel heuristic algorithm for constructing a Type-2 Fuzzy Set of the Linear Linguistic Regression (T2F-LLR) model, designed to address uncertainty and vagueness in real-world decision-making. We consider a practical scenario involving a cosmetic company's promotional planning across four product categories: Basic Face Care, Face Cleaning, Cosmetics, and Body Care, aimed at both male and female consumers. Data are collected using fuzzy linguistic questionnaires from customers and expert managers, with responses expressed using qualitative terms such as 'always', 'frequently', 'often', and 'sometimes'. These linguistic evaluations are modeled as Type-2 Fuzzy Set of Linear Linguistic Regression (T2F-LLR) variables to capture both randomness and higher-order fuzziness. We rigorously develop a solution framework based on a one-sigma confidence interval using the credibility measure to calculate the expected values and variances of the model output. To improve computational efficiency and decision usability, we introduce a heuristic algorithm tailored for non-meta datasets, significantly reducing the complexity of the model-solving process. The experimental results demonstrate the effectiveness of our approach, which yields a mean absolute percentage error (MAPE) of 7.97% for the weights, with all variables statistically significant. We also report results for each product using a one-way analysis of variance test (one-way ANOVA, $p$-value = 0.15) and a paired $t$-test ($p$-value = 0.16); these show no significant difference between observed and predicted weights overall. This paper provides a robust and interpretable methodology for decision makers dealing with imprecise data and time-sensitive planning.

en math.GM
arXiv Open Access 2025
Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction

Wenxuan Wu, Shuai Wang, Xixin Wu et al.

Audio-visual target speaker extraction (AV-TSE) models primarily rely on target visual cues to isolate the target speaker's voice from others. We know that humans leverage linguistic knowledge, such as syntax and semantics, to support speech perception. Inspired by this, we explore the potential of pre-trained speech-language models (PSLMs) and pre-trained language models (PLMs) as auxiliary knowledge sources for AV-TSE. In this study, we propose incorporating the linguistic constraints from PSLMs or PLMs for the AV-TSE model as additional supervision signals. Without introducing any extra computational cost during inference, the proposed approach consistently improves speech quality and intelligibility. Furthermore, we evaluate our method in multi-language settings and visual cue-impaired scenarios and show robust performance gains.

en cs.SD, cs.LG
DOAJ Open Access 2025
ASCOLTA E IMPARA: POTENZIALITÀ E IMPIEGHI GLOTTODIDATTICI DEL PODCAST

Giuseppe Sergio

Listen and learn: potential and glottodidactic uses of the podcast. The article presents some observations on the phenomenon of podcasts, focusing on their brief history – just over twenty years – and their (multi)media characteristics. By taking a closer look at educational podcasts, particularly those aimed at learners of Italian as a second or foreign language (L2/LS), the study identifies the main genres they fall into, their common features, potential, and limitations within an Italian L2/LS curriculum. Through the analysis of selected glottodidactic podcasts, the article also outlines their key linguistic features.

Language and Literature, Philology. Linguistics
DOAJ Open Access 2025
A Multidimensional Approach to Linguistic Variation in Short Turkish Texts

Hülya MISIR

This study investigates linguistic variation in Turkish using a large-scale social media corpus consisting of neutral, offensive, and hate speech tweets. Drawing on a dataset annotated for parts of speech and grammatical structures, the study identifies the main dimensions of linguistic variation through the framework of Multidimensional Analysis (MDA), using Multiple Correspondence Analysis (MCA). The paper demonstrates the use of the MCA method for Turkish, filling a notable gap in Turkish linguistic analysis, since MCA is well suited to short and contextually limited texts such as those found on social media. The analysis is conducted using the FactoMineR package in R, along with the widely used visualization tool ggplot2. This practical guide helps interpret the dimensions generated by MDA and demonstrates how results can be presented through different data visualization techniques. Additionally, the study traces temporal shifts in linguistic patterns using time-stamped and category-labeled data, visualized through various plots and heatmaps. The article is intended as a practical resource for researchers applying MDA to short-text corpora, and for those interested in the use of data visualization in linguistic analysis.

Philology. Linguistics
arXiv Open Access 2024
Linguistics-Vision Monotonic Consistent Network for Sign Language Production

Xu Wang, Shengeng Tang, Peipei Song et al.

Sign Language Production (SLP) aims to generate sign videos corresponding to spoken language sentences, where the conversion of sign Glosses to Poses (G2P) is the key step. Due to the cross-modal semantic gap and the lack of word-action correspondence labels for strong supervision alignment, the SLP suffers huge challenges in linguistics-vision consistency. In this work, we propose a Transformer-based Linguistics-Vision Monotonic Consistent Network (LVMCN) for SLP, which constrains fine-grained cross-modal monotonic alignment and coarse-grained multimodal semantic consistency in language-visual cues through Cross-modal Semantic Aligner (CSA) and Multimodal Semantic Comparator (MSC). In the CSA, we constrain the implicit alignment between corresponding gloss and pose sequences by computing the cosine similarity association matrix between cross-modal feature sequences (i.e., the order consistency of fine-grained sign glosses and actions). As for MSC, we construct multimodal triplets based on paired and unpaired samples in batch data. By pulling closer the corresponding text-visual pairs and pushing apart the non-corresponding text-visual pairs, we constrain the semantic co-occurrence degree between corresponding gloss and pose sequences (i.e., the semantic consistency of coarse-grained textual sentences and sign videos). Extensive experiments on the popular PHOENIX14T benchmark show that the LVMCN outperforms the state-of-the-art.

en cs.CV, cs.MM
arXiv Open Access 2024
LIEDER: Linguistically-Informed Evaluation for Discourse Entity Recognition

Xiaomeng Zhu, Robert Frank

Discourse Entity (DE) recognition is the task of identifying novel and known entities introduced within a text. While previous work has found that large language models have basic, if imperfect, DE recognition abilities (Schuster and Linzen, 2022), it remains largely unassessed which of the fundamental semantic properties that govern the introduction and subsequent reference to DEs they have knowledge of. We propose the Linguistically-Informed Evaluation for Discourse Entity Recognition (LIEDER) dataset that allows for a detailed examination of language models' knowledge of four crucial semantic properties: existence, uniqueness, plurality, and novelty. We find evidence that state-of-the-art large language models exhibit sensitivity to all of these properties except novelty, which demonstrates that they have yet to reach human-level language understanding abilities.

en cs.CL
arXiv Open Access 2024
I or Not I: Unraveling the Linguistic Echoes of Identity in Samuel Beckett's "Not I" Through Natural Language Processing

Arezou Zahiri Pourzarandi, Farshad Jafari

Exploring the depths of Samuel Beckett's "Not I" through advanced natural language processing techniques, this research uncovers the intricate linguistic structures that underpin the text. By analyzing word frequency, detecting emotional sentiments with a BERT-based model, and examining repetitive motifs, we unveil how Beckett's minimalist yet complex language reflects the protagonist's fragmented psyche. Our results demonstrate that recurring themes of time, memory, and existential angst are artfully woven through recursive linguistic patterns and rhythmic repetition. This innovative approach not only deepens our understanding of Beckett's stylistic contributions but also highlights his unique role in modern literature, where language transcends simple communication to explore profound existential questions.

en cs.CL
arXiv Open Access 2024
Human-Robot Mutual Learning through Affective-Linguistic Interaction and Differential Outcomes Training [Pre-Print]

Emilia Heikkinen, Elsa Silvennoinen, Imran Khan et al.

Owing to the recent success of Large Language Models, modern AI has focused largely on linguistic interaction with humans and less on non-linguistic forms of communication between human and machine. In the present paper, we test how affective-linguistic communication, in combination with differential outcomes training, affects mutual learning in a human-robot context. Taking inspiration from child-caregiver dynamics, our human-robot interaction setup consists of a (simulated) robot attempting to learn how best to communicate internal, homeostatically-controlled needs, while a human "caregiver" attempts to learn the correct object to satisfy the robot's presently communicated need. We studied the effects of i) human training type and ii) robot reinforcement learning type on mutual-learning terminal accuracy and rate of learning (as measured by the average reward achieved by the robot). Our results show that mutual learning between a human and a robot is significantly improved with Differential Outcomes Training (DOT) compared to non-DOT (control) conditions. We find further improvements when the robot uses an exploration-exploitation policy selection, compared to purely exploitative policy selection. These findings have implications for utilizing socially assistive robots (SAR) in therapeutic contexts, e.g. for cognitive interventions, and in educational applications.

en cs.RO, cs.AI
DOAJ Open Access 2024
The myth of Narcissus: Metamorphoses of medieval commentary

I. K. Staf

This article traces two types of commentaries on the myth of Narcissus (Ovid, The Metamorphoses, III), which existed in the Middle Ages in French literature in the vernacular — the courtly and the moral and allegorical. The first type of commentary is represented by individual works in which this myth is mentioned (“The Lay of Narcissus”, “The Romance of the Rose”, “The Book of Love Chess” by Evrart de Conty). In these, Ovid’s personage is treated as a violator of Amor’s laws; consequently, he takes on the traits of the “Belle Dame sans merci”. Commentaries of the second kind, such as the poetic and prose versions of “Moralized Ovid” (Ovide moralisé), interpret the figure of Narcissus as the embodiment of the Christian sin of pride, and of Echo as the good name disregarded by the proud man. This second type of commentary fragments the text of the “Metamorphoses” into separate stories and personages, each with its own allegorical meaning. The figure of Narcissus and the story of his death becomes a didactic exemplum that can be used in a sermon. In addition, by the early 15th century, such commentaries had been adopted into treatises on poetry (seconde rhétorique) and poetics (poetria, poetic fiction). In them, Narcissus becomes a rhetorical figure destined to create poetic works of high style. The logical conclusion of this process of fragmentation, which is also embodied in the miniatures of illuminated manuscripts, is the appearance in the 16th century of a new form which Ovid’s poem takes, the form of a collection of emblems.

Philology. Linguistics, History (General)
CrossRef Open Access 2024
The issues of feminist linguistics in Slavic philology: a fragmentary-aspectual overview

Tetyana Kosmeda, Mariia Brus, Serhii Koloniuk

This article offers a concise overview of feminist linguistics within Slavic philology, focusing on the formation and function of feminitives. It analyzes the influence of national linguocultural traditions, their continuity, and extralinguistic factors on the development of feminine forms in Slavic languages. The study highlights the dominance of masculine forms and the underrepresentation of feminist perspectives in these languages. By examining the evolution of feminitives and the factors driving their usage, the authors argue that feminization is a regular, irreversible process across the Slavic linguistic landscape. The research reveals a significant difference in the degree of feminitive formation between West Slavic languages and East/South Slavic languages, suggesting varying levels of gender equality awareness and feminist activism across these regions.

arXiv Open Access 2023
Leveraging phone-level linguistic-acoustic similarity for utterance-level pronunciation scoring

Wei Liu, Kaiqi Fu, Xiaohai Tian et al.

Recent studies on pronunciation scoring have explored the effect of introducing phone embeddings as reference pronunciation, but mostly in an implicit manner, i.e., addition or concatenation of reference phone embedding and actual pronunciation of the target phone as the phone-level pronunciation quality representation. In this paper, we propose to use linguistic-acoustic similarity to explicitly measure the deviation of non-native production from its native reference for pronunciation assessment. Specifically, the deviation is first estimated by the cosine similarity between reference phone embedding and corresponding acoustic embedding. Next, a phone-level Goodness of pronunciation (GOP) pre-training stage is introduced to guide this similarity-based learning for better initialization of the aforementioned two embeddings. Finally, a transformer-based hierarchical pronunciation scorer is used to map a sequence of phone embeddings, acoustic embeddings along with their similarity measures to predict the final utterance-level score. Experimental results on the non-native databases suggest that the proposed system significantly outperforms the baselines, where the acoustic and phone embeddings are simply added or concatenated. A further examination shows that the phone embeddings learned in the proposed approach are able to capture linguistic-acoustic attributes of native pronunciation as reference.

en eess.AS, cs.SD
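The deviation measure described above reduces to a per-phone cosine similarity between a reference phone embedding and the learner's acoustic embedding. A minimal sketch (embedding dimensions and the mean-pooled utterance score are illustrative assumptions; the paper maps the sequence through a transformer-based hierarchical scorer instead):

```python
import numpy as np

def phone_similarity(phone_emb, acoustic_emb):
    """Cosine similarity between a reference phone embedding and the
    actual acoustic embedding; lower similarity means a larger
    deviation from native pronunciation."""
    num = np.dot(phone_emb, acoustic_emb)
    den = np.linalg.norm(phone_emb) * np.linalg.norm(acoustic_emb)
    return float(num / den)

def utterance_score(phone_embs, acoustic_embs):
    """Toy utterance-level score: mean per-phone similarity."""
    sims = [phone_similarity(p, a) for p, a in zip(phone_embs, acoustic_embs)]
    return float(np.mean(sims))
```

This makes the deviation explicit, in contrast to the implicit addition or concatenation of the two embeddings used by the baselines.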
arXiv Open Access 2023
DEPAC: a Corpus for Depression and Anxiety Detection from Speech

Mashrura Tasnim, Malikeh Ehghaghi, Brian Diep et al.

Mental distress like depression and anxiety contribute to the largest proportion of the global burden of diseases. Automated diagnosis systems of such disorders, empowered by recent innovations in Artificial Intelligence, can pave the way to reduce the sufferings of the affected individuals. Development of such systems requires information-rich and balanced corpora. In this work, we introduce a novel mental distress analysis audio dataset DEPAC, labeled based on established thresholds on depression and anxiety standard screening tools. This large dataset comprises multiple speech tasks per individual, as well as relevant demographic information. Alongside, we present a feature set consisting of hand-curated acoustic and linguistic features, which were found effective in identifying signs of mental illnesses in human speech. Finally, we justify the quality and effectiveness of our proposed audio corpus and feature set in predicting depression severity by comparing the performance of baseline machine learning models built on this dataset with baseline models trained on other well-known depression corpora.

en eess.AS, cs.CL
DOAJ Open Access 2023
Deutschlehrkräfte und ihre Vertrautheit mit, Nutzung von und Einstellung zu digitalen Medien während und vor der Corona-Pandemie in Deutschland

Volker Frederking

German teachers and their familiarity with, use of, and attitudes toward digital media during and before the Corona pandemic in Germany: findings from two studies. This paper presents results from two studies that investigated German teachers and their familiarity with, use of, and attitudes towards digital media. In the self-funded DIDID study ("Digital Distance Learning in the German Classroom during the Corona Pandemic"), in which 1,981 German teachers participated from November 2020 to March 2021 via an online questionnaire, data were obtained on German teachers' digital competence, attitude, and usage patterns with regard to digital media during the Corona pandemic. In the GETDIME study ("Effects of the Covid Pandemic on German Language and Literature Teachers and their Familiarity with, Use of and Views on Digital Media"), funded by the DFG in 2021/22, German teachers' digital competence, attitude, and usage patterns before the Corona pandemic were investigated on the basis of unpublished data from the 2018 ICILS study. Extrapolating basic teacher profiles from DIDID and ICILS enabled a direct comparison. It reveals a slight progression in familiarity with, use of, and attitudes toward digital media among the German teachers studied in DIDID during the pandemic, relative to the German teachers who participated in ICILS 2018 before the pandemic, along with a clearly increased interest in, and a clearly contoured need for, further training.

Education, Communication. Mass media
DOAJ Open Access 2023
Church communication highlights 2022

Diego Contreras

The year 2022 will be remembered for the outbreak of war in Ukraine, a conflict that shattered the incipient optimism caused by the gradual disappearance of the Covid pandemic. Alongside the succession of news about the war and the Pope's appeals for peace, the media echoed other important events in the life of the Church. In these pages we offer a selection of the most significant ones. Obviously, this is a choice based on the author's criteria, which include the persistence of the news throughout the year.

Philosophy of religion. Psychology of religion. Religion in relation to other subjects, Communication. Mass media

Page 46 of 39,716