Results for "Language. Linguistic theory. Comparative grammar"

Showing 20 of ~4,437,759 results · from DOAJ, arXiv, CrossRef, Semantic Scholar

arXiv Open Access 2026
Beyond Training for Cultural Awareness: The Role of Dataset Linguistic Structure in Large Language Models

Reem I. Masoud, Chen Feng, Shunta Asano et al.

The global deployment of large language models (LLMs) has raised concerns about cultural misalignment, yet the linguistic properties of fine-tuning datasets used for cultural adaptation remain poorly understood. We adopt a dataset-centric view of cultural alignment and ask which linguistic properties of fine-tuning data are associated with cultural performance, whether these properties are predictive prior to training, and how these effects vary across models. We compute lightweight linguistic, semantic, and structural metrics for Arabic, Chinese, and Japanese datasets and apply principal component analysis separately within each language. This design ensures that the resulting components capture variation among datasets written in the same language rather than differences between languages. The resulting components correspond to broadly interpretable axes related to semantic coherence, surface-level lexical and syntactic diversity, and lexical or structural richness, though their composition varies across languages. We fine-tune three major LLM families (LLaMA, Mistral, DeepSeek) and evaluate them on benchmarks of cultural knowledge, values, and norms. While PCA components correlate with downstream performance, these associations are strongly model-dependent. Through controlled subset interventions, we show that lexical-oriented components (PC3) are the most robust, yielding more consistent performance across models and benchmarks, whereas emphasizing semantic or diversity extremes (PC1-PC2) is often neutral or harmful.

en cs.CL
S2 Open Access 2025
Counting trees: a treebank-driven exploration of syntactic variation in speech and writing across languages

Kaja Dobrovoljc

This paper presents a novel treebank-driven approach to comparing syntactic structures in speech and writing using dependency-parsed corpora. Adopting a fully inductive, bottom-up method, we define syntactic structures as delexicalized dependency (sub)trees and extract them from spoken and written Universal Dependencies (UD) treebanks in two syntactically distinct languages, English and Slovenian. For each corpus, we analyze the size, diversity, and distribution of syntactic inventories, their overlap across modalities, and the structures most characteristic of speech. Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities. Strikingly, the overlap between spoken and written syntactic inventories is very limited: most structures attested in speech do not occur in writing, pointing to modality-specific preferences in syntactic organization that reflect the distinct demands of real-time interaction and elaborated writing. This contrast is further supported by a keyness analysis of the most frequent speech-specific structures, which highlights patterns associated with interactivity, context-grounding, and economy of expression. We argue that this scalable, language-independent framework offers a useful general method for systematically studying syntactic variation across corpora, laying the groundwork for more comprehensive data-driven theories of grammar in use.

3 citations en Computer Science
DOAJ Open Access 2025
Exploring Phraseological Patterns in Business English Non-Finite Clauses

Olfa Ben Amor

The increasing availability of large-scale corpora and advanced data-processing tools has enhanced the analysis of phraseological units. This study investigates the phraseology of English non-finite clauses – specifically to-infinitive, -ing, and past participle clauses – headed by adjectives, adverbs, nouns, and pronouns. It explores the phraseological patterns of these structures and their semantic extensions within a specialized corpus of business English. The corpus comprises academic and journalistic registers, with the academic register including research articles from four leading journals and graduate theses from Tunisian institutions, while the news register features business articles from The Economist and Financial Times. The study identifies lexico-grammatical patterns forming various phraseologies of non-finite clauses and categorizes these patterns into semantic sets based on their degree of fixedness. Findings reveal differences in the frequency of non-finite phraseologies across the academic and news registers, and similarities in the degree of fixedness and functions of the most frequent patterns. The study offers a corpus-based account of how non-finite clause constructions are used across business registers, contributing to a broader understanding of register variation, discourse organization, and phraseological conventions in business discourse.

Special aspects of education, Language acquisition
DOAJ Open Access 2025
Linguistic insights into dementia from 1994 to 2023: A structural topic modeling-assisted bibliometric analysis

Hong Lei, Zhanhao Jiang

This article presents a bibliometric analysis of research on dementia in the field of linguistics. We reviewed and analyzed 545 articles published in 89 peer-reviewed journals between 1994 and 2023 to identify key bibliometric information and major research topics in this expanding field. The distribution of countries indicates that the United States is the most productive country, while researchers from the United Kingdom, Australia, and Canada also play an important role. Aphasiology and Brain and Language are the most influential journals in terms of research productivity and impact. The analysis of highly cited references demonstrates the intellectual foundation of this research field. The topics generated by structural topic modeling show that scholars in linguistics have responded to a variety of issues concerning dementia, encompassing semantic processing, multilingualism and cognitive functions, primary progressive aphasia and apraxia of speech, natural language processing techniques, the role of speech-language pathologists, communication dynamics in context, speech processing, syntactic processing, and word retrieval and language processing. This study aims to enhance researchers' understanding of the current state of this research field and provide insights for future studies.

Language. Linguistic theory. Comparative grammar, Medicine (General)
arXiv Open Access 2025
Categorial grammars with unique category assignment

Maxim Vishnikin, Alexander Okhotin

A categorial grammar assigns one of several syntactic categories to each symbol of the alphabet, and the category of a string is then deduced from the categories assigned to its symbols using two simple reduction rules. This paper investigates a special class of categorial grammars in which only one category is assigned to each symbol, thus eliminating ambiguity at the lexical level (in linguistic terms, a unique part of speech is assigned to each word). While unrestricted categorial grammars are equivalent to context-free grammars, the proposed subclass initially appears weak, as it cannot define even some regular languages. The paper proves that it is nevertheless powerful enough to define a homomorphic encoding of every context-free language, in the sense that for every context-free language $L$ over an alphabet $Σ$ there is a language $L'$ over some alphabet $Ω$, defined by a categorial grammar with unique category assignment, and a homomorphism $h \colon Σ \to Ω^+$, such that a string $w$ is in $L$ if and only if $h(w)$ is in $L'$. In particular, in Greibach's hardest context-free language theorem, it is sufficient to use a hardest language defined by a categorial grammar with unique category assignment.

en math.LO
arXiv Open Access 2025
PolyTruth: Multilingual Disinformation Detection using Transformer-Based Language Models

Zaur Gouliev, Jennifer Waters, Chengqian Wang

Disinformation spreads rapidly across linguistic boundaries, yet most AI models are still benchmarked only on English. We address this gap with a systematic comparison of five multilingual transformer models (mBERT, XLM, XLM-RoBERTa, RemBERT, and mT5) on a common fake-vs-true classification task. While transformer-based language models have demonstrated notable success in detecting disinformation in English, their effectiveness in multilingual contexts remains a matter of debate. To facilitate evaluation, we introduce the PolyTruth Disinfo Corpus, a novel corpus of 60,486 statement pairs (false claim vs. factual correction) spanning more than twenty-five languages that collectively cover five language families and a broad topical range across politics, health, climate, finance, and conspiracy, half of which are fact-checked disinformation claims verified by an augmented MindBugs Discovery dataset. Our experiments reveal clear performance variations: models such as RemBERT achieve better overall accuracy, particularly excelling in low-resource languages, whereas models like mBERT and XLM exhibit considerable limitations when training data is scarce. We discuss these performance patterns and their implications for real-world deployment. The dataset is publicly available on our GitHub repository to encourage further experimentation and advancement. Our findings illuminate both the potential and the current limitations of AI systems for multilingual disinformation detection.

en cs.CL, cs.LG
arXiv Open Access 2025
Small Language Models Reshape Higher Education: Courses, Textbooks, and Teaching

Jian Zhang, Jia Shao

While large language models (LLMs) have introduced novel paradigms in science and education, their adoption in higher education is constrained by inherent limitations, including a tendency to produce inaccuracies and high computational requirements, which conflict with the strict demand for accurate and reliable knowledge essential in higher education. Small language models (MiniLMs), by contrast, offer distinct advantages in professional education due to their lightweight nature and precise retrieval capabilities. This research takes "Atmospheric Physics" as an example. We established a specialized corpus and image repository by gathering more than 550,000 full-text PDFs from over 130 well-respected international journals in Earth and environmental science. From this collection, we extracted more than 100 million high-quality sentence-level corpus entries and more than 3 million high-resolution academic images. Using MiniLMs, these resources were organized into a high-dimensional vector library for precise retrieval and efficient use of extensive educational content. We then systematically redesigned the courses, textbooks, and teaching strategies for "Atmospheric Physics" based on MiniLMs. The course is designed as an "interdisciplinary-frontier" system, breaking down traditional boundaries between atmospheric science, space science, hydrology, and remote sensing. Teaching materials are transformed from static, outdated text formats into a dynamic digital resource library powered by MiniLMs. For teaching methods, we designed a question-based learning pathway that promotes a shift from passive knowledge transfer to active cognitive development. This MiniLM-driven "Atmospheric Physics" course thus demonstrates a concrete avenue for "AI for education".

en physics.ed-ph, cs.CL
arXiv Open Access 2025
Mind the Gap: Assessing Wiktionary's Crowd-Sourced Linguistic Knowledge on Morphological Gaps in Two Related Languages

Jonathan Sakunkoo, Annabella Sakunkoo

Morphological defectivity is an intriguing and understudied phenomenon in linguistics. Addressing defectivity, where expected inflectional forms are absent, is essential for improving the accuracy of NLP tools in morphologically rich languages. However, traditional linguistic resources often lack coverage of morphological gaps as such knowledge requires significant human expertise and effort to document and verify. For scarce linguistic phenomena in under-explored languages, Wikipedia and Wiktionary often serve as among the few accessible resources. Despite their extensive reach, their reliability has been a subject of controversy. This study customizes a novel neural morphological analyzer to annotate Latin and Italian corpora. Using the massive annotated data, crowd-sourced lists of defective verbs compiled from Wiktionary are validated computationally. Our results indicate that while Wiktionary provides a highly reliable account of Italian morphological gaps, 7% of Latin lemmata listed as defective show strong corpus evidence of being non-defective. This discrepancy highlights potential limitations of crowd-sourced wikis as definitive sources of linguistic knowledge, particularly for less-studied phenomena and languages, despite their value as resources for rare linguistic features. By providing scalable tools and methods for quality assurance of crowd-sourced data, this work advances computational morphology and expands linguistic knowledge of defectivity in non-English, morphologically rich languages.

en cs.CL, cs.CY
DOAJ Open Access 2024
Designações eponímicas em textos de divulgação médica – um estudo contrastivo alemão-português

Teresa Alegre, Katrin Herget

In the field of medical and health language, the naming of diseases, as well as anatomical designations, therapeutic methods, and devices used in patient treatment, has always relied on eponyms and metaphors. Several studies (Montalt & González-Davies, 2007; Ferguson & Thomas, 2014, among others) report translation problems arising from discrepancies in the use of these designations within a given language pair. Starting from these issues, the present study contrastively analyzes the use of eponyms, metaphors, and metonymies in medical language for the German-Portuguese language pair. The aim of the study is to identify and analyze eponymic designations in popular medical texts using comparable corpora. In a first phase, the corpora were compiled from texts taken from medical information websites belonging to various health institutions. The sample comprises a total of 123,026 tokens in German and 150,323 tokens in Portuguese. Based on the corpora, the eponyms were then extracted, analyzed, and classified according to their linguistic characteristics. The analysis concluded with a comparison of the use of the identified eponyms in both languages. The study aims to contribute to research on languages for specific purposes in the area of popular medical communication, as well as to gather information relevant to translation decision-making.

Language. Linguistic theory. Comparative grammar
DOAJ Open Access 2024
Beyond the Science

Shabrina Adzani Yusra

This study examines how Robert Louis Stevenson’s The Strange Case of Dr. Jekyll and Mr. Hyde reflects Victorian anxieties surrounding the misuse of scientific discoveries. The novel follows Dr. Henry Jekyll, a respected physician who creates a potion to unleash his darker self in the form of the monstrous Edward Hyde. Using a historical-biographical approach and qualitative analysis, this study draws on primary data from the novel, supported by secondary sources including literary criticism and journal articles. The findings reveal that Stevenson addresses Victorian fears of unchecked scientific progress through Dr. Jekyll’s experiments, the societal reactions to Mr. Hyde, and the depiction of Jekyll’s hubris, which together highlight the era’s concerns about the ethical limits of science.

Language. Linguistic theory. Comparative grammar, Languages and literature of Eastern Asia, Africa, Oceania
arXiv Open Access 2024
Native vs Non-Native Language Prompting: A Comparative Analysis

Mohamed Bayan Kmainasi, Rakif Khan, Ali Ezzat Shahroor et al.

Large language models (LLMs) have shown remarkable abilities across different fields, including standard Natural Language Processing (NLP) tasks. To elicit knowledge from LLMs, prompts, consisting of natural language instructions, play a key role. Most open- and closed-source LLMs are trained on available labeled and unlabeled resources: digital content such as text, images, audio, and video. Hence, these models have better knowledge of high-resource languages but struggle with low-resource ones. Since prompts play a crucial role in eliciting model capabilities, the language used for prompting remains an important research question. Although there has been significant research in this area, it is still limited, and medium- to low-resource languages remain underexplored. In this study, we investigate different prompting strategies (native vs. non-native) on 11 different NLP tasks associated with 12 different Arabic datasets (9.7K data points). In total, we conducted 197 experiments involving 3 LLMs, 12 datasets, and 3 prompting strategies. Our findings suggest that, on average, the non-native prompt performs best, followed by mixed and native prompts.

en cs.CL, cs.AI
arXiv Open Access 2024
FS-RAG: A Frame Semantics Based Approach for Improved Factual Accuracy in Large Language Models

Harish Tayyar Madabushi

We present a novel extension to Retrieval Augmented Generation with the goal of mitigating factual inaccuracies in the output of large language models. Specifically, our method draws on the cognitive linguistic theory of frame semantics for the indexing and retrieval of factual information relevant to helping large language models answer queries. We conduct experiments to demonstrate the effectiveness of this method both in terms of retrieval effectiveness and in terms of the relevance of the frames and frame relations automatically generated. Our results show that this novel mechanism of Frame Semantic-based retrieval, designed to improve Retrieval Augmented Generation (FS-RAG), is effective and offers potential for providing data-driven insights into frame semantics theory. We provide open access to our program code and prompts.

en cs.CL
arXiv Open Access 2024
A Survey of Large Language Models for Arabic Language and its Dialects

Malak Mashaabi, Shahad Al-Khalifa, Hend Al-Khalifa

This survey offers a comprehensive overview of Large Language Models (LLMs) designed for the Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and underscores the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.

en cs.CL, cs.AI
arXiv Open Access 2024
The American Sign Language Knowledge Graph: Infusing ASL Models with Linguistic Knowledge

Lee Kezar, Nidhi Munikote, Zian Zeng et al.

Language models for American Sign Language (ASL) could make language technologies substantially more accessible to those who sign. To train models on tasks such as isolated sign recognition (ISR) and ASL-to-English translation, datasets provide annotated video examples of ASL signs. To facilitate the generalizability and explainability of these models, we introduce the American Sign Language Knowledge Graph (ASLKG), compiled from twelve sources of expert linguistic knowledge. We use the ASLKG to train neuro-symbolic models for three ASL understanding tasks, achieving accuracies of 91% on ISR, 14% for predicting the semantic features of unseen signs, and 36% for classifying the topic of YouTube-ASL videos.

en cs.CL, cs.CV
DOAJ Open Access 2023
Les partícules confirmatòries (com a question tags i com a respostes confirmatives) en català (amb especial referència al nord-occidental i al rossellonès)

Xavier Rull

Questions can be open-ended (awaiting an affirmative or negative answer), but they can also have a confirmatory interpretation (i.e. there is an assumption to be confirmed). In the latter case, question tags (like oh) may appear in the question. This paper surveys the question tags in Catalan and the precise syntactic contexts in which they occur (at the beginning or at the end of the question; although in principle they can all occur in both positions, not all of them appear everywhere). Special emphasis is placed on the north-western and Roussillon dialects.

Semantics, Comparative grammar
DOAJ Open Access 2023
The Use of Linguistic Jokes in the EFL Classroom

Yusuf Al Arief

Humor is widely recognized as an effective means of promoting language acquisition and enhancing language skills. Using humor in language teaching can help students feel more at ease, reduce tension, and encourage positive emotions, all of which have a favorable effect on their motivation and engagement. Linguistic jokes are one type of humor that has not been well investigated in language-learning environments. This study uses descriptive qualitative methods to investigate the advantages of linguistic jokes in the English as a foreign language (EFL) classroom. Data were gathered from interviews, focus groups, and classroom observations involving EFL students. The results show that language learners benefit from linguistic jokes in EFL classes in several ways, including greater motivation, increased engagement, improved communication skills, and decreased anxiety. The study also emphasizes how humor can improve the learning environment by encouraging greater participation and a sense of belonging among students. The qualitative data show that learners hold positive attitudes toward the use of humor in language acquisition, which lends credence to these findings. The study proposes that EFL teachers use humor as a teaching method to facilitate language learning and develop language skills.

Language. Linguistic theory. Comparative grammar

Page 21 of 221,888