Results for "Language. Linguistic theory. Comparative grammar"

Showing 20 of ~41,739 results · from DOAJ, arXiv

arXiv Open Access 2026
Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models

Jinghan Cao, Yu Ma, Xinjin Li et al.

Large Language Models achieve remarkable performance but incur substantial computational costs unsuitable for resource-constrained deployments. This paper presents the first comprehensive task-specific efficiency analysis comparing 16 language models across five diverse NLP tasks. We introduce the Performance-Efficiency Ratio (PER), a novel metric integrating accuracy, throughput, memory, and latency through geometric mean normalization. Our systematic evaluation reveals that small models (0.5--3B parameters) achieve superior PER scores across all given tasks. These findings establish quantitative foundations for deploying small models in production environments prioritizing inference efficiency over marginal accuracy gains.
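The abstract does not give the exact PER formula; a minimal sketch, assuming PER is the geometric mean of normalized factors where throughput is higher-is-better and memory and latency are lower-is-better (so they are inverted), with hypothetical reference values for normalization:

```python
import math

def per(accuracy, throughput, memory_gb, latency_ms,
        ref_throughput=1.0, ref_memory_gb=1.0, ref_latency_ms=1.0):
    """Hypothetical Performance-Efficiency Ratio: geometric mean of
    normalized factors. Higher-is-better metrics enter directly;
    lower-is-better metrics (memory, latency) are inverted."""
    factors = [
        accuracy,                     # already in [0, 1]
        throughput / ref_throughput,  # higher is better
        ref_memory_gb / memory_gb,    # lower is better -> invert
        ref_latency_ms / latency_ms,  # lower is better -> invert
    ]
    return math.exp(sum(math.log(f) for f in factors) / len(factors))
```

Under this sketch, a small model that trades a little accuracy for large gains in throughput, memory, and latency scores a higher PER, which is the trade-off the paper quantifies.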

en cs.CL, cs.LG
DOAJ Open Access 2025
Mitologia, kultura i historia w świecie Wiedźmina. Językowe aspekty powieściowe

Karolina Krygier

The aim of this article is to discuss the most significant references in Andrzej Sapkowski's Witcher saga to mythology, culture, and history, and to illustrate the novelistic linguistic aspects that appear in it. The article seeks to show that even popular literature (fantasy) contains many interesting elements worth subjecting to in-depth analysis. Drawing on existing research on the Witcher novels and computer games, the most important references of the Witcher cycle are discussed, abbreviated descriptions of non-human beings are presented, and the language of the characters created by Sapkowski is analyzed as a new linguistic system. The research theses are supported with quotations from individual volumes of the saga and from the games. An attempt is also made to interpret the meanings of certain linguistic forms from the Elder Speech on the basis of available sources. This research shows that the Witcher saga has not become a phenomenon of global pop culture without reason.

Language. Linguistic theory. Comparative grammar
arXiv Open Access 2025
Looking forward: Linguistic theory and methods

John Mansfield, Ethan Gotlieb Wilcox

This chapter examines current developments in linguistic theory and methods, focusing on the increasing integration of computational, cognitive, and evolutionary perspectives. We highlight four major themes shaping contemporary linguistics: (1) the explicit testing of hypotheses about symbolic representation, such as efficiency, locality, and conceptual semantic grounding; (2) the impact of artificial neural networks on theoretical debates and linguistic analysis; (3) the importance of intersubjectivity in linguistic theory; and (4) the growth of evolutionary linguistics. By connecting linguistics with computer science, psychology, neuroscience, and biology, we provide a forward-looking perspective on the changing landscape of linguistic research.

en cs.CL
arXiv Open Access 2025
Understanding Network Behaviors through Natural Language Question-Answering

Mingzhe Xing, Chang Tian, Jianan Zhang et al.

Modern large-scale networks introduce significant complexity in understanding network behaviors, increasing the risk of misconfiguration. Prior work proposed to understand network behaviors by mining network configurations, typically relying on domain-specific languages interfaced with formal models. While effective, they suffer from a steep learning curve and limited flexibility. In contrast, natural language (NL) offers a more accessible and interpretable interface, motivating recent research on NL-guided network behavior understanding. Recent advances in large language models (LLMs) further enhance this direction, leveraging their extensive prior knowledge of network concepts and strong reasoning capabilities. However, three key challenges remain: 1) numerous router devices with lengthy configuration files challenge LLM's long-context understanding ability; 2) heterogeneity across devices and protocols impedes scalability; and 3) complex network topologies and protocols demand advanced reasoning abilities beyond the current capabilities of LLMs. To tackle the above challenges, we propose NetMind, a novel framework for querying networks using NL. Our approach introduces a tree-based configuration chunking strategy to preserve semantic coherence while enabling efficient partitioning. We then construct a unified fact graph as an intermediate representation to normalize vendor-specific configurations. Finally, we design a hybrid imperative-declarative language to reduce the reasoning burden on LLMs and enhance precision. We contribute a benchmark consisting of NL question-answer pairs paired with network configurations. Experiments demonstrate that NetMind achieves accurate and scalable network behavior understanding, outperforming existing baselines.
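The paper's tree-based chunking strategy is not specified in the abstract beyond "preserve semantic coherence while enabling efficient partitioning"; a toy sketch, assuming configurations are parsed into a nested dict and that a subtree is kept whole whenever it fits a size budget:

```python
def chunk_config(tree, budget, path=()):
    """Hypothetical tree-based chunking: keep a subtree in one chunk if it
    fits the budget (number of leaves); otherwise recurse into children,
    so each emitted chunk remains a semantically coherent subtree."""
    def size(node):
        if isinstance(node, dict):
            return sum(size(v) for v in node.values())
        return 1

    if not isinstance(tree, dict) or size(tree) <= budget:
        return [(path, tree)]
    chunks = []
    for key, child in tree.items():
        chunks.extend(chunk_config(child, budget, path + (key,)))
    return chunks
```

For example, a router config with a `bgp` subtree of two neighbors and a one-line `ospf` subtree splits into two chunks under a budget of 2, each rooted at a protocol block rather than cutting mid-stanza.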

en cs.CL, cs.AI
arXiv Open Access 2025
MathRobust-LV: Evaluation of Large Language Models' Robustness to Linguistic Variations in Mathematical Reasoning

Neeraja Kirtane, Yuvraj Khanna, Peter Relan

Large language models excel on math benchmarks, but the robustness of their mathematical reasoning to linguistic variation is underexplored. While recent work increasingly treats high-difficulty competitions like the IMO as the gold standard for evaluating reasoning, we argue for comprehensive benchmarking of high school-level math problems in real educational settings. We introduce MathRobust-LV, a test set and evaluation methodology that mirrors how instructors rephrase problems across assessments while keeping difficulty constant: we change surface details (names, contexts, variables) while preserving numerical structure and answers. In contrast to prior efforts that alter problem content or emphasize IMO-level tasks, we focus on high-school-level dataset problems at the difficulty level where models are currently deployed in educational settings: tutoring and assessment systems. In these applications, instructors rephrase identical concepts in varied ways, making linguistic robustness essential for reliable deployment. Although MATH benchmarking is often regarded as saturated, our experiments on 34 models reveal that accuracy declines when moving from the baseline to the variants. These drops are severe for smaller models (9-11%), while stronger models also show measurable degradation. Frontier models such as GPT-5 and Gemini-2.5pro remain comparatively stable. Our results highlight that robustness to linguistic variation is a fundamental challenge, exposing reasoning vulnerabilities in models.
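The core perturbation idea, changing surface details while preserving numerical structure and answers, can be sketched as a substitution pass that refuses to touch digits. The substitution table and the self-check are illustrative assumptions, not the paper's actual pipeline:

```python
import re

def rephrase(problem, name_map):
    """Swap surface details (names, contexts) while leaving every number
    untouched, so the numerical structure and the answer are preserved.
    name_map is a hypothetical substitution table."""
    pattern = re.compile("|".join(re.escape(k) for k in name_map))
    variant = pattern.sub(lambda m: name_map[m.group(0)], problem)
    # sanity check: both versions must contain exactly the same numbers
    assert re.findall(r"\d+\.?\d*", variant) == re.findall(r"\d+\.?\d*", problem)
    return variant
```

Applied to "Alice buys 3 apples for 6 dollars" with the map {"Alice": "Ravi", "apples": "pens", "dollars": "rupees"}, the variant keeps the numbers 3 and 6 and therefore the same answer.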

en cs.CL
arXiv Open Access 2024
Do Audio-Language Models Understand Linguistic Variations?

Ramaneswaran Selvakumar, Sonal Kumar, Hemant Kumar Giri et al.

Open-vocabulary audio language models (ALMs), like Contrastive Language Audio Pretraining (CLAP), represent a promising new paradigm for audio-text retrieval using natural language queries. In this paper, for the first time, we perform controlled experiments on various benchmarks to show that existing ALMs struggle to generalize to linguistic variations in textual queries. To address this issue, we propose RobustCLAP, a novel and compute-efficient technique to learn audio-language representations agnostic to linguistic variations. Specifically, we reformulate the contrastive loss used in CLAP architectures by introducing a multi-view contrastive learning objective, in which paraphrases are treated as different views of the same audio scene, and use this objective for training. Our proposed approach improves the text-to-audio retrieval performance of CLAP by 0.8%-13% across benchmarks and enhances robustness to linguistic variation.
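A toy version of the multi-view idea, a sketch in the spirit of the description rather than the paper's code: each audio clip's row of the audio-text similarity matrix has several positive columns (the caption and its paraphrases), and the loss averages the negative log-softmax over all of them:

```python
import math

def multiview_nce(sim_rows, positives):
    """Toy multi-view contrastive objective (illustrative, not RobustCLAP's
    implementation): sim_rows[i] holds similarities between audio clip i and
    every text in the batch; positives[i] lists the column indices of that
    clip's caption and its paraphrase views. The loss averages
    -log softmax(row)[j] over all positive views j."""
    total, count = 0.0, 0
    for row, pos in zip(sim_rows, positives):
        z = sum(math.exp(s) for s in row)
        for j in pos:
            total += -math.log(math.exp(row[j]) / z)
            count += 1
    return total / count
```

Raising the similarity of every paraphrase view lowers the loss, which is what pushes the learned representations to ignore surface rewording.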

en cs.SD, cs.LG
DOAJ Open Access 2023
The Trajectory of English Foreign Learners’ Emotions and Motivation from the Start to the End of their Learning Journey: A Pseudo-Longitudinal Investigation

Jean-Marc Dewaele, Rachid Meftah

The current study is based on a pseudo-longitudinal design to investigate the trajectory of Foreign Language Enjoyment (FLE), Foreign Language Peace of Mind (FLPOM), Foreign Language Classroom Anxiety (FLCA), Foreign Language Boredom (FLB) among a total of 502 Beginner, Intermediate and Advanced English Foreign Language (EFL) learners in Morocco who filled out a single online questionnaire. Statistical results showed that motivation remained unchanged across skill levels but that positive emotions increased significantly and negative emotions dropped significantly, with the transition from Beginner to Intermediate skill levels showing the biggest change. The direction of relationships between the dependent variables remained similar although their strengths varied slightly across skill levels, reflecting the dynamic nature of FL learners’ emotions and motivation.

Special aspects of education, Language acquisition
DOAJ Open Access 2023
Consonant mutation in Nzema and Esahie

Samuel Amoh, John Nyame, Isaac Nyarko et al.

This study examines consonant mutation in Esahie[1] and Nzema. The motivation for the study arises from the mutual intelligibility of the two languages. Previous studies have treated this subject only superficially, and in each language separately; although consonant mutation is common to both languages, the characteristics of the phenomenon are partly similar and partly dissimilar across them. This study therefore compares and contrasts consonant mutation in the two related languages to establish a correlation, discussing mutational patterns, directionality, and voicing. Data for the study were assembled from four native speakers of each language and from the literature on the respective languages, and are analysed within the purview of Distinctive Feature Theory. The study finds that the phonological environments in which the various realizations of mutation occur also give rise to a harmony system, although the two languages differ in their mutational domains. Featural agreement normally holds between vowel and consonant. In both Nzema and Esahie, the feature responsible for mutation in the alternant pairs [k/x] and [k/g] is [dorsal]; in [d/l] and [d/n] it is [coronal]; in [ʨ/ɕ] it is [+strident]; and in [b/m] it is [labial]. In both languages, harmonic assimilation is bidirectional. [1] Esahie is also known as Sefwi, Sehwi, or Asahyue. In Ghana, the language is classified as part of the Akan linguistic group within the larger Akan dialects, even though it is more closely intelligible with Nzema (a solely ethnographic Akan language) than with Twi and Mfantse (which are both linguistic and ethnographic Akan). In this paper, we therefore treat the two variants as separate languages.

Computational linguistics. Natural language processing, Language. Linguistic theory. Comparative grammar
arXiv Open Access 2023
Antarlekhaka: A Comprehensive Tool for Multi-task Natural Language Annotation

Hrishikesh Terdalkar, Arnab Bhattacharya

One of the primary obstacles in the advancement of Natural Language Processing (NLP) technologies for low-resource languages is the lack of annotated datasets for training and testing machine learning models. In this paper, we present Antarlekhaka, a tool for manual annotation of a comprehensive set of tasks relevant to NLP. The tool is Unicode-compatible, language-agnostic, Web-deployable and supports distributed annotation by multiple simultaneous annotators. The system sports user-friendly interfaces for 8 categories of annotation tasks. These, in turn, enable the annotation of a considerably larger set of NLP tasks. The task categories include two linguistic tasks not handled by any other tool, namely, sentence boundary detection and deciding canonical word order, which are important tasks for text that is in the form of poetry. We propose the idea of sequential annotation based on small text units, where an annotator performs several tasks related to a single text unit before proceeding to the next unit. The research applications of the proposed mode of multi-task annotation are also discussed. Antarlekhaka outperforms other annotation tools in objective evaluation. It has been also used for two real-life annotation tasks on two different languages, namely, Sanskrit and Bengali. The tool is available at https://github.com/Antarlekhaka/code.

en cs.CL
arXiv Open Access 2022
Why is constrained neural language generation particularly challenging?

Cristina Garbacea, Qiaozhu Mei

Recent advances in deep neural language models combined with the capacity of large scale datasets have accelerated the development of natural language generation systems that produce fluent and coherent texts (to various degrees of success) in a multitude of tasks and application contexts. However, controlling the output of these models for desired user and task needs is still an open challenge. This is crucial not only to customizing the content and style of the generated language, but also to their safe and reliable deployment in the real world. We present an extensive survey on the emerging topic of constrained neural language generation in which we formally define and categorize the problems of natural language generation by distinguishing between conditions and constraints (the latter being testable conditions on the output text instead of the input), present constrained text generation tasks, and review existing methods and evaluation metrics for constrained text generation. Our aim is to highlight recent progress and trends in this emerging field, informing on the most promising directions and limitations towards advancing the state-of-the-art of constrained neural language generation research.

en cs.CL, cs.AI
arXiv Open Access 2022
Prompting Is Programming: A Query Language for Large Language Models

Luca Beurer-Kellner, Marc Fischer, Martin Vechev

Large language models have demonstrated outstanding performance on a wide range of tasks such as question answering and code generation. On a high level, given an input, a language model can be used to automatically complete the sequence in a statistically-likely way. Based on this, users prompt these models with language instructions or examples, to implement a variety of downstream tasks. Advanced prompting methods can even imply interaction between the language model, a user, and external tools such as calculators. However, to obtain state-of-the-art performance or adapt language models for specific tasks, complex task- and model-specific programs have to be implemented, which may still require ad-hoc interaction. Based on this, we present the novel idea of Language Model Programming (LMP). LMP generalizes language model prompting from pure text prompts to an intuitive combination of text prompting and scripting. Additionally, LMP allows constraints to be specified over the language model output. This enables easy adaptation to many tasks while abstracting language model internals and providing high-level semantics. To enable LMP, we implement LMQL (short for Language Model Query Language), which leverages the constraints and control flow from an LMP prompt to generate an efficient inference procedure that minimizes the number of expensive calls to the underlying language model. We show that LMQL can capture a wide range of state-of-the-art prompting methods in an intuitive way, especially facilitating interactive flows that are challenging to implement with existing high-level APIs. Our evaluation shows that we retain or increase the accuracy on several downstream tasks, while also significantly reducing the required amount of computation or cost in the case of pay-to-use APIs (26-85% cost savings).
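The key mechanism, enforcing output constraints during decoding rather than after the fact, can be illustrated with a toy greedy loop. This is a sketch in the spirit of LMP, not the actual LMQL implementation; the `propose` and `constraint` callables are hypothetical stand-ins for a model's next-token distribution and a user-written constraint:

```python
def constrained_decode(propose, constraint, prefix, max_tokens=10):
    """Toy constraint-guided decoding: at each step the model proposes
    scored continuations, candidates violating the constraint are masked
    out, and the best surviving candidate is appended greedily."""
    out = list(prefix)
    for _ in range(max_tokens):
        candidates = propose(out)  # [(token, score), ...]
        legal = [(t, s) for t, s in candidates if constraint(out + [t])]
        if not legal:
            break  # no continuation can satisfy the constraint
        token, _ = max(legal, key=lambda ts: ts[1])
        out.append(token)
    return out
```

Because illegal candidates never enter the output, the constraint holds by construction, and no model calls are wasted on completions that would later be rejected, which is the efficiency argument the paper makes.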

en cs.CL, cs.AI
arXiv Open Access 2022
Challenges in Measuring Bias via Open-Ended Language Generation

Afra Feyza Akyürek, Muhammed Yusuf Kocyigit, Sejin Paik et al.

Researchers have devised numerous ways to quantify social biases vested in pretrained language models. As some language models are capable of generating coherent completions given a set of textual prompts, several prompting datasets have been proposed to measure biases between social groups -- posing language generation as a way of identifying biases. In this opinion paper, we analyze how specific choices of prompt sets, metrics, automatic tools and sampling strategies affect bias results. We find that the practice of measuring biases through text completion is prone to yielding contradictory results under different experimental settings. We additionally provide recommendations for reporting biases in open-ended language generation for a more complete outlook on the biases exhibited by a given language model. Code to reproduce the results is released at https://github.com/feyzaakyurek/bias-textgen.

en cs.CL, cs.CY
arXiv Open Access 2021
gaBERT -- an Irish Language Model

James Barry, Joachim Wagner, Lauren Cassidy et al.

The BERT family of neural language models has become highly popular due to its ability to provide sequences of text with rich context-sensitive token encodings that generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance. We compare the results of fine-tuning a gaBERT model with an mBERT model for the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.

en cs.CL
DOAJ Open Access 2020
Language Shapes Socially Constructed Gender Roles: Ibsen’s ‘A Doll’s House’ in Focus

Haniya Munir

Language plays an important role in human life and can be viewed from various perspectives: cultural, linguistic, social, psychological, gender-related, and moral-ethical. It is a proven fact that we use language and, at the same time, language uses us to define, designate, tag, and shape our places in society (Cameron, 2005). This role of language applies to all humans, male or female, but the basic purpose of this study is to explain how language shapes a woman's place and identity in society. Women often face linguistic discrimination in two ways: in how they are taught to speak and use language, and in how language treats them (Lakoff, 2004). These linguistic disparities tend to confine a woman's role and function in society to that of a sex object, a servant, a wife, a daughter, a mother, and, specifically, a woman (Kerber, 1988). The researcher collected the data for this study from Ibsen's (1999) 'A Doll's House', in which different lexical items, phrases, and sentences are uttered intentionally to establish the role of the main character, Nora, as a wife, a daughter, and a woman. The researcher examined the speeches of different characters to show the language-made and man-made places of women in society. For this purpose, a theoretical framework based on a qualitative approach was used, drawing on Lakoff (2004), who in her 'Dominance Theory' explains how language shapes a woman's place in society by analyzing her own speeches and those of different people in society. The findings of the study show that language specifies gender roles not only individually, but also internally and externally.
Essentially, the various social characters surrounding a woman use language in such a way that it begins shaping her character into different sub-characters, as shown in the work of Ibsen (1999). Furthermore, language use tells us that a man remains a man in every situation, whether as a father, a husband, a son, or, above all, simply a man, whereas a woman's place in society changes according to language use and the names that men have applied to women. For example, if a little girl talks roughly like a boy, she is scolded by her parents and friends (Lakoff, 2004). This process of socialization is harmful in that it makes women weak, incapable, and less confident; yet, analyzing Nora's final lines in the selected text of Ibsen (1999), we learn that the constant battering and hammering of socialization and generalization are now making women aware of their individual place and identity in society, and they are now looking at life from a different perspective that remains unacceptable in the man-made society (Kramer, 1974). This study will open new avenues for sociolinguists to study language and gender keenly and critically.

Language. Linguistic theory. Comparative grammar, Oral communication. Speech
arXiv Open Access 2020
Speech Recognition for Endangered and Extinct Samoyedic languages

Niko Partanen, Mika Hämäläinen, Tiina Klooster

Our study presents a series of experiments on speech recognition with endangered and extinct Samoyedic languages, spoken in Northern and Southern Siberia. To the best of our knowledge, this is the first time a functional ASR system has been built for an extinct language. With the Kamas language we achieve a Label Error Rate of 15%, and we conclude through careful error analysis that this quality is already very useful as a starting point for refined human transcriptions. Our results with the related Nganasan language are more modest, with the best model having an error rate of 33%. We show, however, through experiments in which the Kamas training data is enlarged incrementally, that the Nganasan results are in line with what is expected under the low-resource circumstances of the language. Based on this, we provide recommendations for scenarios in which further language documentation or archive processing activities could benefit from modern ASR technology. All training data and processing scripts have been published on Zenodo with clear licences to ensure further work on this important topic.
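Label Error Rate is a standard metric: the Levenshtein edit distance between the reference and hypothesis label sequences, normalized by the reference length. A minimal self-contained implementation (the rolling-row DP is a common formulation, not the paper's code):

```python
def label_error_rate(ref, hyp):
    """Label Error Rate: Levenshtein edit distance between the reference
    and hypothesis label sequences, divided by the reference length.
    Uses a single rolling row of the standard DP table."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))  # distance from empty ref prefix
    for i in range(1, m + 1):
        prev, d[0] = d[0], i  # prev holds the diagonal cell
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,      # deletion
                       d[j - 1] + 1,  # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return d[n] / m
```

A 15% LER thus means roughly one label edit per seven reference labels, which supports the paper's claim that the output is usable as a starting point for human correction.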

en cs.CL
DOAJ Open Access 2019
L’estetica dell’impossibile: lo strano caso del gotico e della fantascienza

Federica Perazzini

In the history of human creativity, the act of imagining the impossible has always been at the core of the physical and metaphysical perception of the unknown. The scholarly debate regarding the nature of the impossible gained particular relevance in the context of British Enlightenment when the expanding sciences, along with literature, attempted to provide empirical validation to inexplicable and supernatural phenomena. In this way, the discrepancies between the overlapping ontologies of the Age of Faith and the Age of Reason became apparent as the ancestral literary practice of the fantastic merged with the rising genre of the novel. The assimilation of the conventional tropes of supernatural literature within the narrative frame of formal realism led to the development of two fortunate sub-genres: the Gothic and Science Fiction. The former evolved around the mutual disruption of the empirically-based conception of reality and the transgression of the moral code implied in the construction of civic order. The latter derived from the relocation of specific gothic features into a larger dimension of social anxiety concerning the abuses of reason concealed as a path towards common good and future progress. By exploring the evolution of the gothic imagery and its dissolution into the narrative horizon of Science Fiction, this article will trace the early modern roots of the dialogue between science and literature in the human quest for the impossible. The thesis that Gothic and Science Fiction are historically interdependent will be reviewed in light of the common matrix of fear and desire which characterises their ideological function.

Geography. Anthropology. Recreation, Language. Linguistic theory. Comparative grammar
arXiv Open Access 2019
Polysemy and brevity versus frequency in language

Bernardino Casas, Antoni Hernández-Fernández, Neus Català et al.

The pioneering research of G. K. Zipf on the relationship between word frequency and other word features led to the formulation of various linguistic laws. The most popular is Zipf's law for word frequencies. Here we focus on two laws that have been studied less intensively: the meaning-frequency law, i.e. the tendency of more frequent words to be more polysemous, and the law of abbreviation, i.e. the tendency of more frequent words to be shorter. In a previous work, we tested the robustness of these Zipfian laws for English, roughly measuring word length in number of characters and distinguishing adult from child speech. In the present article, we extend our study to other languages (Dutch and Spanish) and introduce two additional measures of length: syllabic length and phonemic length. Our correlation analysis indicates that both the meaning-frequency law and the law of abbreviation hold overall in all the analyzed languages.
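The law of abbreviation can be checked on any corpus by correlating each word type's frequency with its length; the law predicts a negative correlation. A small self-contained sketch using Pearson correlation on character length (the paper also uses syllabic and phonemic length, and reports correlation analyses rather than this exact code):

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def abbreviation_correlation(corpus):
    """Correlate word-type frequency with character length; Zipf's law
    of abbreviation predicts this correlation is negative."""
    freq = {}
    for w in corpus:
        freq[w] = freq.get(w, 0) + 1
    words = list(freq)
    return pearson([freq[w] for w in words], [len(w) for w in words])
```

Even a toy corpus in which short function words repeat and long content words occur once yields the predicted negative sign.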

en cs.CL, physics.soc-ph
arXiv Open Access 2019
Abusive Language Detection with Graph Convolutional Networks

Pushkar Mishra, Marco Del Tredici, Helen Yannakoudakis et al.

Abuse on the Internet represents a significant societal problem of our time. Previous research on automated abusive language detection in Twitter has shown that community-based profiling of users is a promising technique for this task. However, existing approaches only capture shallow properties of online communities by modeling follower-following relationships. In contrast, working with graph convolutional networks (GCNs), we present the first approach that captures not only the structure of online communities but also the linguistic behavior of the users within them. We show that such a heterogeneous graph-structured modeling of communities significantly advances the current state of the art in abusive language detection.

en cs.CL

Page 24 of 2087