Hasil untuk "Philology. Linguistics"

Showing 20 of ~793,858 results · from DOAJ, arXiv, Semantic Scholar, CrossRef

arXiv Open Access 2026
Can LLM Agents Identify Spoken Dialects like a Linguist?

Tobias Bystrich, Lukas Hamm, Maria Hassan et al.

Due to the scarcity of labeled dialectal speech, audio dialect classification is a challenging task for most languages, including Swiss German. In this work, we explore the ability of large language model (LLM) agents to understand dialects and whether they can match the performance of models such as HuBERT in dialect classification. In addition, we provide an LLM baseline and a human-linguist baseline. Our approach uses phonetic transcriptions produced by ASR systems and combines them with linguistic resources such as dialect feature maps, vowel history, and rules. Our findings indicate that LLM predictions improve when linguistic information is provided. The human baseline shows that automatically generated transcriptions can be beneficial for such classifications, but also presents opportunities for improvement.

en cs.CL
arXiv Open Access 2025
Statistical laws and linguistics inform meaning in naturalistic and fictional conversation

Ashley M. A. Fehr, Calla G. Beauregard, Julia Witte Zimmerman et al.

Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type, with some generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps' law, which holds that vocabulary size scales with document length. Little work on Heaps' law has examined conversation or considered how language features affect scaling. We measure Heaps' law for conversations recorded in two distinct mediums: 1. strangers brought together on video chat, and 2. fictional characters in movies. We find that the scaling of vocabulary size differs by part of speech. We discuss these findings through behavioral and linguistic frameworks.

en cs.CL, cs.CY
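The Heaps' law relation the abstract builds on, V(n) ≈ K·n^β for vocabulary size V and text length n, can be sketched in a few lines of Python. This is an illustrative toy, not the authors' code; the fit is a plain log-log least-squares slope.

```python
import math

def heaps_curve(tokens):
    """Return (n, V) pairs: running text length vs. running vocabulary size."""
    seen, curve = set(), []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())
        curve.append((i, len(seen)))
    return curve

def fit_heaps_exponent(curve):
    """Least-squares fit of beta in log V = log K + beta * log n."""
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Toy "conversation": repeated text, so vocabulary saturates quickly.
tokens = ("the cat sat on the mat and the dog sat on the rug " * 5).split()
beta = fit_heaps_exponent(heaps_curve(tokens))
print(round(beta, 2))  # beta < 1: vocabulary grows sublinearly with length
```

Fitting the curve separately per part of speech (nouns only, verbs only, ...) is one way to see the differing scaling the paper reports.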
arXiv Open Access 2025
Automatic Extraction of Clausal Embedding Based on Large-Scale English Text Data

Iona Carslaw, Sivan Milton, Nicolas Navarre et al.

For linguists, embedded clauses have been of special interest because of their intricate distribution of syntactic and semantic features. Yet, current research relies on schematically created language examples to investigate these constructions, missing out on statistical information and naturally-occurring examples that can be gained from large language corpora. Thus, we present a methodological approach for detecting and annotating naturally-occurring examples of English embedded clauses in large-scale text data using constituency parsing and a set of parsing heuristics. Our tool has been evaluated on our dataset Golden Embedded Clause Set (GECS), which includes hand-annotated examples of naturally-occurring English embedded clause sentences. Finally, we present a large-scale dataset of naturally-occurring English embedded clauses which we have extracted from the open-source corpus Dolma using our extraction tool.

en cs.CL
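Detecting embedded clauses in constituency parses commonly reduces to locating SBAR subtrees. The sketch below is a hypothetical, minimal version of such a heuristic over bracketed parse strings; the paper's actual tool applies a fuller set of parsing heuristics.

```python
import re

def parse_sexpr(s):
    """Parse a bracketed constituency string into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    def walk(i):
        assert tokens[i] == "("
        label = tokens[i + 1]
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = walk(i)
                children.append(child)
            else:
                children.append(tokens[i])  # leaf word
                i += 1
        return (label, children), i + 1
    tree, _ = walk(0)
    return tree

def leaves(node):
    """Collect the terminal words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [w for c in children for w in leaves(c)]

def find_sbar(node, out):
    """Append the word span of every SBAR subtree to out."""
    if isinstance(node, str):
        return out
    label, children = node
    if label == "SBAR":
        out.append(" ".join(leaves(node)))
    for c in children:
        find_sbar(c, out)
    return out

parse = "(S (NP (PRP She)) (VP (VBD said) (SBAR (IN that) (S (NP (PRP it)) (VP (VBD rained))))))"
print(find_sbar(parse_sexpr(parse), []))  # ['that it rained']
```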
arXiv Open Access 2025
Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark

M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş et al.

Tokenization is a fundamental preprocessing step in NLP, directly impacting large language models' (LLMs) ability to capture syntactic, morphosyntactic, and semantic structures. This paper introduces a novel framework for systematically evaluating tokenization strategies, addressing challenges in morphologically rich and low-resource languages. Using a Turkish dataset of 6,200 multiple-choice questions from the Massive Multitask Language Understanding (MMLU) benchmark, the framework assesses tokenizers across five key metrics: vocabulary size, token count, processing time, language-specific token percentage (%TR), and token purity (%Pure). These metrics provide a structured approach to evaluating how well tokenizers preserve linguistic structures. While %TR measures the proportion of valid words in the target language, %Pure assesses the alignment of tokens with meaningful linguistic units, such as roots and valid morphemes, minimizing semantic fragmentation. The findings reveal that %TR, introduced as a critical metric, exhibits a stronger correlation with downstream performance (e.g., MMLU scores) than token purity, emphasizing its role in improving model accuracy. Additionally, larger model parameters do not necessarily yield better tokenization quality or enhanced results, highlighting the importance of tailored tokenization strategies that prioritize linguistic alignment. This framework sets a new standard for developing robust tokenization methods optimized for morphologically complex and low-resource languages. Future work will refine morphological analysis, explore domain-specific customizations, and conduct cross-linguistic evaluations to further enhance tokenization practices.

en cs.CL
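The %TR metric the abstract defines, the share of tokens that are valid words in the target language, can be sketched as follows. The tokenizer output and lexicon below are toy stand-ins, not the paper's actual resources.

```python
def pct_target_language(tokens, lexicon):
    """Percentage of tokens that appear in a target-language lexicon.

    Leading subword markers (SentencePiece '▁', BPE 'Ġ') are stripped
    before lookup, since they are artifacts of the tokenizer.
    """
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.strip("▁Ġ").lower() in lexicon)
    return 100.0 * hits / len(tokens)

# Illustrative Turkish word list and toy subword output for "evlerde kitap".
turkish_lexicon = {"ev", "ler", "kitap", "okul"}
tokens = ["ev", "ler", "de", "kitap"]
print(pct_target_language(tokens, turkish_lexicon))  # 75.0
```

%Pure would use the same loop but check tokens against a set of valid roots and morphemes from a morphological analyzer instead of a word list.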
arXiv Open Access 2025
FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation

Yulia Otmakhova, Hung Thinh Truong, Rahmad Mahendra et al.

We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels -- from orthography to dialect and style -- and leverages large language models (LLMs) with human validation to generate modifications. We demonstrate FLUKE's utility by evaluating both fine-tuned models and LLMs across six diverse NLP tasks (four classification and two generation tasks), and reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) LLMs still exhibit significant brittleness to certain linguistic variations, with reasoning LLMs surprisingly showing less robustness on some tasks compared to base models, and scaling improving robustness only for surface-level modifications; (3) models are overall more brittle to natural, fluent modifications such as syntax or style changes (and especially to negation), compared to corruption-style tests such as letter flipping; (4) the ability of a model to use a linguistic feature in generation does not correlate to its robustness to this feature on downstream tasks. These findings highlight the importance of systematic robustness testing for understanding model behaviors.

en cs.CL, cs.AI
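A corruption-style test like the letter flipping the abstract mentions can be illustrated with a deterministic toy perturbation; FLUKE's actual modifications are LLM-generated and human-validated.

```python
def flip_letters(word, i):
    """Swap the letters at positions i and i+1, a minimal surface-level
    corruption of a test item. Out-of-range i leaves the word unchanged."""
    if not 0 <= i < len(word) - 1:
        return word
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(flip_letters("robustness", 2))  # 'roubstness'
```

Comparing model accuracy on original vs. perturbed items, for each linguistic level of variation, is the basic measurement the framework systematizes.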
arXiv Open Access 2025
Calibrated Generative AI as Meta-Reviewer: A Systemic Functional Linguistics Discourse Analysis of Reviews of Peer Reviews

Gabriela C. Zapata, Bill Cope, Mary Kalantzis et al.

This study investigates the use of generative AI to support formative assessment through machine-generated reviews of peer reviews in graduate online courses at a public university in the United States. Drawing on Systemic Functional Linguistics and Appraisal Theory, we analyzed 120 meta-reviews to explore how generative AI feedback constructs meaning across ideational, interpersonal, and textual dimensions. The findings suggest that generative AI can approximate key rhetorical and relational features of effective human feedback, offering directive clarity while also maintaining a supportive stance. The reviews analyzed demonstrated a balance of praise and constructive critique, alignment with rubric expectations, and structured staging that foregrounded student agency. By modeling these qualities, AI meta-feedback has the potential to scaffold feedback literacy and enhance learner engagement with peer review.

en cs.AI, cs.HC
DOAJ Open Access 2024
EXPLORING Z GENERATION ATTITUDES TOWARD VARIETIES OF ENGLISH(ES)

Laela Rohadatul Aisy, Ribut Wahyudi

English, one of the most dominant languages, has undergone transformations and divergences that have produced many varieties in different parts of the world: more than 160 acknowledged accent variations exist across the globe. Each variety, from standard English to a distinctive local form, reflects its own culture and history. This study investigates Generation Z's attitudes towards varieties of English(es) in their experiences of communicating in English as a foreign language. The research adopted a qualitative approach drawing on Saraceni's (2010) Space, Culture, Ideology and Psychology (SCIP) model to understand the variety of English(es). Four English literature students, then in their 7th semester and aged 21-22, were selected as respondents at an Islamic university (under the Ministry of Religious Affairs) in East Java, Indonesia. The results revealed that American English remains the dominant benchmark in most participants' preferences. A number of competing and interconnected factors, such as habits, motivations, and practices, together with family, social, educational, and environmental influences, shape their preferences among English(es) varieties. The participants' attitudes towards the varieties of Englishes ranged from positive, through contradictory (both positive and negative), to negative.

Language and Literature, Philology. Linguistics
DOAJ Open Access 2024
Hands-On Data Gathering

Claudia Mejía de la Pava, Silvia Manessi

Data collection is a fundamental aspect of linguistic and sociolinguistic research methodologies, and while many scholars may choose to gather their own data, others tend to rely solely on data that are readily available. There are various methods of elicitation, including conducting interviews, administering questionnaires, or leading focus groups, either as the main source of information or to complement existing material; these will most likely enhance both the quality of the research and the richness of the researcher's personal experience. This paper aims to elucidate the experience gained by two researchers who implemented different data-collection methodologies during their research projects, gathering suitable data and conducting their analyses of the English language, one written, the other oral, in two diverse yet specialized settings. They introduce their subjects of study and the communities in which the studies took place in order to help the reader get acquainted with the context, and then proceed to describe the processes they carried out. Their aim is to offer a retrospective on the decisions made, the evolution of their data-gathering approaches, the suitability of the focal methodologies, and the results. The paper seeks to contribute to the literature and applied-linguistics research by disseminating empirical information on the collection of data within different communities of practice, so that it might serve as a brief guide to encourage and assist fellow researchers should they decide to embark on their own data-collection projects.

Philology. Linguistics
arXiv Open Access 2024
Komodo: A Linguistic Expedition into Indonesia's Regional Languages

Louis Owen, Vishesh Tripathi, Abhay Kumar et al.

The recent breakthroughs in Large Language Models (LLMs) have mostly focused on languages with easily available and sufficient resources, such as English. However, a significant gap remains for languages that lack sufficient linguistic resources in the public domain. Our work introduces Komodo-7B, a family of 7-billion-parameter Large Language Models designed to address this gap by seamlessly operating across Indonesian, English, and 11 regional languages of Indonesia. The family consists of Komodo-7B-Base and Komodo-7B-Instruct. Komodo-7B-Instruct stands out by achieving state-of-the-art performance on various tasks and languages, outperforming the benchmarks set by OpenAI's GPT-3.5, Cohere's Aya-101, Llama-2-Chat-13B, Mixtral-8x7B-Instruct-v0.1, Gemma-7B-it, and many more. The model not only demonstrates superior performance in both language-specific and overall assessments but also highlights its capability to excel in linguistic diversity. Our commitment to advancing language models extends beyond well-resourced languages, aiming to bridge the gap for those with limited linguistic assets. Additionally, Komodo-7B-Instruct's better cross-language understanding helps address educational disparities in Indonesia by offering direct translations from English into 11 regional languages, a significant improvement over existing language translation services. Komodo-7B represents a crucial step towards inclusivity and effectiveness in language models, catering to the linguistic needs of diverse communities.

en cs.CL
arXiv Open Access 2024
Linguistically Grounded Analysis of Language Models using Shapley Head Values

Marcell Fekete, Johannes Bjerva

Understanding how linguistic knowledge is encoded in language models is crucial for improving their generalisation capabilities. In this paper, we investigate the processing of morphosyntactic phenomena, by leveraging a recently proposed method for probing language models via Shapley Head Values (SHVs). Using the English language BLiMP dataset, we test our approach on two widely used models, BERT and RoBERTa, and compare how linguistic constructions such as anaphor agreement and filler-gap dependencies are handled. Through quantitative pruning and qualitative clustering analysis, we demonstrate that attention heads responsible for processing related linguistic phenomena cluster together. Our results show that SHV-based attributions reveal distinct patterns across both models, providing insights into how language models organize and process linguistic information. These findings support the hypothesis that language models learn subnetworks corresponding to linguistic theory, with potential implications for cross-linguistic model analysis and interpretability in Natural Language Processing (NLP).

en cs.CL
arXiv Open Access 2024
A systematic investigation of learnability from single child linguistic input

Yulu Qin, Wentao Wang, Brenden M. Lake

Language models (LMs) have demonstrated remarkable proficiency in generating linguistically coherent text, sparking discussions about their relevance to understanding human language learnability. However, a significant gap exists between the training data for these models and the linguistic input a child receives. LMs are typically trained on data that is orders of magnitude larger and fundamentally different from child-directed speech (Warstadt and Bowman, 2022; Warstadt et al., 2023; Frank, 2023a). Addressing this discrepancy, our research focuses on training LMs on subsets of a single child's linguistic input. Previously, Wang, Vong, Kim, and Lake (2023) found that LMs trained in this setting can form syntactic and semantic word clusters and develop sensitivity to certain linguistic phenomena, but they only considered LSTMs and simpler neural networks trained from just one single-child dataset. Here, to examine the robustness of learnability from single-child input, we systematically train six different model architectures on five datasets (3 single-child and 2 baselines). We find that the models trained on single-child datasets showed consistent results that matched with previous work, underscoring the robustness of forming meaningful syntactic and semantic representations from a subset of a child's linguistic input.

en cs.CL, cs.LG
arXiv Open Access 2024
Linguistics Theory Meets LLM: Code-Switched Text Generation via Equivalence Constrained Large Language Models

Garry Kuwanto, Chaitanya Agarwal, Genta Indra Winata et al.

Code-switching, the phenomenon of alternating between two or more languages in a single conversation, presents unique challenges for Natural Language Processing (NLP). Most existing research focuses on either syntactic constraints or neural generation, with few efforts to integrate linguistic theory with large language models (LLMs) for generating natural code-switched text. In this paper, we introduce EZSwitch, a novel framework that combines Equivalence Constraint Theory (ECT) with LLMs to produce linguistically valid and fluent code-switched text. We evaluate our method using both human judgments and automatic metrics, demonstrating a significant improvement in the quality of generated code-switching sentences compared to baseline LLMs. To address the lack of suitable evaluation metrics, we conduct a comprehensive correlation study of various automatic metrics against human scores, revealing that current metrics often fail to capture the nuanced fluency of code-switched text. Additionally, we create CSPref, a human preference dataset based on human ratings and analyze model performance across "hard" and "easy" examples. Our findings indicate that incorporating linguistic constraints into LLMs leads to more robust and human-aligned generation, paving the way for scalable code-switching text generation across diverse language pairs.

en cs.CL
DOAJ Open Access 2023
Finding the ‘Right’ Irish for the New Testament

Mícheál Hoyne

An Irish translation of the New Testament was published in Dublin in 1602. This publication, and the translation work which underlay it, did not appear in a vacuum: two earlier printed books in Irish had paved the way, viz. John Carswell’s translation of Knox’s Forme of Prayer and Ministrations of the Sacraments, published in Edinburgh in 1567, and Seaán Ó Cearnaigh’s primer of the Irish language and catechism translation, published in Dublin in 1571. This paper seeks to shed light on the process by which an appropriate register was arrived at for Protestant printing in Irish, and in particular for the New Testament, through an examination of some of the linguistic and stylistic features of these texts, with regard both to decisions made by the individual translators and to sociolinguistic factors which may have limited their room to manoeuvre. These factors include contemporary conceptions of and attitudes to different language varieties, the lack of alternative models, and the nature and level of education received by individual translators. This paper builds upon the pioneering research of Ailbhe Ó Corráin (2013) to show that linguistically that portion of the Irish New Testament completed after 1597 has a more colloquial and dialectal quality than that which preceded it. This is tentatively connected with specific changes in the team responsible.

Philology. Linguistics
DOAJ Open Access 2023
The Expected Effects of the Digital Markets Act and the Digital Services Act in the German Legal System from the Perspective of Hungarian Practice

Miklós Szirbik, Sára Bernáth

According to the current strategy of the Nemzeti Média- és Hírközlési Hatóság (National Media and Infocommunications Authority), published in January 2023 under the title "Freedom and Security in a Digitalising World", one of the authority's priority tasks is to examine the domestic regulatory needs arising from, and supplementing, the application of the European Union regulations on digital services and on digital markets, and to provide regulatory and expert support for the preparation of regulatory proposals. In March 2023, the Gazdasági Versenyhivatal (Hungarian Competition Authority) gained membership in the high-level international working group that assists the European Commission in enforcing the rules of the regulation on digital markets. This study seeks to offer professional support for the strategies and statutory tasks of the Nemzeti Média- és Hírközlési Hatóság, the Gazdasági Versenyhivatal, and other relevant domestic decision-makers and sectoral actors by presenting in detail the changes and practices taking shape in the German legal system in connection with the legal acts on digital services and digital markets. The research is justified by the size of the German market and the many legal cases it generates. A further important factor is that German society is traditionally sensitive to the relationships between data protection, media and democracy, and a sustainable economy, so many impulses can be expected from Germany during the implementation of the Digital Services and Digital Markets Regulations; as one of the few such Member States, Germany intervened in the processes of digital markets even before the EU legislation, using competition-law and media-regulation instruments laid down in its own national law.

Communication. Mass media, Property
arXiv Open Access 2023
Founding a mathematical diffusion model in linguistics. The case study of German syntactic features in the North-Eastern Italian dialects

I. Lazzizzera

The initial motivation for this work was the linguistic case of the spread of Germanic syntactic features into the Romance dialects of North-Eastern Italy, which occurred after the immigration of German people to Tyrol during the High Middle Ages. To obtain a representation of the data over the territory suitable for a mathematical formulation, an interactive map is produced as a first step, using tools of what is called Geographic Data Science. A smooth two-dimensional surface G is introduced, expressing locally the fraction of territory that uses a given German language feature: it is obtained by a piecewise-cubic, curvature-minimizing interpolant of the discrete function that records whether each surveyed locality uses that feature or not. This surface G is taken as the present-time value of a function describing a diffusion-convection phenomenon in two dimensions (here called tidal mode), which is subjected in a very natural way to the same equation used in physics, introducing a contextual diffusivity concept. It is shown that, under two different assumptions about diffusivity, solutions of this equation evaluated at the present time fit well with the data interpolated by G, thus providing two different, convincing pictures of diffusion-convection in the case under study, albeit with simplifications and approximations. Very importantly, it is shown that the linguistic diffusion model known to linguists as Schmidt waves can be counted among the solutions of the diffusion equation.
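The two-dimensional diffusion-convection dynamics the abstract refers to can be written in a standard form; the symbols below are generic assumptions, not necessarily the paper's notation.

```latex
% u(x, y, t): local fraction of territory using the Germanic feature,
% with the interpolated surface G taken as its present-time value.
\frac{\partial u}{\partial t}
  = \nabla \cdot \bigl( D(x, y)\, \nabla u \bigr)
  - \mathbf{v} \cdot \nabla u
% D(x, y): spatially varying ("contextual") diffusivity
% \mathbf{v}: convection (drift) velocity field
```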

S2 Open Access 2022
A New Literary History of the Long Twelfth Century

Mark Faulkner

A New Literary History of the Long Twelfth Century offers a new narrative of what happened to English language writing in the long twelfth century, the period that saw the end of the Old English tradition and the beginning of Middle English writing. It discusses numerous neglected or unknown texts, focusing particularly on documents, chronicles and sermons. To tell the story of this pivotal period, it adopts approaches from both literary criticism and historical linguistics, finding a synthesis for them in a twenty-first century philology. It develops new methodologies for addressing major questions about twelfth-century texts, including when they were written, how they were read and their relationship to earlier works. Essential reading for anyone interested in what happened to English after the Norman Conquest, this study lays the groundwork for the coming decade's work on transitional English.

CrossRef Open Access 2022
Expressive Syntax in Russian Linguistics: A Systematic Review

Yelena Nickolaevna Vorobyova

The purpose of the systematic review is to identify the main areas in the study of expressive syntax in Russian linguistics. Scientific novelty of the paper lies in the fact that it reviews and systematises the works of Soviet and Russian linguists up to 2021, which are devoted to the syntactic aspect of expressiveness research using the material of Russian and foreign languages. The result of the conducted analysis amounts to defining the main areas and identifying the key tasks of this actively developing branch of syntactic research, determining the most discussed expressive syntactic forms, as well as describing the methodology of their research.

3 citations · en
DOAJ Open Access 2022
Political Metalanguage and the Production of Ideology

Juan Alonso Aldama

The aim of this article is to describe the semiotic mechanisms of the semantic correlations that political communication creates in order to turn them into "political grammars" and, therefore, into "norms". The point, then, is to study this transformation through operations of "synthesis of the heterogeneous", carried out by linking elements of political discourse with others belonging to semantic paradigms that are in principle completely alien to it, with the aim of producing explanatory generalisations of the political world as a whole. Operations of this kind, which begin at the level of texts and end up producing political metasemiotics, are possible thanks to a specific property of language, namely its "elasticity".

Communication. Mass media
arXiv Open Access 2022
ImmunoLingo: Linguistics-based formalization of the antibody language

Mai Ha Vu, Philippe A. Robert, Rahmad Akbar et al.

Apparent parallels between natural language and biological sequence have led to a recent surge in the application of deep language models (LMs) to the analysis of antibody and other biological sequences. However, a lack of a rigorous linguistic formalization of biological sequence languages, which would define basic components, such as lexicon (i.e., the discrete units of the language) and grammar (i.e., the rules that link sequence well-formedness, structure, and meaning) has led to largely domain-unspecific applications of LMs, which do not take into account the underlying structure of the biological sequences studied. A linguistic formalization, on the other hand, establishes linguistically-informed and thus domain-adapted components for LM applications. It would facilitate a better understanding of how differences and similarities between natural language and biological sequences influence the quality of LMs, which is crucial for the design of interpretable models with extractable sequence-function relationship rules, such as the ones underlying the antibody specificity prediction problem. Deciphering the rules of antibody specificity is crucial to accelerating rational and in silico biotherapeutic drug design. Here, we formalize the properties of the antibody language and thereby establish not only a foundation for the application of linguistic tools in adaptive immune receptor analysis but also for the systematic immunolinguistic studies of immune receptor specificity in general.

en q-bio.QM, cs.LG
arXiv Open Access 2022
Helpful Neighbors: Leveraging Neighbors in Geographic Feature Pronunciation

Llion Jones, Richard Sproat, Haruko Ishikawa et al.

If one sees the place name Houston Mercer Dog Run in New York, how does one know how to pronounce it? Assuming one knows that Houston in New York is pronounced "how-ston" and not like the Texas city, then one can probably guess that "how-ston" is also used in the name of the dog park. We present a novel architecture that learns to use the pronunciations of neighboring names in order to guess the pronunciation of a given target feature. Applied to Japanese place names, we demonstrate the utility of the model to finding and proposing corrections for errors in Google Maps. To demonstrate the utility of this approach to structurally similar problems, we also report on an application to a totally different task: Cognate reflex prediction in comparative historical linguistics. A version of the code has been open-sourced (https://github.com/google-research/google-research/tree/master/cognate_inpaint_neighbors).

Page 17 of 39,693