Results for "Language. Linguistic theory. Comparative grammar"

Showing 20 of ~4,441,954 results · from CrossRef, DOAJ, arXiv, Semantic Scholar

DOAJ Open Access 2025
Rethinking the Simulation Theory

Robert M. Gordon

This paper revisits the Simulation Theory (ST) as a framework for understanding human social cognition, challenging traditional ‘theory of mind’ or ‘folk psychology’ approaches. While these theory-based models posit that humans use an implicit body of knowledge to interpret and predict others’ behavior, ST emphasizes the use of mental simulation, leveraging the brain’s existing mechanisms for planning and prediction. By employing a predictive coding strategy, the brain minimizes cognitive load, interpreting others' actions through ‘inverse planning’, a process that reuses one’s own action planning system to hypothesize the goals and intentions of others. The concept of agent-neutral coding is introduced, proposing that inputs for self and others are initially shared, reducing the need for explicit mental state attributions. This approach not only economizes cognitive resources but also aligns with evolutionary perspectives on human social interaction in small, cohesive groups. In addition, the paper explores the role of perspective-taking and error correction in adapting shared mental representations. This reevaluation of ST underscores its efficiency and adaptability, offering a streamlined alternative to theory-based accounts of social cognition.

Aesthetics, Language. Linguistic theory. Comparative grammar
DOAJ Open Access 2025
Eksplikacja znaku percepcyjnego zapachu w świadomości użytkowników języka polskiego oraz w definicjach słownikowych (na materiale eksperymentu psycholingwistycznego)

Alina Yodlovska

EXPLICATION OF THE PERCEPTUAL SIGN "SMELL" IN THE MINDS OF POLISH LANGUAGE USERS AND IN DICTIONARY DEFINITIONS (BASED ON THE MATERIAL OF A PSYCHOLINGUISTIC EXPERIMENT)

The problem of explicating perceptual features in dictionary definitions is still underdeveloped in linguistics. Each interpretation of a meaning in a dictionary is one of the possible variants of its description. The aim of my research is to identify and characterise the correlation between the content of the definitions of words with the perceptual feature "smell" and the content of the corresponding concepts in the minds of Polish language users. The following methods were used in the study: analysis of dictionary definitions, conceptual analysis, and a psycholinguistic experiment. Using the questions from the experiment, I was able to identify 12 terms that respondents mention most often (over 100 times) and associate with smell, and to check dictionary definitions for the presence of a perceptual sign in their structure. A perceptual sign (perceptual feature) in linguistics refers to a meaningful unit whose meaning is derived from perception, i.e. perception by the senses. It is a sign that can be perceived through the senses, such as sight, hearing or touch, and which has a certain meaning for the perceiver. The perceptual sign of smell in Polish dictionaries is represented by lexemes such as aromatic, fragrant, pleasant, pungent, and unpleasant smell. Dictionary definitions mostly fail to capture the perception of smell: the realities that respondents most often mentioned do not include the perception of smell but are presented descriptively.

Language. Linguistic theory. Comparative grammar
arXiv Open Access 2025
SAGE: Specification-Aware Grammar Extraction for Automated Test Case Generation with LLMs

Aditi, Hyunwoo Park, Sicheol Sung et al.

Grammar-based test case generation has proven effective for competitive programming problems, but generating valid and general grammars from natural language specifications remains a key challenge, especially under limited supervision. Context-Free Grammars with Counters (CCFGs) have recently been introduced as a formalism to represent such specifications with logical constraints by storing and reusing counter values during derivation. In this work, we explore the use of open-source large language models (LLMs) to induce CCFGs from specifications using a small number of labeled examples and verifiable reward-guided reinforcement learning. Our approach first fine-tunes an open-source LLM to perform specification-to-grammar translation, and further applies Group Relative Policy Optimization (GRPO) to enhance grammar validity and generality. We also examine the effectiveness of iterative feedback for open and closed-source LLMs in correcting syntactic and semantic errors in generated grammars. Experimental results show that our approach SAGE achieves stronger generalization and outperforms 17 open and closed-source LLMs in both grammar quality and test effectiveness, improving over the state-of-the-art by 15.92%p in grammar validity and 12.34%p in test effectiveness. We provide our implementation and dataset at the following anonymous repository: https://anonymous.4open.science/r/SAGE-5714

en cs.CL
arXiv Open Access 2025
LengClaro2023: A Dataset of Administrative Texts in Spanish with Plain Language adaptations

Belén Agüera-Marco, Itziar Gonzalez-Dios

In this work, we present LengClaro2023, a dataset of legal-administrative texts in Spanish. Based on the most frequently used procedures from the Spanish Social Security website, we have created for each text two simplified equivalents. The first version follows the recommendations provided by arText claro. The second version incorporates additional recommendations from plain language guidelines to explore further potential improvements in the system. The linguistic resource created in this work can be used for evaluating automatic text simplification (ATS) systems in Spanish.

en cs.CL, cs.AI
arXiv Open Access 2025
A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks

Hieu Minh "Jord" Nguyen

Theory of Mind (ToM), the ability to attribute mental states to others and predict their behaviour, is fundamental to social intelligence. In this paper, we survey studies evaluating behavioural and representational ToM in Large Language Models (LLMs), identify important safety risks from advanced LLM ToM capabilities, and suggest several research directions for effective evaluation and mitigation of these risks.

en cs.CL, cs.AI
arXiv Open Access 2025
Generative Linguistics, Large Language Models, and the Social Nature of Scientific Success

Sophie Hao

Chesi's (forthcoming) target paper depicts a generative linguistics in crisis, foreboded by Piantadosi's (2023) declaration that "modern language models refute Chomsky's approach to language." In order to survive, Chesi warns, generativists must hold themselves to higher standards of formal and empirical rigor. This response argues that the crisis described by Chesi and Piantadosi actually has little to do with rigor, but is rather a reflection of generativists' limited social ambitions. Chesi ties the fate of generative linguistics to its intellectual merits, but the current success of language model research is social in nature as much as it is intellectual. In order to thrive, then, generativists must do more than heed Chesi's call for rigor; they must also expand their ambitions by giving outsiders a stake in their future success.

en cs.CL
DOAJ Open Access 2024
Is Transitivity Register-specific? A Corpus Exploration of Online Science News

Bonifacio Tala Cunanan

This paper deals with parsing and exploring transitivity in ten online science news articles from one of the leading national broadsheets in the Philippines. The corpus analysis used two software applications: AntConc and UAM Corpus. Specifically, this study compared the transitivity features of the ten online science news articles with the transitivity features of non-science-based registers. The transitivity analysis of the science-based and the non-science-based registers resulted in identifying and classifying the underlying semantic roles of Participants, Process types, and Circumstances. Identifying the semantic features across the registers in turn revealed the prominent and foregrounded features of transitivity in the corpus. This study reiterates that foregrounding is crucial in the stylistic analysis of texts. The analysis established that while the non-science-based registers use all six Process types, the science-based registers primarily use the Material and the Relational Process types. This disproportionate occurrence of the Process types, especially the non-use of the Behavioral Process type, strongly corroborates the notion that transitivity is register-specific. The present study supports a systemic and functional stance that considers language a network of choices applicable across varied genres and registers. It also recommends the use of software applications in language pedagogy. Finally, the findings may facilitate teaching L2 academic writing and producing science-based materials.

Language. Linguistic theory. Comparative grammar
DOAJ Open Access 2024
Mannsnavnet Gautulv ‒ Guttul

Atle Steinar Langekiehl

The region Østfold/Follo in Norway had three extremely rare male personal names, Gautulv, Sakulv and Sjøfar, the first of which is the topic of this article. The etymology of the Norse Gautulfr is the wolf from Götaland. A Swedish rune stone mentions Gautulv, and the first Swedish written medieval sources for the name precede the Norwegian ones, although most of the namesakes lived in Norway. The nobility figures far more prominently than other social groups in medieval sources, and in Norway, the first known Gautulvs and people with the patronym Gautulvsson undoubtedly belonged to the nobility. The name Gautulv is also present in five anthropotoponyms: one in Østfold, three in Vestfold on the opposite side of the Oslofjord and one in Trøndelag. Later, Guttul became the most commonly used form of this anthroponym, which probably went extinct in Norway when the farmer Guttul Hansen Søtland died in Trøgstad in Østfold in 1797.

Language. Linguistic theory. Comparative grammar
arXiv Open Access 2024
Grammar-based Game Description Generation using Large Language Models

Tsunehiko Tanaka, Edgar Simo-Serra

Game Description Language (GDL) provides a standardized way to express diverse games in a machine-readable format, enabling automated game simulation and evaluation. While previous research has explored game description generation using search-based methods, generating GDL descriptions from natural language remains a challenging task. This paper presents a novel framework that leverages Large Language Models (LLMs) to generate grammatically accurate game descriptions from natural language. Our approach consists of two stages: first, we gradually generate a minimal grammar based on GDL specifications; second, we iteratively improve the game description through grammar-guided generation. Our framework employs a specialized parser that identifies valid subsequences and candidate symbols from LLM responses, enabling gradual refinement of the output to ensure grammatical correctness. Experimental results demonstrate that our iterative improvement approach significantly outperforms baseline methods that directly use LLM outputs. Our code is available at https://github.com/tsunehiko/ggdg

arXiv Open Access 2024
PRODIS -- a speech database and a phoneme-based language model for the study of predictability effects in Polish

Zofia Malisz, Jan Foremski, Małgorzata Kul

We present a speech database and a phoneme-level language model of Polish. The database and model are designed for the analysis of prosodic and discourse factors and their impact on acoustic parameters in interaction with predictability effects. The database is also the first large, publicly available Polish speech corpus of excellent acoustic quality that can be used for phonetic analysis and training of multi-speaker speech technology systems. The speech in the database is processed in a pipeline that achieves a 90% degree of automation. It incorporates state-of-the-art, freely available tools enabling database expansion or adaptation to additional languages.

en cs.CL, cs.SD
arXiv Open Access 2024
Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks

Zhifan Sun, Antonio Valerio Miceli-Barone

Large Language Models (LLMs) are increasingly becoming the preferred foundation platforms for many Natural Language Processing tasks such as Machine Translation, owing to their quality often comparable to or better than task-specific models, and the simplicity of specifying the task through natural language instructions or in-context examples. Their generality, however, opens them up to subversion by end users who may embed into their requests instructions that cause the model to behave in unauthorized and possibly unsafe ways. In this work we study these Prompt Injection Attacks (PIAs) on multiple families of LLMs on a Machine Translation task, focusing on the effects of model size on the attack success rates. We introduce a new benchmark data set and we discover that on multiple language pairs and injected prompts written in English, larger models under certain conditions may become more susceptible to successful attacks, an instance of the Inverse Scaling phenomenon (McKenzie et al., 2023). To our knowledge, this is the first work to study non-trivial LLM scaling behaviour in a multi-lingual setting.

en cs.CL
arXiv Open Access 2024
Using Grammar Masking to Ensure Syntactic Validity in LLM-based Modeling Tasks

Lukas Netz, Jan Reimer, Bernhard Rumpe

We present and evaluate a method called grammar masking, which is used to guide large language models (LLMs) toward producing syntactically correct models for a given context-free grammar. Prompt engineering methods such as few-shot learning or priming can be used to improve the chances of an LLM producing correct syntax, but the more complex the grammar, the more time-consuming and less promising these methods become. Previous work is focused primarily on the usage of either language model training or prompt engineering. In this work, a method is presented that restricts the output to a given grammar using constrained decoding to ensure the output adheres to a valid syntax. We use several DSLs built with MontiCore and task multiple LLMs to produce models with and without constrained decoding. A corresponding parser is used to confirm the syntactic correctness of each model. We show that grammar masking can dramatically improve the modeling capabilities of several LLMs, reducing the need for well-refined prompting while increasing the chance of producing correct models.
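
As a self-contained illustration of the constrained-decoding idea in this abstract, here is a minimal Python sketch. The toy right-linear grammar, the single-character token vocabulary, and the fixed logits are all invented for illustration; they stand in for MontiCore DSL grammars and real LLM logit vectors.

```python
import math

# Toy right-linear grammar over single-character tokens:
#   S -> 'a' A ;  A -> 'b' A | 'c'
def allowed_next(prefix: str) -> set:
    """Tokens that can legally extend `prefix` under the toy grammar."""
    if prefix == "":
        return {"a"}
    if prefix.startswith("a") and all(ch == "b" for ch in prefix[1:]):
        return {"b", "c"}          # still deriving nonterminal A
    return set()                   # derivation complete (or invalid)

def mask_logits(logits: dict, prefix: str) -> dict:
    """Grammar masking: push logits of illegal tokens to -inf."""
    legal = allowed_next(prefix)
    return {t: (v if t in legal else -math.inf) for t, v in logits.items()}

# Greedy decoding with fixed, made-up logits; "x" is never grammatical.
vocab_logits = {"a": 0.1, "b": 0.2, "c": 0.5, "x": 2.0}
out = ""
while True:
    masked = mask_logits(vocab_logits, out)
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:  # no legal continuation: stop
        break
    out += best
print(out)  # -> "ac": a valid sentence; the high-logit "x" never appears
```

In the actual method, the masking step would operate on the model's logit vector at each decoding step, with legality decided by a parser over the DSL grammar rather than a hand-written predicate.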

en cs.CL, cs.AI
DOAJ Open Access 2023
An ever-changing world: Interconnections in the Ancient Near East and the Eastern Mediterranean (ca. 3500-300 BC)

António José Gonçalves de Freitas, Roxana Flammini

The Ancient Near East and the Eastern Mediterranean were geographical and sociopolitical scenarios with fluent and constant connectivity from the earliest times in history. Prestige goods and raw materials found their way from one side to another through extensive networks even before the emergence of the state in Egypt and Mesopotamia, integrating movements not only of goods but also of people, technologies, cultural practices, gods, languages, and ideas (Wilkinson et al. 2011; Warburton 2020). In this volume, we named them “interconnections” to precisely emphasize the relevance of exchange in the adoption, modification, or re-adaptation of foreign traces. The influence of incoming technologies and the shaping of identities in such a dynamic world, always moving, is also considered. Naturally, many diverse theoretical approaches were proposed over time to explain those interconnections, contributing to completing the never-ending panorama of relationships (e.g. Warburton 2020: 1-21). At the same time, nowadays a comprehensive amount of evidence is usually considered in explaining those interconnections, mainly material remains, textual registers, and iconography.

History of scholarship and learning. The humanities, Philology. Linguistics
arXiv Open Access 2023
Large Language Models are legal but they are not: Making the case for a powerful LegalLLM

Thanmay Jayakumar, Fauzan Farooqui, Luqman Farooqui

Realizing the recent advances in Natural Language Processing (NLP) in the legal sector poses challenging problems such as extremely long sequence lengths, specialized vocabulary that is usually only understood by legal professionals, and high amounts of data imbalance. The recent surge of Large Language Models (LLMs) has begun to provide new opportunities to apply NLP in the legal domain due to their ability to handle lengthy, complex sequences. Moreover, the emergence of domain-specific LLMs has displayed extremely promising results on various tasks. In this study, we aim to quantify how general LLMs perform in comparison to legal-domain models (be it an LLM or otherwise). Specifically, we compare the zero-shot performance of three general-purpose LLMs (ChatGPT-20b, LLaMA-2-70b, and Falcon-180b) on the LEDGAR subset of the LexGLUE benchmark for contract provision classification. Although the LLMs were not explicitly trained on legal data, we observe that they are still able to classify the theme correctly in most cases. However, we find that their mic-F1/mac-F1 performance is up to 19.2/26.8% lower than that of smaller models fine-tuned on the legal domain, thus underscoring the need for more powerful legal-domain LLMs.
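
For readers unfamiliar with the two averages in the abstract above, micro- and macro-F1 can be computed by hand. The sketch below uses an invented toy label set and predictions, not LEDGAR data:

```python
# Micro-F1 pools true/false positives and negatives over all classes
# (equal to accuracy for single-label classification); macro-F1 averages
# per-class F1, so rare classes weigh as much as frequent ones.
gold = ["lease", "lease", "indemnity", "waiver", "waiver", "waiver"]
pred = ["lease", "indemnity", "indemnity", "waiver", "lease", "waiver"]

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

labels = sorted(set(gold))
per_class = []
tp_all = fp_all = fn_all = 0
for c in labels:
    tp = sum(g == p == c for g, p in zip(gold, pred))
    fp = sum(p == c and g != c for g, p in zip(gold, pred))
    fn = sum(g == c and p != c for g, p in zip(gold, pred))
    per_class.append(f1(tp, fp, fn))
    tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn

micro_f1 = f1(tp_all, fp_all, fn_all)
macro_f1 = sum(per_class) / len(labels)
print(round(micro_f1, 3), round(macro_f1, 3))  # -> 0.667 0.656
```

The gap between the two numbers is exactly what a "mic-F1/mac-F1" pair like the one in the abstract reports: here the model does worse on average per class (0.656) than it does overall (0.667).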

en cs.CL
DOAJ Open Access 2022
Shadow Education in Hong Kong: An Insight From Local Private Tutors

Richard Ching Ho Cheng

Shadow education has remained a supportive role in many countries, and studies have shown that it has had both positive and negative impacts on students’ academic performance. However, the views so far for private tutors have often been neglected by researchers. Private tutors are one of the important stakeholders in shadow education since they are the knowledge providers and facilitators in the classroom. Their opinions can help to show a more realistic picture of shadow education in Hong Kong. In this study, the focus is on investigating Hong Kong shadow education from private tutors’ perspectives. There were 20 private tutors from local private tutorial centres participating in this study, and they were invited to individual interviews to express their ideas about shadow education in Hong Kong. Thematic analysis was used to organize and analyze the data in this study. The results showed that private tutors felt shadow education in Hong Kong is too ‘materialistic,’ and sometimes they felt lost when teaching because of the result-oriented atmosphere in the Hong Kong education system. Furthermore, social inequalities and washback were reported as well. This has further highlighted some of the negative impacts brought by shadow education.

Language acquisition
DOAJ Open Access 2022
Il rapporto tra livello di difficoltà e livello di conoscenza della lingua nei testi delle prove di comprensione scritta della lingua italiana

Antonio Venturis

This study aims to show the readability limits of B- and C-level Italian texts that can be used for reading assessment, and the characteristics on which those limits depend. For this purpose, 184 Italian texts used in the certification exams of the KPG system were analysed in an attempt to reveal the main traits (quantitative, lexical and syntactic) of the texts defining their readability and their correlation with the language level for which they are intended. The correlation analysis indicated that cognitive characteristics are more important than quantitative characteristics, while a high correlation of lexical variables with the B1 and B2 levels and of syntactic variables with the C1 and C2 levels emerged.

Language. Linguistic theory. Comparative grammar
arXiv Open Access 2022
MALM: Mixing Augmented Language Modeling for Zero-Shot Machine Translation

Kshitij Gupta

Large pre-trained language models have brought remarkable progress in NLP. Pre-training and fine-tuning have given state-of-the-art performance across tasks in text processing. Data augmentation techniques have also helped build state-of-the-art models on low- or zero-resource tasks. Many works in the past have attempted to learn a single massively-multilingual machine translation model for zero-shot translation. Although those translation models produce correct translations, the main challenge is that they often produce the wrong language in the zero-shot setting. This work and its results indicate that prompt-conditioned large models do not suffer from off-target language errors, i.e. errors arising from translating into the wrong language. We empirically demonstrate the effectiveness of self-supervised pre-training and data augmentation for zero-shot multilingual machine translation.

en cs.CL, cs.LG
S2 Open Access 2021
Spanish South American and Brazilian Demonyms: Morphosyntactic Structure and Axiological Values

O. Chesnokova, M. Radović et al.

This article contributes to the study of the grammar, semantics, and expressive values of a very special onymic category: inhabitant names, or demonyms. The authors examine and compare demonyms in two varieties of Romance languages, South American Spanish and Brazilian Portuguese, from both a multi-modal point of view and empirical evidence (synchronic and diachronic data), and systemize them by concentrating on their morphosyntactic distinctive features and semiotic content. The perspective revealed in this paper aims to study, define and analyze the main tendencies in the creation of denominative adjectives functioning as demonyms, and to establish the parameters of discrepancies and convergences regarding demonyms of the region in question. The analysis draws on motivation sources, grammar formulae, suffix combinations, socio-cultural varieties and the occurrence of homonymy, as well as their significance and axiological values in forming a group identity and the stereotypes they might lead to. The methodology includes semiotic, cognitive, historic and comparative analysis - a synchronic conceptual transdisciplinary analysis. The results of the study confirm the premise that demonym suffixes alternate randomly in derivation processes, but also demonstrate the existence of other morphological and pragmatic trends, while outlining the role of demonyms in migration processes, the formation of a linguistic landscape, and axiological values.

3 citations en Sociology
arXiv Open Access 2021
Linguistic Dependencies and Statistical Dependence

Jacob Louis Hoover, Alessandro Sordoni, Wenyu Du et al.

Are pairs of words that tend to occur together also likely to stand in a linguistic dependency? This empirical question is motivated by a long history of literature in cognitive science, psycholinguistics, and NLP. In this work we contribute an extensive analysis of the relationship between linguistic dependencies and statistical dependence between words. Improving on previous work, we introduce the use of large pretrained language models to compute contextualized estimates of the pointwise mutual information between words (CPMI). For multiple models and languages, we extract dependency trees which maximize CPMI, and compare to gold standard linguistic dependencies. Overall, we find that CPMI dependencies achieve an unlabelled undirected attachment score of at most $\approx 0.5$. While far above chance, and consistently above a non-contextualized PMI baseline, this score is generally comparable to a simple baseline formed by connecting adjacent words. We analyze which kinds of linguistic dependencies are best captured in CPMI dependencies, and also find marked differences between the estimates of the large pretrained language models, illustrating how their different training schemes affect the type of dependencies they capture.
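
The pipeline in this abstract (score word pairs, extract a maximum spanning tree, compare against gold dependencies) can be sketched end-to-end with plain corpus-level PMI standing in for the paper's contextualized CPMI estimates. The co-occurrence counts and the gold tree below are invented for illustration:

```python
import math

# Invented co-occurrence statistics for a 4-word toy sentence.
sent = ["the", "dog", "chased", "cats"]
pair_count = {frozenset(p): n for p, n in [
    (("the", "dog"), 50), (("dog", "chased"), 30), (("chased", "cats"), 25),
    (("the", "chased"), 5), (("the", "cats"), 4), (("dog", "cats"), 6)]}
word_count = {"the": 100, "dog": 60, "chased": 40, "cats": 35}
total = 1000  # total token count of the toy corpus

def pmi(w1, w2):
    """Pointwise mutual information: log p(x,y) / (p(x) p(y))."""
    p_xy = pair_count[frozenset((w1, w2))] / total
    return math.log(p_xy / ((word_count[w1] / total) * (word_count[w2] / total)))

# Prim's algorithm: maximum spanning tree over the sentence's words,
# using PMI as the edge weight.
in_tree, edges = {sent[0]}, []
while len(in_tree) < len(sent):
    u, v = max(((a, b) for a in in_tree for b in sent if b not in in_tree),
               key=lambda e: pmi(*e))
    edges.append((u, v))
    in_tree.add(v)

# Unlabelled undirected attachment score against an invented gold tree.
gold = {frozenset(("the", "dog")), frozenset(("dog", "chased")),
        frozenset(("chased", "cats"))}
uuas = sum(frozenset(e) in gold for e in edges) / len(gold)
print(edges, uuas)  # on this toy example the PMI tree recovers the gold tree
```

The paper replaces these global counts with per-sentence CPMI estimates from pretrained language models; the spanning-tree extraction and the attachment-score evaluation are conceptually the same.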

en cs.CL, cs.IT

Page 39 of 222098