Results for "Language. Linguistic theory. Comparative grammar"

Showing 20 of ~4,435,338 results · from DOAJ, arXiv, Semantic Scholar, CrossRef

DOAJ Open Access 2026
Language in Education Policy Debates in Africa Today

Prosperous Nankindu, Henry Hollan Ssembatya

Using critical research methods based on document analysis, this study investigated current Language in Education Policy (LiEP) debates in Africa. Many such debates are ongoing across the continent, but the available literature is thin. Three multilingual African countries, the Federal Republic of Nigeria, the Republic of Congo, and the Islamic Republic of Mauritania, were selected for the study. The study found that colonial languages are dominant and serve as the main media of instruction and languages of assessment in schools in the sampled countries. Although only three countries were selected, the debates are no different in the rest of the African countries; the findings are therefore generalisable across the entire continent and critical for influencing future LiEP on the continent. It is imperative to note that the use of colonial languages in education should not come at the expense of African languages.

Language. Linguistic theory. Comparative grammar, Computational linguistics. Natural language processing
arXiv Open Access 2025
A First Context-Free Grammar Applied to Nawatl Corpora Augmentation

Juan-José Guzmán-Landa, Juan-Manuel Torres-Moreno, Miguel Figueroa-Saavedra et al.

In this article we introduce a context-free grammar (CFG) for the Nawatl language. Nawatl (or Nahuatl) is an Amerindian language of the π-language type, i.e. a language with few digital resources, for which the corpora available for machine learning are virtually non-existent. The objective here is to generate a significant number of grammatically correct artificial sentences in order to increase the corpora available for language model training. We want to show that a grammar enables us to significantly expand a corpus in Nawatl which we call π-YALLI. The corpus, thus enriched, enables us to train algorithms such as FastText and to evaluate them on sentence-level semantic tasks. Preliminary results show that by using the grammar, comparative improvements are achieved over some LLMs. However, to achieve more significant improvements, grammars that model the Nawatl language even more effectively are required.
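The augmentation idea can be illustrated with a toy CFG expander. The rules and words below are invented placeholders, not the actual Nawatl grammar from the paper:

```python
import random

# Toy context-free grammar: each nonterminal maps to a list of
# possible expansions (sequences of nonterminals/terminals).
# Rules and vocabulary are illustrative placeholders only.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["N"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["in"]],
    "N":   [["siwatl"], ["kalli"]],
    "V":   [["kochi"], ["kwa"]],
}

def generate(symbol="S", rng=random):
    """Recursively expand a symbol into a list of terminal words."""
    if symbol not in GRAMMAR:          # terminal: emit as-is
        return [symbol]
    expansion = rng.choice(GRAMMAR[symbol])
    words = []
    for sym in expansion:
        words.extend(generate(sym, rng))
    return words

# Generate a small artificial corpus of grammatically valid sentences.
corpus = [" ".join(generate()) for _ in range(5)]
```

Every sentence produced this way is grammatical by construction, which is what makes CFG expansion attractive for low-resource corpus augmentation.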

en cs.CL, cs.AI
arXiv Open Access 2025
AI Thinking as a Meaning-Centered Framework: Reimagining Language Technologies Through Community Agency

Jose F Quesada

While language technologies have advanced significantly, current approaches fail to address the complex sociocultural dimensions of linguistic preservation. AI Thinking proposes a meaning-centered framework that would transform technological development from creating tools FOR communities to co-creating solutions WITH them. This approach recognizes that meaningful solutions emerge through the interplay of cultural understanding, community agency, and technological innovation. The proposal articulates a holistic methodology and a five-layer technological ecosystem where communities maintain control over their linguistic and cultural knowledge representation. This systematic integration of community needs, cultural preservation, and advanced capabilities could revolutionize how we approach linguistic diversity preservation in the digital age.

en cs.CL, cs.AI
arXiv Open Access 2025
PL-Guard: Benchmarking Language Model Safety for Polish

Aleksandra Krasnodębska, Karolina Seweryn, Szymon Łukasik et al.

Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of global languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.

en cs.CL
arXiv Open Access 2024
Toward Cultural Interpretability: A Linguistic Anthropological Framework for Describing and Evaluating Large Language Models (LLMs)

Graham M. Jones, Shai Satran, Arvind Satyanarayan

This article proposes a new integration of linguistic anthropology and machine learning (ML) around convergent interests in both the underpinnings of language and making language technologies more socially responsible. While linguistic anthropology focuses on interpreting the cultural basis for human language use, the ML field of interpretability is concerned with uncovering the patterns that Large Language Models (LLMs) learn from human verbal behavior. Through the analysis of a conversation between a human user and an LLM-powered chatbot, we demonstrate the theoretical feasibility of a new, conjoint field of inquiry, cultural interpretability (CI). By focusing attention on the communicative competence involved in the way human users and AI chatbots co-produce meaning in the articulatory interface of human-computer interaction, CI emphasizes how the dynamic relationship between language and culture makes contextually sensitive, open-ended conversation possible. We suggest that, by examining how LLMs internally "represent" relationships between language and culture, CI can: (1) provide insight into long-standing linguistic anthropological questions about the patterning of those relationships; and (2) aid model developers and interface designers in improving value alignment between language models and stylistically diverse speakers and culturally diverse speech communities. Our discussion proposes three critical research axes: relativity, variation, and indexicality.

en cs.CY, cs.CL
arXiv Open Access 2024
Integrating Emotional and Linguistic Models for Ethical Compliance in Large Language Models

Edward Y. Chang

This research develops advanced methodologies for Large Language Models (LLMs) to better manage linguistic behaviors related to emotions and ethics. We introduce DIKE, an adversarial framework that enhances the LLMs' ability to internalize and reflect global human values, adapting to varied cultural contexts to promote transparency and trust among users. The methodology involves detailed modeling of emotions, classification of linguistic behaviors, and implementation of ethical guardrails. Our innovative approaches include mapping emotions and behaviors using self-supervised learning techniques, refining these guardrails through adversarial reviews, and systematically adjusting outputs to ensure ethical alignment. This framework establishes a robust foundation for AI systems to operate with ethical integrity and cultural sensitivity, paving the way for more responsible and context-aware AI interactions.

en cs.CL, cs.AI
arXiv Open Access 2024
Morphological evaluation of subwords vocabulary used by BETO language model

Óscar García-Sierra, Ana Fernández-Pampillón Cesteros, Miguel Ortega-Martín

Subword tokenization algorithms used by Large Language Models are significantly more efficient and can independently build the necessary vocabulary of words and subwords without human intervention. However, those subwords do not always align with real morphemes, potentially impacting the models' performance, though it remains uncertain when this might occur. In previous research, we proposed a method to assess the morphological quality of vocabularies, focusing on the overlap between these vocabularies and the morphemes of a given language. Our evaluation method was built on three quality measures (relevance, cohesion, and morphological accuracy) and a procedure for their assessment. By applying this method to vocabularies created by three subword tokenization algorithms (BPE, WordPiece, and Unigram), we concluded that these vocabularies generally exhibit very low morphological quality. In this article, we apply this evaluation to the tokenizer of BETO, a BERT language model trained on large Spanish corpora. This evaluation, together with our previous results, led us to conclude that its vocabulary has low morphological quality, and we also found that training the tokenizer on a larger corpus does not improve the morphological quality of the generated vocabulary. Additionally, this evaluation helps clarify which algorithm the tokenizer actually uses, namely WordPiece, given the inconsistencies between the authors' claims and the model's configuration.
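The overlap-based evaluation can be sketched as a set comparison between a subword vocabulary and a morpheme inventory. The vocabulary, morpheme list, and the precision/recall framing below are simplified stand-ins for the paper's actual definitions of relevance, cohesion, and morphological accuracy:

```python
def morphological_overlap(vocab, morphemes):
    """Return (precision, recall): the fraction of vocabulary entries
    that are real morphemes, and the fraction of morphemes covered."""
    vocab, morphemes = set(vocab), set(morphemes)
    hits = vocab & morphemes
    precision = len(hits) / len(vocab)      # vocab entries that are morphemes
    recall = len(hits) / len(morphemes)     # morphemes found in the vocab
    return precision, recall

# Illustrative Spanish-like data (not from BETO's real vocabulary).
vocab = ["##ción", "##mente", "casa", "##xyz", "##qk"]
morphemes = ["ción", "mente", "casa", "s", "es"]

# Strip WordPiece continuation markers before comparing.
clean = [t.removeprefix("##") for t in vocab]
p, r = morphological_overlap(clean, morphemes)
```

With this toy data, three of the five vocabulary entries are genuine morphemes, giving a precision and recall of 0.6 each; the paper's measures are richer, but the intuition of comparing vocabularies against a gold morpheme list is the same.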

en cs.CL, cs.AI
arXiv Open Access 2024
Back to School: Translation Using Grammar Books

Jonathan Hus, Antonios Anastasopoulos

Machine translation systems for high resource languages perform exceptionally well and produce high quality translations. Unfortunately, the vast majority of languages are not considered high resource and lack the quantity of parallel sentences needed to train such systems. These under-represented languages are not without resources, however, and bilingual dictionaries and grammar books are available as linguistic reference material. With current large language models (LLMs) supporting near book-length contexts, we can begin to use the available material to ensure advancements are shared among all of the world's languages. In this paper, we demonstrate incorporating grammar books in the prompt of GPT-4 to improve machine translation, and evaluate the performance on 16 typologically diverse low-resource languages, using a combination of reference materials to show that the machine translation performance of LLMs can be improved using this method.
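The core idea, packing reference material into a long-context prompt, can be sketched as below. The prompt wording and example entries are invented for illustration, not the paper's actual template:

```python
def build_translation_prompt(grammar_excerpt, dictionary_entries, sentence):
    """Assemble a long-context prompt that places reference material
    (a grammar excerpt plus bilingual dictionary lines) before the request."""
    dictionary = "\n".join(f"{src} = {tgt}" for src, tgt in dictionary_entries)
    return (
        "You are translating into English using the reference material below.\n\n"
        f"--- Grammar book excerpt ---\n{grammar_excerpt}\n\n"
        f"--- Dictionary ---\n{dictionary}\n\n"
        f"Translate this sentence: {sentence}\n"
    )

prompt = build_translation_prompt(
    "Verbs are sentence-final; subjects precede objects.",
    [("kalli", "house"), ("siwatl", "woman")],
    "siwatl kalli",
)
```

A near book-length context window is what makes this viable: whole grammar chapters and dictionary sections can be concatenated into the prompt rather than excerpted.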

en cs.CL
arXiv Open Access 2024
Decoding Probing: Revealing Internal Linguistic Structures in Neural Language Models using Minimal Pairs

Linyang He, Peili Chen, Ercong Nie et al.

Inspired by cognitive neuroscience studies, we introduce a novel "decoding probing" method that uses the minimal pairs benchmark (BLiMP) to probe internal linguistic characteristics in neural language models layer by layer. By treating the language model as the "brain" and its representations as "neural activations", we decode grammaticality labels of minimal pairs from the intermediate layers' representations. This approach reveals: 1) Self-supervised language models capture abstract linguistic structures in intermediate layers that GloVe and RNN language models cannot learn. 2) Information about syntactic grammaticality is robustly captured within the first third of GPT-2's layers and is also distributed in later layers. As sentence complexity increases, more layers are required for learning grammatical capabilities. 3) Morphological and semantics/syntax interface-related features are harder to capture than syntax. 4) For Transformer-based models, both embeddings and attentions capture grammatical features, but with distinct patterns. Different attention heads exhibit similar tendencies toward various linguistic phenomena, but with varied contributions.
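The probing setup can be sketched as a linear decoder trained on per-layer representations. The synthetic vectors below stand in for real model activations on BLiMP minimal pairs; a real run would extract hidden states from the language model at each layer:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for hidden states at one layer: n sentences x d dimensions.
n, d = 200, 32
labels = rng.integers(0, 2, size=n)                   # 1 = grammatical, 0 = not
base = rng.normal(size=(n, d))                        # background noise
signal = np.outer(labels - 0.5, rng.normal(size=d))   # inject a linear cue
X = base + 2.0 * signal

# Decode grammaticality from the representations with a linear probe;
# held-out accuracy indicates how much grammatical information the layer holds.
probe = LogisticRegression(max_iter=1000).fit(X[:150], labels[:150])
accuracy = probe.score(X[150:], labels[150:])
```

Repeating this per layer and plotting accuracy against depth is what reveals where grammaticality information emerges in the network.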

en cs.CL, q-bio.NC
arXiv Open Access 2024
Small Language Models Also Work With Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas

Bastian Bunzeck, Daniel Duran, Leonie Schade et al.

Recent work investigates whether LMs learn human-like linguistic generalizations and representations from developmentally plausible amounts of data. Yet, the basic linguistic units processed in these LMs are determined by subword-based tokenization, which limits their validity as models of learning at and below the word level. In this paper, we explore the potential of tokenization-free, phoneme- and grapheme-based language models. We demonstrate that small models based on the Llama architecture can achieve strong linguistic performance on standard syntactic and novel lexical/phonetic benchmarks when trained with character-level vocabularies. We further show that phoneme-based models almost match grapheme-based models in standard tasks and novel evaluations. Our findings suggest a promising direction for creating more linguistically plausible language models that are better suited for computational studies of language acquisition and processing.

en cs.CL
arXiv Open Access 2024
Decomposed Prompting: Probing Multilingual Linguistic Structure Knowledge in Large Language Models

Ercong Nie, Shuzhou Yuan, Bolei Ma et al.

Probing the multilingual knowledge of linguistic structure in LLMs, often characterized as sequence labeling, faces challenges with maintaining output templates in current text-to-text prompting strategies. To solve this, we introduce a decomposed prompting approach for sequence labeling tasks. Diverging from the single text-to-text prompt, our prompt method generates for each token of the input sentence an individual prompt which asks for its linguistic label. We test our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages, using both English-centric and multilingual LLMs. Our findings show that decomposed prompting surpasses the iterative prompting baseline in efficacy and efficiency under zero- and few-shot settings. Moreover, our analysis of multilingual performance of English-centric LLMs yields insights into the transferability of linguistic knowledge via multilingual prompting.
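The decomposition step can be sketched as generating one label query per token. The prompt wording and the label set below are invented examples, not the paper's exact template:

```python
def decompose_prompts(sentence, label_set=("NOUN", "VERB", "DET", "ADJ")):
    """Build one prompt per token asking for that token's POS label,
    instead of a single prompt labeling the whole sequence."""
    prompts = []
    for token in sentence.split():
        prompts.append(
            f'Sentence: "{sentence}"\n'
            f'What is the part-of-speech tag of the word "{token}"? '
            f"Answer with one of: {', '.join(label_set)}."
        )
    return prompts

prompts = decompose_prompts("the cat sleeps")
```

Because each prompt asks for exactly one label, the model never has to reproduce a full output template, which is the failure mode the decomposed approach avoids.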

en cs.CL
S2 Open Access 2024
M. A. Castrénin kielioppien kansainvälinen tausta

Riku Erkkilä

In this paper I discuss the international influences on the grammars that M. A. Castrén produced. Matthias Alexander Castrén is undoubtedly one of the most prominent researchers in the history of Uralic studies. His work includes nine grammars (eight published and one unpublished) of some twenty languages. He can also be credited with launching the scientific study of the Uralic languages. However, Castrén did not produce his grammars in a vacuum; rather, he was well educated in the modern linguistic theories of his time. More specifically, Castrén implemented the ideas of comparative grammar used also by such renowned linguists as Rasmus Rask, Jacob Grimm, and Franz Bopp. This paper examines the influence that the linguistic ideas of the nineteenth century had on the descriptive choices Castrén made in his grammars. I examine the structure of Castrén's grammars and compare them to other grammars of the time which Castrén knew. The analysis is confined to a rather general level, discussing such phenomena as how parts of speech are described and how inflection is presented. The analysis shows that in many cases Castrén did not have one single model; rather, his work followed the general ideas of comparative grammar. This paper confirms that Uralistics has been an international field of study from its beginnings.

DOAJ Open Access 2023
Les partícules confirmatòries (com a question tags i com a respostes confirmatives) en català (amb especial referència al nord-occidental i al rossellonès)

Xavier Rull

Questions can be open-ended (waiting for an affirmative or negative answer) but they can also have a confirmatory interpretation (i.e. there is an assumption that may be confirmed). In the latter case, question tags (like oh) may appear in the questions. This paper reports all the question tags in Catalan and the precise syntactic contexts in which they occur (at the beginning or at the end; although in principle they can all occur in two syntactic contexts, not all of them appear everywhere). Special emphasis is placed on the north-western dialects.

Semantics, Comparative grammar
DOAJ Open Access 2022
Emmanuel Macron facing Covid-19: A president in search of image repair

Keren Sadoun-Kerber, Stéphane Wahnich

How can a political leader impose unusual demands in a democracy in the face of Covid-19 when he was already arousing a great deal of mistrust, long before this crisis, because of a distorted image that undermined his legitimacy and authority in the public sphere? This article identifies the discursive steps taken by the French President to repair his image in his first two addresses on the Coronavirus. It shows that his strategies of image repair are indirect and differ from those listed on the subject in scholarly literature. Emmanuel Macron's objective is to strengthen his weakened legitimacy and build his position of authority. These two concepts are clearly distinguished in the analysis proposed here and are important axes for the re-elaboration of the President's ethos, whose objective is to increase the confidence of the French in his leadership and in the measures announced to stem the pandemic.

Style. Composition. Rhetoric
arXiv Open Access 2022
Inferring Rewards from Language in Context

Jessy Lin, Daniel Fried, Dan Klein et al.

In classic instruction following, language like "I'd like the JetBlue flight" maps to actions (e.g., selecting that flight). However, language also conveys information about a user's underlying reward function (e.g., a general preference for JetBlue), which can allow a model to carry out desirable actions in new contexts. We present a model that infers rewards from language pragmatically: reasoning about how speakers choose utterances not only to elicit desired actions, but also to reveal information about their preferences. On a new interactive flight-booking task with natural language, our model more accurately infers rewards and predicts optimal actions in unseen environments, in comparison to past work that first maps language to actions (instruction following) and then maps actions to rewards (inverse reinforcement learning).

en cs.CL, cs.AI
DOAJ Open Access 2021
The strength of morphophonological schemas: Consonant mutations in Polish

Bartłomiej Czaplicki

Selective elimination of consonant mutations in Polish provides evidence supporting construction-based sublexicons and the morphophonological schemas extracted from them. Morphophonological schemas exhibit varying strength depending on their type frequency; they refer to morpheme-specific classes of segments, and their impact is continuously mediated by paradigm uniformity pressures. A low-frequency and a high-frequency pattern are analyzed: agent nouns in -ist-a/-yst-a and diminutives in -ek. Two kinds of frequency are important predictors of pattern modification: type frequency and token frequency. As for the impact of type frequency, the less frequent a pattern is in the lexicon, the more susceptible it is to modifications promoted by paradigm uniformity. Schemas are ranked based on the type frequency of the morphological patterns they encode and interact with paradigm uniformity constraints. Schemas representing less frequent patterns are outranked by paradigm uniformity constraints and are thus more likely to be modified than those representing more frequent patterns. Regarding token frequency, the greater stability of high-frequency words relative to comparable low-frequency words is linked to their strong representations and a constraint promoting the use of stored representations. The continuous effect of paradigm uniformity is further supported by the results of a nonce-word experiment.

Language. Linguistic theory. Comparative grammar
DOAJ Open Access 2021
Facing Anxiety through Ego Defense Mechanisms on The Walking Dead: Michonne Movie Game

Fery Setiawan, I Gusti Agung Sri Rwa Jayantini, Ida Bagus Gde Nova Winarta et al.

With the advancement of technology, a newly adopted form of literary work has emerged, known as a movie game. It is a game genre that resembles real movies while giving players choices that actively determine the story, the ending, and the characters' actions. In literary work, a character is one of the intrinsic elements that can be interestingly analyzed from a psychological perspective, including Ego Defense Mechanisms. This study focuses on how the protagonist in a movie game entitled The Walking Dead: Michonne faced her anxiety through her ego defense mechanisms. It aims to identify the types of ego defense mechanisms and how they help the protagonist cope with her anxiety. The discussion was based on psychoanalytic theory, namely the Ego Defense Mechanisms proposed by Sigmund Freud. This study elaborated the description of the protagonist's actions when she faced anxiety through a qualitative method. The analysis revealed that the protagonist adopted four types of Ego Defense Mechanisms: repression, sublimation, rationalization, and aggression. Aggression was the most frequently adopted mechanism, followed by sublimation and rationalization, with repression the least frequent. This finding indicated that the protagonist chose aggression to release resentment or dissatisfaction, especially when facing objects or other characters that create anxiety or frustration. It implies that the persistence of aggression, sublimation, rationalization, and repression could reflect the presence of continual threats in the environment from which people should survive.

Language. Linguistic theory. Comparative grammar
DOAJ Open Access 2021
Unbestimmte Subjekte: zur problematischen Äquivalenz von deutschem man und italienischem si

Peter Paschke

German sentences with man and Italian sentences with si impersonale or si passivante are often presented as equivalent in contrastive grammars. However, this functional equation proves to be problematic when Italian students refer with man to their own role as authors, as in: "Darauf wird man aber im folgenden Kapitel eingehen". Evidently, man cannot refer to the speaker role, while in the same context the Italian si is perfectly acceptable. Starting from this interference error, the paper examines the possible range of reference of the two pronouns. It turns out that the most common reading of man and si in both languages is the generic one, which can be paraphrased as "everyone". Systematic divergences, on the other hand, occur in the particular reading, i.e. when referring to single unspecified subjects. While the German man characterizes the subject as anonymous and never includes listeners or speakers (e.g. Gestern hat man bei uns eingebrochen; man ≈ 'jemand', 'somebody'), the Italian si, depending on the verb class (transitive, unergative, unaccusative, etc.), can or must be read as speaker-exclusive (Mi si è raccontato che ...; si ≈ 'qualcuno', 'someone') or as speaker-inclusive (Ieri si è andati al ristorante; si ≈ 'noi', 'we'). The speaker-inclusive reading also occurs when si is used in academic texts as a substitute for the established form of speaker (author) reference by means of the 1st person plural (noi, 'we'). In addition to man and si, other forms of indeterminate subjects are examined, namely the non-anaphoric uses of German "sie (pl.)" (Sie haben schon wieder die Preise erhöht.) and of the Italian 3rd person plural null subject (Ti hanno cercato.), as well as the so-called impersonal passive in German (Es wird gemurmelt.).

Computational linguistics. Natural language processing, Language. Linguistic theory. Comparative grammar

Page 20 of 221,767