Results for "Language. Linguistic theory. Comparative grammar"

arXiv Open Access 2026

LSR: Linguistic Safety Robustness Benchmark for Low-Resource West African Languages

Godwin Abuh Faruna

Safety alignment in large language models relies predominantly on English-language training data. When harmful intent is expressed in low-resource languages, refusal mechanisms that hold in English frequently fail to activate. We introduce LSR (Linguistic Safety Robustness), the first systematic benchmark for measuring cross-lingual refusal degradation in West African languages: Yoruba, Hausa, Igbo, and Igala. LSR uses a dual-probe evaluation protocol - submitting matched English and target-language probes to the same model - and introduces Refusal Centroid Drift (RCD), a metric that quantifies how much of a model's English refusal behavior is lost when harmful intent is encoded in a target language. We evaluate Gemini 2.5 Flash across 14 culturally grounded attack probes in four harm categories. English refusal rates hold at approximately 90 percent. Across West African languages, refusal rates fall to 35-55 percent, with Igala showing the most severe degradation (RCD = 0.55). LSR is implemented in the Inspect AI evaluation framework and is available as a PR-ready contribution to the UK AISI's inspect_evals repository. A live reference implementation and the benchmark dataset are publicly available.

en cs.CL, cs.AI

DOAJ Open Access 2025

Rethinking Collective Story

Piotr F. Piekutowski

This article offers a characterisation of the tender narrator, a concept proposed by the Polish writer Olga Tokarczuk that was central to her Nobel Prize lecture (2019). Three key elements of Tokarczuk’s project are identified: the bond of diegetic forms with the climate and environmental crisis of the Anthropocene; dynamically changing, fragmented collective and individual perspectives; and the titular narrative tenderness manifested in sensitivity to more-than-human voices, networks, and relations. On this basis, the idea is added to the repertoire of econarratological research and, more broadly, of non-anthropocentric narrative theories. To detail the manifestations of the fourth-person narrative, as the tender narrative is also called, this paper problematises spatiotemporal experiences using the example of Tokarczuk’s novel The Empusium (2024). In the analysis of how representations of time and space are mediated in the tender story, aspects such as interdependencies, despatialisation and fragmentation are brought to the fore.

Language. Linguistic theory. Comparative grammar, Style. Composition. Rhetoric

DOAJ Open Access 2025

¿Tiempo o evidencia? Valores del pretérito perfecto compuesto en el español de España y Ecuador [Tense or evidence? Values of the present perfect in the Spanish of Spain and Ecuador]

Clara Tellez-Perez

Tense is the basic notion for understanding how the Spanish verbal system works. However, certain uses specific to Ecuador have given rise to new theories based on evidentiality to explain, in particular, the value of the present perfect (pretérito perfecto compuesto, PPC) in that variety. Taking a psycholinguistic approach, this study gathers data through a survey on the production and perception of the PPC in Ecuador. This information is contrasted with data from the Peninsular variety, where past tenses are not linked to evidential readings. The results revealed statistically significant differences between the two varieties; however, except in mirative contexts, these tend to be slight, which calls into question the need to resort to evidential theories to explain how the Ecuadorian PPC works.

Romanic languages, Philology. Linguistics

arXiv Open Access 2025

TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models

Ioana Buhnila, Aman Sinha, Mathieu Constant

Large Language Models (LLMs) have proven efficient in giving definition-type answers to user input queries. While giving various types of answers, such as examples and paraphrases, is an easy task for humans, LLMs struggle to provide correct answers to queries other than the definition type. In this study, we evaluated this drop in performance using TrackList, a fine-grained linguistic and statistical analysis pipeline to investigate the impact of the pre-training data on LLMs' answers to diverse linguistic queries. We also introduce RefoMed-EN, an English dataset consisting of 6170 human-annotated medical terms alongside their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We studied whether the high frequency of a concept (head) or low frequency (tail) impacts the language model's performance. We evaluated the quality of the LLM's output using syntactic and semantic similarity metrics, statistical correlations and embeddings. Results showed that the LLM's task performance is highest for definition-type questions and lowest for the exemplification type. Additionally, we showed that for definition-type questions, large language models are prone to paraphrase more on popular and frequent knowledge and less on tail and technical knowledge, especially in the expert texts.

en cs.CL

arXiv Open Access 2025

Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages

Utkarsh Pathak, Chandra Sai Krishna Gunda, Anusha Prakash et al.

Text-to-speech (TTS) systems typically require high-quality studio data and accurate transcriptions for training. India has 1369 languages, 22 of them official, using 13 scripts. Training a TTS system for all these languages, most of which have no digital resources, seems a Herculean task. Our work focuses on zero-shot synthesis, particularly for languages whose scripts and phonotactics come from different families. The novelty of our work lies in augmenting a shared phone representation and modifying the text parsing rules to match the phonotactics of the target language, thus reducing the synthesiser overhead and enabling rapid adaptation. Intelligible and natural speech was generated for Sanskrit, Maharashtrian and Canara Konkani, Maithili and Kurukh by leveraging linguistic connections across languages with suitable synthesisers. Evaluations confirm the effectiveness of this approach, highlighting its potential to expand speech technology access for under-represented languages.

en cs.CL, cs.CV

arXiv Open Access 2025

Exploring Performance Variations in Finetuned Translators of Ultra-Low Resource Languages: Do Linguistic Differences Matter?

Isabel Gonçalves, Paulo Cavalin, Claudio Pinhanez

Finetuning pre-trained language models with small amounts of data is a commonly-used method to create translators for ultra-low resource languages such as endangered Indigenous languages. However, previous works have reported substantially different performances with translators created using similar methodology and data. In this work we systematically explored possible causes of the performance difference, aiming to determine whether it was a product of different cleaning procedures, limitations of the pre-trained models, the size of the base model, or the size of the training dataset, studying both directions of translation. Our studies, using two Brazilian Indigenous languages, related but with significant structural linguistic characteristics, indicated none or very limited influence from those training factors, suggesting differences between languages may play a significant role in the ability to produce translators by fine-tuning pre-trained models.

en cs.CL, cs.AI

arXiv Open Access 2024

3D-LEX v1.0: 3D Lexicons for American Sign Language and Sign Language of the Netherlands

Oline Ranum, Gomer Otterspeer, Jari I. Andersen et al.

In this work, we present an efficient approach for capturing sign language in 3D, introduce the 3D-LEX v1.0 dataset, and detail a method for semi-automatic annotation of phonetic properties. Our procedure integrates three motion capture techniques encompassing high-resolution 3D poses, 3D handshapes, and depth-aware facial features, and attains an average sampling rate of one sign every 10 seconds. This includes the time for presenting a sign example, performing and recording the sign, and archiving the capture. The 3D-LEX dataset includes 1,000 signs from American Sign Language and an additional 1,000 signs from the Sign Language of the Netherlands. We showcase the dataset utility by presenting a simple method for generating handshape annotations directly from 3D-LEX. We produce handshape labels for 1,000 signs from American Sign Language and evaluate the labels in a sign recognition task. The labels enhance gloss recognition accuracy by 5% over using no handshape annotations, and by 1% over expert annotations. Our motion capture data supports in-depth analysis of sign features and facilitates the generation of 2D projections from any viewpoint. The 3D-LEX collection has been aligned with existing sign language benchmarks and linguistic resources, to support studies in 3D-aware sign language processing.

en cs.CV, cs.AI

arXiv Open Access 2024

NAVCON: A Cognitively Inspired and Linguistically Grounded Corpus for Vision and Language Navigation

Karan Wanchoo, Xiaoye Zuo, Hannah Gonzalez et al.

We present NAVCON, a large-scale annotated Vision-Language Navigation (VLN) corpus built on top of two popular datasets (R2R and RxR). The paper introduces four core, cognitively motivated and linguistically grounded navigation concepts and an algorithm for generating large-scale silver annotations of naturally occurring linguistic realizations of these concepts in navigation instructions. We pair the annotated instructions with video clips of an agent acting on these instructions. NAVCON contains 236,316 concept annotations for approximately 30,000 instructions and 2.7 million aligned images (from approximately 19,000 instructions) showing what the agent sees when executing an instruction. To our knowledge, this is the first comprehensive resource of navigation concepts. We evaluated the quality of the silver annotations by conducting human evaluation studies on NAVCON samples. As further validation of the quality and usefulness of the resource, we trained a model for detecting navigation concepts and their linguistic realizations in unseen instructions. Additionally, we show that few-shot learning with GPT-4o performs well on this task using large-scale silver annotations of NAVCON.

en cs.CL, cs.CV

arXiv Open Access 2024

A Systematic Assessment of Language Models with Linguistic Minimal Pairs in Chinese

Yikang Liu, Yeting Shen, Hongao Zhu et al.

We present ZhoBLiMP, the largest linguistic minimal pair benchmark for Chinese, with over 100 paradigms, ranging from topicalization to the Ba construction. We then train from scratch a suite of Chinese language models (LMs) with different tokenizers, parameter sizes, and token volumes, to study the learning curves of LMs on Chinese. To mitigate the biases introduced by unequal lengths of the sentences in a minimal pair, we propose a new metric named sub-linear length normalized log-probabilities (SLLN-LP). Using SLLN-LP as the metric, our results show that Anaphor, Quantifiers, and Ellipsis in Chinese are difficult for LMs even up to 32B parameters, and that SLLN-LP successfully mitigates biases in ZhoBLiMP, JBLiMP and BLiMP. We conclude that future evaluations should be more carefully designed to consider the intricate relations between linking functions, LMs, and targeted minimal pairs.
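
The length bias mentioned in this abstract can be illustrated with a small sketch. The abstract does not give the formula for SLLN-LP, so the code below only assumes one plausible reading of "sub-linear length normalized": the summed token log-probability is divided by the sentence length raised to an exponent alpha between 0 and 1. The function names, the alpha parameter and the toy log-probabilities are all hypothetical and are not taken from the paper.

```python
def slln_score(token_logprobs, alpha=0.5):
    """Summed log-probability divided by length ** alpha.

    Assumption (not from the paper): alpha = 0 gives the raw sum,
    alpha = 1 gives the per-token mean, and 0 < alpha < 1 is a
    sub-linear normalization in between.
    """
    total = sum(token_logprobs)
    return total / (len(token_logprobs) ** alpha)


def prefers_grammatical(good_logprobs, bad_logprobs, alpha=0.5):
    """True if the grammatical sentence of a minimal pair scores higher."""
    return slln_score(good_logprobs, alpha) > slln_score(bad_logprobs, alpha)


# Toy minimal pair: the grammatical sentence is one token longer,
# so its raw summed log-probability is lower despite better tokens.
good = [-1.0, -1.0, -1.0, -1.0]
bad = [-1.0, -1.0, -1.5]

raw_choice = prefers_grammatical(good, bad, alpha=0.0)
normalized_choice = prefers_grammatical(good, bad, alpha=0.5)
```

In this toy pair the raw sum favours the shorter, ungrammatical sentence, while the normalized score favours the grammatical one, which is the kind of length effect the abstract says the metric is meant to reduce.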

en cs.CL

arXiv Open Access 2023

MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages

Sayan Mahapatra, Debtanu Datta, Shubham Soni et al.

Most legal text in the Indian judiciary is written in complex English due to historical reasons. However, only a small fraction of the Indian population is comfortable in reading English. Hence legal text needs to be made available in various Indian languages, possibly by translating the available legal text from English. Though there has been a lot of research on translation to and between Indian languages, to our knowledge, there has not been much prior work on such translation in the legal domain. In this work, we construct the first high-quality legal parallel corpus containing aligned text units in English and nine Indian languages, that includes several low-resource languages. We also benchmark the performance of a wide variety of Machine Translation (MT) systems over this corpus, including commercial MT systems, open-source MT systems and Large Language Models. Through a comprehensive survey by Law practitioners, we check how satisfied they are with the translations by some of these MT systems, and how well automatic MT evaluation metrics agree with the opinions of Law practitioners.

en cs.CL, cs.AI

arXiv Open Access 2023

PB-LLM: Partially Binarized Large Language Models

Yuzhang Shang, Zhihang Yuan, Qiang Wu et al.

This paper explores network binarization, a radical form of quantization that compresses model weights to a single bit, specifically for Large Language Model (LLM) compression. Because previous binarization methods cause LLMs to collapse, we propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs. Specifically, our exploration first uncovers the ineffectiveness of naive applications of existing binarization algorithms and highlights the imperative role of salient weights in achieving low-bit quantization. Thus, PB-LLM filters a small ratio of salient weights during binarization, allocating them to higher-bit storage, i.e., partial binarization. PB-LLM is extended to recover the capacities of quantized LLMs, by analyzing from the perspective of post-training quantization (PTQ) and quantization-aware training (QAT). Under PTQ, combining the concepts from GPTQ, we reconstruct the binarized weight matrix guided by the Hessian matrix and successfully recover the reasoning capacity of PB-LLM in low-bit. Under QAT, we freeze the salient weights during training, explore the derivation of optimal scaling factors crucial for minimizing the quantization error, and propose a scaling mechanism based on this derived scaling strategy for residual binarized weights. These explorations and the developed methodologies significantly contribute to rejuvenating the performance of low-bit quantized LLMs and present substantial advancements in the field of network binarization for LLMs. The code is available at https://github.com/hahnyuan/BinaryLLM.
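
The idea of partial binarization described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation (their repository is linked above). It assumes that salience is judged by weight magnitude and that the remaining weights are mapped to +alpha or -alpha, with alpha set to their mean absolute value; the function name partially_binarize and the salient_ratio parameter are hypothetical.

```python
def partially_binarize(weights, salient_ratio=0.1):
    """Return (quantized_weights, salient_mask) for a flat list of floats.

    Assumption for illustration: the largest-magnitude weights are the
    salient ones and are kept unchanged; all others are binarized.
    """
    n = len(weights)
    k = int(n * salient_ratio)  # number of weights kept at full precision
    order = sorted(range(n), key=lambda i: abs(weights[i]), reverse=True)
    salient = set(order[:k])

    # For a fixed sign pattern, the mean absolute value is the scale that
    # minimizes the squared error between w and alpha * sign(w).
    rest = [abs(weights[i]) for i in range(n) if i not in salient]
    alpha = sum(rest) / len(rest) if rest else 0.0

    quantized = [
        w if i in salient else (alpha if w >= 0 else -alpha)
        for i, w in enumerate(weights)
    ]
    mask = [i in salient for i in range(n)]
    return quantized, mask


# Toy example: one outlier weight is kept, the other four become +/- alpha.
example_weights = [0.1, -0.2, 3.0, 0.3, -0.4]
quantized, mask = partially_binarize(example_weights, salient_ratio=0.2)
```

Keeping the outlier at full precision stops it from inflating the shared scale alpha, which is one intuition for why a small set of salient weights matters at very low bit widths.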

en cs.LG, cs.AI

S2 Open Access 2022

Supporting the academic staff of the internationalised university: a project at the Department of Management

Elena Borsetto

ABSTRACT One of the consequences of the growing number of English-taught programmes offered by European universities is the top-down request that lecturers switch from their L1 to English, for teaching, while the administrative staff is required to communicate with international students. To help the academic staff cope with the change, some universities have provided them with training and support. In Italy, a project ran from 2015 to 2018 at the Department of Management of a university in the northern part of the country, and consisted of a support service in which the staff could arrange appointments with a language tutor, to discuss issues related to their use of English in academia. During one-on-one sessions, fieldnotes were kept about the participants' difficulties and needs. Through the lens of Grounded Theory, data were coded and grouped into categories, according to their types and frequency. The findings revealed that problems mainly concerned vocabulary, pronunciation, and grammar. The project is briefly presented, followed by a description of the linguistic and communicative problems most found. This study provides insights into the everyday language practices of the academic staff, since it identified the problems solved through the support service and those to be tackled in the future.

1 citation en

DOAJ Open Access 2022

Exploring factors affecting peer-led team learning in EFL classes: A case of secondary schools in Ethiopia

Mastewal Misganaw Alemayehu

Peer-led team learning (PLTL) has become common in ESL classrooms across Ethiopia. This study explores factors affecting students' verbal participation in PLTL in English as a Foreign Language (EFL) classes. A descriptive survey was employed as the research method, and mixed data collection methods were used. Twenty-four EFL teachers and 114 students of three secondary schools in Ethiopia were selected as research participants by systematic random sampling. The data collected from questionnaires, interviews, and classroom observation were analyzed qualitatively and quantitatively using frequencies, percentages, ANOVA and multiple regression. The findings indicated that students differ significantly in their level of verbal participation in PLTL groups. Of the twenty-two expected factors, no single factor predicted whether students would participate in PLTL groups. More than one factor was usually working together, or one factor led to another, to affect students' participation. Personality characteristics, motivational factors, and group situation factors were significant to student participation in PLTL. Not every student had the opportunity to become a group leader, and the groups were static. Since there was an absence of active monitoring, most groups drifted away from tasks and were involved in noisy chat in their mother tongue. A few students in a group dominated others who persevered at group activities. The qualitative findings are consistent with the quantitative ones.

Special aspects of education, Language. Linguistic theory. Comparative grammar

DOAJ Open Access 2022

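Performances de gênero em obras televisivas de inspiração histórica: um estudo de O Quinto dos Infernos e Novo Mundo [Gender performances in historically inspired television works: a study of O Quinto dos Infernos and Novo Mundo]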

Jarlene Rodrigues Reis, Denise Costa Oliveira Siqueira

Considering the recurrence and prominence of Dom Pedro I and Dona Leopoldina in historically inspired media productions, this article aims to reflect on the gender performances of these characters in two works of television fiction: the miniseries O Quinto dos Infernos (2002) and the telenovela Novo Mundo (2017). Readings on media culture, narratives and gender performances provided clues for a critical analysis of the roles played in the two works, which were studied using a narrative communication analysis methodology. It was observed that media representations of the couple are constructed according to logics that have been renewed in Brazil since colonial times, perpetuating old forms of male, white and heterosexual domination and hegemony.

Discourse analysis

DOAJ Open Access 2022

No nie mów! Słownik pragmatemów?! Rozważania nad leksykograficznym opisem wyrażeń pragmatycznych w ujęciu kontrastywnym (na przykładzie polskich ekwiwalentów francuskiego pragmatemu Tu m’en diras tant!)

Michał Hrabia

Considerations on the lexicographic description of pragmatemes in a contrastive approach (based on the example of Polish equivalents of the French pragmateme Tu m’en diras tant!). This article deals with the lexicographic description of pragmatemes in a contrastive approach. In the first part, the author briefly reviews definitions of the denomination “pragmateme” in Polish and French-language linguistic literature, finally opting for a prototypical understanding of this term. The following chapters are devoted to presenting the concept of the projected multilingual dictionary of pragmatemes. Having outlined the main principles of the dictionary’s macro- and microstructure, the author proceeds to a detailed discussion of the dictionary entries for two Polish equivalents of the French pragmatic formula Tu m’en diras tant!: Co ty nie powiesz! and No nie mów!.

Language. Linguistic theory. Comparative grammar

S2 Open Access 2022

Department of Slavic Languages and Literatures

Bożena Shallcross

en

S2 Open Access 2022

Типологічна характеристика логічної сегментації речень української та англійської мови [Typological characterisation of the logical segmentation of sentences in Ukrainian and English]

А. В. Пестушко

This article is a comparative study of the means of expressing logical segmentation in English and Ukrainian. The topic is motivated by the fact that the study of the functional perspective of the sentence is now a key stage in the formation of a communicative and functional grammar of any language. The article analyzes the general theoretical foundations of the notion of “segmentation” and characterizes the main features of sentence segmentation in linguistics. It traces the main stages of scientific thought in comparative linguistics, examines the concepts of “segmentation” and “logical segmentation”, and analyzes the principles of logical sentence segmentation in the two languages. The available data on the problem of segmentation are analyzed systematically, and the means of expressing segmentation in both languages are identified and analyzed. The main differences in sentence segmentation between Ukrainian and English are identified. Pedagogical, psychological and linguistic knowledge bearing on the research problem is analyzed comprehensively. The mechanism of segmentation, one of the special mechanisms that support all aspects of communication, is described. Real links in a sentence are analyzed and described. Knowledge about the peculiarities of segmentation in English-language text is identified, summarized and systematized. A comparative-typological description of the two languages is carried out. Based on these theories, the main problems of sentence segmentation in English are highlighted. Through comparison with English, the syntactic, grammatical and lexical ways of separating the topic and the rheme are described in detail from a comparative perspective.
In the course of the study, the universal mechanisms of sentence segmentation in English and Ukrainian are revealed, as well as the areas that are specific to either English or Ukrainian. The article considers segmentation from several angles, including the history of the development of segmented structures as expressive units of literary language and the types of segmentation.

en

DOAJ Open Access 2021

Pronominal right-dislocation in Norwegian

Elena Karagjosova, Kaja Borthen

The goal of the paper is to propose a holistic analysis of the discourse properties and the interpretational effects of pronominal right-dislocation in Norwegian. Previous research has suggested that this is a topic construction, and it has been shown that the right-dislocated pronoun may affect reference assignment, is sometimes used in cases of discourse breaks, is associated with contrastiveness, and may lead to interpretational effects such as “emphasis” and “mitigation”. Based on Norwegian authentic corpus material, Givón’s (1983a) notion of marked constructions, and Sperber and Wilson’s (1986/1995) relevance theory, we present a novel analysis that connects the various properties of the construction together. A central aspect of our analysis is the assumption that marked constructions increase the accessibility of contrastive interpretations, which in turn may trigger the derivation of certain types of implicatures. Since the analysis is mainly based on assumptions about human cognition, the study makes cross-linguistic predictions despite its focus on one language.

Language. Linguistic theory. Comparative grammar

DOAJ Open Access 2021

A good servant but a bad master: Finnish Translators’ Perceptions on Translation Technology

Leena Salmi

This paper reports the results of a survey measuring Finnish professional translators’ (N=450) perceptions of translation technology. The paper also introduces a tool that can be used to study the perception of such changes, the NEMP model (New Educational Models or Paradigms). The main results indicate that professional translators perceive translation technology as an efficient tool that has changed translating in practice but does not encourage creativity. Business translators seem to have a more positive perception of translation technology than other respondents (literary or audiovisual translators). As for the method, the results indicate that the NEMP model is useful for studying the perception of translation technology.

Translating and interpreting

arXiv Open Access 2021

Ad Text Classification with Transformer-Based Natural Language Processing Methods

Umut Özdil, Büşra Arslan, D. Emre Taşar et al.

In this study, a natural language processing-based (NLP-based) method is proposed for the sector-wise automatic classification of ad texts created on online advertising platforms. Our data set consists of approximately 21,000 labeled advertising texts from 12 different sectors. The study uses the Bidirectional Encoder Representations from Transformers (BERT) model, a transformer-based language model that has recently been used for tasks such as text classification in the natural language processing literature. The classification performance obtained using a pre-trained BERT model for the Turkish language is shown in detail.

en cs.CL
