Results for "Philology. Linguistics"

arXiv Open Access 2026

Linguistic Similarity Within Centralized FLOSS Development

Matthew Gaughan, Aaron Shaw, Darren Gergle

When free/libre and open source software (FLOSS) stewards centralize project development, they potentially undermine project sustainability and impact how contributors talk to each other. To study the relationship between steward-centralized development and contributor discussion, we compared the development of three Wikimedia platform features that the Wikimedia Foundation (WMF) built in MediaWiki. In a mixed-methods multi-case comparison, we used repository mining, linguistic style features, and principal component analysis to track MediaWiki feature development and issue discussions. Contrary to both our intuition and prior work, there were no identifiable differences in the linguistic style of WMF-affiliates and external contributors, even when feature development was guided by WMF contributions. From these results, we offer two provocations to the study of collaborative FLOSS development: (1) stewards dominate development according to their own use of specific project functionality; (2) centralized project development does not entail hierarchical language within project discussions.

en cs.SE, cs.HC

arXiv Open Access 2025

Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean

SungHo Kim, Nayeon Kim, Taehee Jeon et al.

We introduce the Korean Grammar Evaluation BenchMark (KoGEM), designed to assess the linguistic competence of LLMs and humans in Korean. KoGEM consists of 1.5k multiple-choice QA pairs covering five main categories and 16 subcategories. The zero-shot evaluation of 27 LLMs of various sizes and types reveals that while LLMs perform remarkably well on straightforward tasks requiring primarily definitional knowledge, they struggle with tasks that demand the integration of real-world experiential knowledge, such as phonological rules and pronunciation. Furthermore, our in-depth analysis suggests that incorporating such experiential knowledge could enhance the linguistic competence of LLMs. With KoGEM, we not only highlight the limitations of current LLMs in linguistic competence but also uncover hidden facets of LLMs in linguistic competence, paving the way for enhancing comprehensive language understanding. Our code and dataset are available at: https://github.com/SungHo3268/KoGEM.

en cs.CL, cs.AI

arXiv Open Access 2025

Linguistically Informed Tokenization Improves ASR for Underresourced Languages

Massimo Daul, Alessio Tosolini, Claire Bowern

Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems use data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec2 ASR model on Yan-nhangu, a dormant Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR's viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can work for underresourced languages.
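
The abstract above reports results as word error rate (WER) and character error rate (CER). As a minimal sketch of how these two metrics are conventionally computed (edit distance divided by reference length), the following may help; the function names and the example strings are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference item
                curr[j - 1] + 1,         # insertion of a hypothesis item
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Because both metrics normalize by the reference length, a hypothesis with many insertions can score above 1.0; lower values are better for both.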

en cs.CL

arXiv Open Access 2025

A Computational Approach to Analyzing Language Change and Variation in the Constructed Language Toki Pona

Daniel Huang, Hyoun-A Joo

This study explores language change and variation in Toki Pona, a constructed language with approximately 120 core words. Taking a computational and corpus-based approach, the study examines features including fluid word classes and transitivity in order to examine (1) changes in preferences of content words for different syntactic positions over time and (2) variation in usage across different corpora. The results suggest that sociolinguistic factors influence Toki Pona in the same way as natural languages, and that even constructed linguistic systems naturally evolve as communities use them.

en cs.CL

arXiv Open Access 2025

Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles

Antara Raaghavi Bhattacharya, Isabel Papadimitriou, Kathryn Davidson et al.

Across languages, numeral systems vary widely in how they construct and combine numbers. While humans consistently learn to navigate this diversity, large language models (LLMs) struggle with linguistic-mathematical puzzles involving cross-linguistic numeral systems, which humans can learn to solve successfully. We investigate why this task is difficult for LLMs through a series of experiments that untangle the linguistic and mathematical aspects of numbers in language. Our experiments establish that models cannot consistently solve such problems unless the mathematical operations in the problems are explicitly marked using known symbols ($+$, $\times$, etc., as in "twenty + three"). In further ablation studies, we probe how individual parameters of numeral construction and combination affect performance. While humans use their linguistic understanding of numbers to make inferences about the implicit compositional structure of numerals, LLMs seem to lack this notion of implicit numeral structure. We conclude that the ability to flexibly infer compositional rules from implicit patterns in human-scale data remains an open challenge for current reasoning models.

en cs.CL, cs.AI

arXiv Open Access 2025

I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations

Julia Kharchenko, Tanya Roosta, Aman Chadha et al.

This paper introduces a comprehensive benchmark for evaluating how Large Language Models (LLMs) respond to linguistic shibboleths: subtle linguistic markers that can inadvertently reveal demographic attributes such as gender, social class, or regional background. Through carefully constructed interview simulations using 100 validated question-response pairs, we demonstrate how LLMs systematically penalize certain linguistic patterns, particularly hedging language, despite equivalent content quality. Our benchmark generates controlled linguistic variations that isolate specific phenomena while maintaining semantic equivalence, which enables the precise measurement of demographic bias in automated evaluation systems. We validate our approach along multiple linguistic dimensions, showing that hedged responses receive 25.6% lower ratings on average, and demonstrate the benchmark's effectiveness in identifying model-specific biases. This work establishes a foundational framework for detecting and measuring linguistic discrimination in AI systems, with broad applications to fairness in automated decision-making contexts.

en cs.CL

DOAJ Open Access 2025

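Texto, discurso e gramática no ensino de português como língua estrangeira: um caminho didático para uma abordagem articulada [Text, discourse, and grammar in teaching Portuguese as a foreign language: a didactic path toward an articulated approach]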

Florencia Miranda

The aim of this paper is to present a proposal for articulating text, discourse, and grammar as objects of language teaching. Specifically, we propose a possible didactic path for an articulated approach to these objects in the teaching of Portuguese as a foreign language to Spanish-speaking students. We also comment on some effects of this intervention. The work adopts the theoretical-epistemological framework of Sociodiscursive Interactionism (Bronckart 1997, 2019), addressing its didactic strand (Dolz et al. 2009) and drawing on the work of Bronckart (2016) and Bulea Bronckart (2015) on the problem of grammar teaching. To this end, we present partial results of research carried out within a project that applied a didactic intervention proposal in a postgraduate course. Within this scope, we analyze data from a pedagogical corpus that includes, on the one hand, the teaching materials used in class to teach textual genres centered on “personal presentation” (biodata and academic contact e-mail) and, on the other hand, the students' written productions collected during the 2020, 2021, 2022, and 2024 academic cycles. The corpus also includes student comments gathered through surveys.

Philology. Linguistics

DOAJ Open Access 2025

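Nářečí nejzápadnějších rusínských enkláv v kontaktu se slovenštinou a polštinou. Fonologie a morfologie [Dialects of the westernmost Rusyn enclaves in contact with Slovak and Polish. Phonology and morphology]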

Michal Vašíček

The goal of the research is to contribute to the current knowledge of historical Slavic dialectology and to the understanding of the nature of language contact between dialects of related languages in the Carpathians. The article is devoted to the dialects of the westernmost East Slavic language islands on the territory of Slovakia. This concerns the dialects of the municipalities of Osturňa and Ihľany in the north of Spiš and a group of southern Spiš enclaves, which includes the municipalities of Závadka, Poráč, Slovinky, Helcmanovce, Úhorná, Pača, and Hačava. According to the traditional classification, these dialects belong to the Southern Lemko group of south-western Ukrainian dialects. In many features, they match the neighbouring dialects of Slovak and Polish, with which they have been in contact for centuries. A high degree of contact-induced West Slavic phenomena is typical for the entire Lemko area, but it is evident from the professional literature and recent field research that the dialects of the language islands were exposed to the strongest foreign influence. The present study focuses on the phonological and morphological level of the selected dialects and tries to determine which changes in the dialect systems were caused by contact with West Slavic dialects. The basic research method is a comparison of the dialect systems of the mentioned municipalities with the systems of the neighbouring East Slovak and Lesser Poland Goral dialects. We have come to the conclusion that the nature of language contact in different municipalities differed. While in most localities it was a matter of gradual long-term contact, in the municipalities of Osturňa and Slovinky the mixed settlement had a decisive influence on the development of their dialects.

Philology. Linguistics

arXiv Open Access 2024

A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets

Tanja Samardzic, Ximena Gutierrez, Christian Bentz et al.

Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP. Linguistic diversity of these data sets is typically measured as the number of languages or language families included in the sample, but such measures do not consider structural properties of the included languages. In this paper, we propose assessing linguistic diversity of a data set against a reference language sample as a means of maximising linguistic diversity in the long run. We represent languages as sets of features and apply a version of the Jaccard index suitable for comparing sets of measures. In addition to the features extracted from typological data bases, we propose an automatic text-based measure, which can be used as a means of overcoming the well-known problem of data sparsity in manually collected features. Our diversity score is interpretable in terms of linguistic features and can identify the types of languages that are not represented in a data set. Using our method, we analyse a range of popular multilingual data sets (UD, Bible100, mBERT, XTREME, XGLUE, XNLI, XCOPA, TyDiQA, XQuAD). In addition to ranking these data sets, we find, for example, that (poly)synthetic languages are missing in almost all of them.
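
The abstract above mentions a version of the Jaccard index adapted to sets of measures rather than plain sets. The paper's exact formulation is not given here, so the following is only a sketch of one common generalization, the weighted (min/max) Jaccard index, under the assumption that each language sample is summarized as a mapping from feature name to a non-negative value. The function name and the feature names in the usage note are hypothetical.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity between two feature-to-value mappings.

    Sums the element-wise minima and divides by the sum of element-wise maxima.
    Features missing from one mapping count as 0. Values are assumed non-negative.
    With 0/1 values this reduces to the ordinary set-based Jaccard index.
    """
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    denominator = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    # Two empty (or all-zero) mappings are treated as identical by convention.
    return numerator / denominator if denominator else 1.0
```

For instance, comparing a hypothetical data set profile `{"isolating": 1, "synthetic": 2}` with a reference profile `{"isolating": 1, "synthetic": 1}` gives (1 + 1) / (1 + 2), roughly 0.67; a score of 1.0 would mean the two profiles coincide.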

en cs.CL

arXiv Open Access 2024

MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering

Ruturaj Ghatage, Aditya Kulkarni, Rajlaxmi Patil et al.

Question-answering systems have revolutionized information retrieval, but linguistic and cultural boundaries limit their widespread accessibility. This research endeavors to address the absence of efficient QnA datasets in low-resource languages by translating the English Question Answering Dataset (SQuAD) using a robust data curation approach. We introduce MahaSQuAD, the first-ever full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples. We also present a gold test set of 500 manually verified examples. Challenges in maintaining context and handling linguistic nuances are addressed, ensuring accurate translations. Moreover, as a QnA dataset cannot be simply converted into any low-resource language using translation, we need a robust method to map the answer translation to its span in the translated passage. Hence, to address this challenge, we also present a generic approach for translating SQuAD into any low-resource language. Thus, we offer a scalable approach to bridge linguistic and cultural gaps present in low-resource languages, in the realm of question-answering systems. The datasets and models are shared publicly at https://github.com/l3cube-pune/MarathiNLP.

en cs.CL, cs.LG

arXiv Open Access 2023

SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?

Takanori Ashihara, Takafumi Moriya, Kohei Matsuura et al.

Self-supervised learning (SSL) for speech representation has been successfully applied in various downstream tasks, such as speech and speaker recognition. More recently, speech SSL models have also been shown to be beneficial in advancing spoken language understanding tasks, implying that the SSL models have the potential to learn not only acoustic but also linguistic information. In this paper, we aim to clarify if speech SSL techniques can well capture linguistic knowledge. For this purpose, we introduce SpeechGLUE, a speech version of the General Language Understanding Evaluation (GLUE) benchmark. Since GLUE comprises a variety of natural language understanding tasks, SpeechGLUE can elucidate the degree of linguistic ability of speech SSL models. Experiments demonstrate that speech SSL models, although inferior to text-based SSL models, perform better than baselines, suggesting that they can acquire a certain amount of general linguistic knowledge from just unlabeled speech data.

en cs.CL, cs.SD

arXiv Open Access 2023

A Cross-Linguistic Pressure for Uniform Information Density in Word Order

Thomas Hikaru Clark, Clara Meister, Tiago Pimentel et al.

While natural languages differ widely in both canonical word order and word order flexibility, their word orders still follow shared cross-linguistic statistical patterns, often attributed to functional pressures. In the effort to identify these pressures, prior work has compared real and counterfactual word orders. Yet one functional pressure has been overlooked in such investigations: the uniform information density (UID) hypothesis, which holds that information should be spread evenly throughout an utterance. Here, we ask whether a pressure for UID may have influenced word order patterns cross-linguistically. To this end, we use computational models to test whether real orders lead to greater information uniformity than counterfactual orders. In our empirical study of 10 typologically diverse languages, we find that: (i) among SVO languages, real word orders consistently have greater uniformity than reverse word orders, and (ii) only linguistically implausible counterfactual orders consistently exceed the uniformity of real orders. These findings are compatible with a pressure for information uniformity in the development and usage of natural languages.
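
The uniform information density hypothesis described above can be made concrete with a small sketch. One common operationalization, assumed here and not necessarily the measure used by the authors, scores an utterance by the variance of its per-token surprisal: lower variance means information is spread more evenly. The function names are hypothetical, and the token probabilities would in practice come from a language model.

```python
import math


def surprisals(token_probs):
    """Per-token surprisal in bits, given each token's conditional probability."""
    return [-math.log2(p) for p in token_probs]


def surprisal_variance(token_probs):
    """Population variance of per-token surprisal; 0.0 means perfectly uniform information."""
    s = surprisals(token_probs)
    if not s:
        raise ValueError("need at least one token probability")
    mean = sum(s) / len(s)
    return sum((x - mean) ** 2 for x in s) / len(s)
```

Under this assumed measure, comparing a real word order with a counterfactual one amounts to computing `surprisal_variance` for each ordering under the same model and checking which value is lower.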

en cs.CL

arXiv Open Access 2023

Linguistic representations for fewer-shot relation extraction across domains

Sireesh Gururaja, Ritam Dutt, Tinglong Liao et al.

Recent work has demonstrated the positive impact of incorporating linguistic representations as additional context and scaffolding on the in-domain performance of several NLP tasks. We extend this work by exploring the impact of linguistic representations on cross-domain performance in a few-shot transfer setting. An important question is whether linguistic representations enhance generalizability by providing features that function as cross-domain pivots. We focus on the task of relation extraction on three datasets of procedural text in two domains, cooking and materials science. Our approach augments a popular transformer-based architecture by alternately incorporating syntactic and semantic graphs constructed by freely available off-the-shelf tools. We examine their utility for enhancing generalization, and investigate whether earlier findings, e.g. that semantic representations can be more helpful than syntactic ones, extend to relation extraction in multiple domains. We find that while the inclusion of these graphs results in significantly higher performance in few-shot transfer, both types of graph exhibit roughly equivalent utility.

en cs.CL

DOAJ Open Access 2023

Brandes in the Czech-speaking World: Dismissed as a Critic, Embraced as a Name

Helena Březinová, Jana Lainto

In our contribution, we explore the Czech-speaking discourse related to Georg Brandes in the Bohemian Lands in the late nineteenth and the early twentieth century, which means before and shortly after Czechs gained their independence from Austria-Hungary in 1918. Our research of archival sources, especially periodicals and private letters, enables us to confidently claim that the impact of Brandes’s criticism on the Czech arts was rather insignificant. At the same time, the sources give a clear picture that the Czech-speaking intelligentsia were interested in using Brandes’s symbolic capital to promote their struggle for Czech cultural autonomy. Thus, it was not Brandes’s works that can be considered influential in the Czech context but his persona. This strategy of using Brandes’s symbolic capital mirrors his own efforts to be viewed as an international intermediary. Finally, we explore the East-West dynamics in Brandes’s relationship with Czechs and vice versa, and here, we identify a considerable asymmetry.

Philology. Linguistics

arXiv Open Access 2022

Towards Afrocentric NLP for African Languages: Where We Are and Where We Can Go

Ife Adebara, Muhammad Abdul-Mageed

Aligning with the ACL 2022 special theme on "Language Diversity: from Low Resource to Endangered Languages", we discuss the major linguistic and sociopolitical challenges facing development of NLP technologies for African languages. Situating African languages in a typological framework, we discuss how the particulars of these languages can be harnessed. To facilitate future research, we also highlight current efforts, communities, venues, datasets, and tools. Our main objective is to motivate and advocate for an Afrocentric approach to technology development. With this in mind, we recommend what technologies to build and how to build, evaluate, and deploy them based on the needs of local African communities.

en cs.CL, cs.AI

arXiv Open Access 2022

LingMess: Linguistically Informed Multi Expert Scorers for Coreference Resolution

Shon Otmazgin, Arie Cattan, Yoav Goldberg

While coreference resolution typically involves various linguistic challenges, recent models are based on a single pairwise scorer for all types of pairs. We present LingMess, a new coreference model that defines different categories of coreference cases and optimizes multiple pairwise scorers, where each scorer learns a specific set of linguistic challenges. Our model substantially improves pairwise scores for most categories and improves cluster-level performance on Ontonotes and 5 additional datasets. Our model is available at https://github.com/shon-otmazgin/lingmess-coref

en cs.CL

DOAJ Open Access 2022

El lenguaje y las barreras que impiden la comunicación [Language and the barriers that impede communication]

Rosario Rodríguez del Busto, Emanuel Exequiel Guanco

Communication is one of humanity's greatest challenges, since even when people share the same system, communication is not always possible. On many occasions, the use that speakers make of language reproduces this problem. For this reason, the topic we address is social discourses about language, placed in dialogue with two theories: Saussurean theory and Peircean theory. As a starting point, we analyze, from these two perspectives, certain barriers and inequalities produced by the use of language.

Social sciences (General), Discourse analysis

DOAJ Open Access 2022

Stereotypes of beauty in Russian and Uzbek linguistic cultures (based on Russian and Uzbek proverbs)

R. K. Abdullaeva

The article analyzes paremic units and describes how the aesthetic evaluation of the concept of “beauty” is represented in the Russian and Uzbek proverbial linguistic pictures of the world. It considers proverbs with the semantics of “beauty” and their role in reflecting a person’s spiritual, moral, and socio-cultural worldview. Drawing on Russian and Uzbek proverbs, it examines the assessment of the concept of “beauty” (appearance) against the background of other types of assessment (character qualities: moral and aesthetic; human actions: ethical and aesthetic; mind/soul: intellectual and aesthetic; physiological characteristics: valeological and aesthetic; activity: pragmatic and aesthetic). For the compared languages, it establishes similarities and differences in how a person’s aesthetic assessment is represented on the basis of the following signs: appearance and act, act and state, state and conclusion, the presence of beauty and a quality of character, and a person’s nomination in accordance with the aesthetic standards of beauty. It identifies an assessment of a person’s spiritual characteristics (method of action, nature of actions, attitude towards others, quality of character) that is interconnected with the assessment of visual behavior. Finally, it compares the representative language units and their semantic features (anthropocentric, personal-individual, value-evaluative, and emotional-evaluative) characteristic of each language across five thematic groups (“Possession of beauty”, “Attractive force, a good phenomenon desired by the heart and soul”, “Pure and flawless”, “Unprecedented, amazing”, “Attitude towards others/towards oneself”).

Philology. Linguistics

DOAJ Open Access 2022

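Analisis Semiotika Roland Barthes Pada Iklan Pocari Sweat Versi Ramadhan 1442 H [Roland Barthes's Semiotic Analysis of the Ramadan 1442 H Version of the Pocari Sweat Advertisement]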

Lukman Hakim, Oktavia Monalisa

Advertising is one way companies promote their products or services to the public. The object analyzed in this research is the Ramadan 1442 H version of the Pocari Sweat advertisement. As a commercial product, Pocari Sweat also relies on promotion, so the company made an advertisement video with a Ramadan theme. The purpose of this study was to analyze the audiovisual elements of the Ramadan 1442 H version of the Pocari Sweat advertisement. This study used a descriptive qualitative method with Roland Barthes' semiotic analysis approach. The results indicate that the advertisement intends to improve the company's image by using television advertising and by featuring rising actresses as its stars. In addition, the advertisement promotes the product through the message that Pocari Sweat is very useful for those who are fasting.

Communication. Mass media

arXiv Open Access 2021

LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

Jingjing Jiang, Ziyi Liu, Nanning Zheng

Video Question Answering (VideoQA), aiming to correctly answer the given question based on understanding multi-modal video content, is challenging due to the rich video content. From the perspective of video understanding, a good VideoQA framework needs to understand the video content at different semantic levels and flexibly integrate the diverse video content to distill question-related content. To this end, we propose a Lightweight Visual-Linguistic Reasoning framework named LiVLR. Specifically, LiVLR first utilizes the graph-based Visual and Linguistic Encoders to obtain multi-grained visual and linguistic representations. Subsequently, the obtained representations are integrated with the devised Diversity-aware Visual-Linguistic Reasoning module (DaVL). The DaVL considers the difference between the different types of representations and can flexibly adjust the importance of different types of representations when generating the question-related joint representation, which is an effective and general representation integration method. The proposed LiVLR is lightweight and shows its performance advantage on two VideoQA benchmarks, MSRVTT-QA and KnowIT VQA. Extensive ablation studies demonstrate the effectiveness of LiVLR's key components.

en cs.CV
