The performance of Language Models (LMs) on low-resource, morphologically rich languages like Sinhala remains largely unexplored, particularly regarding script variation in digital communication. Sinhala exhibits script duality: Unicode is used in formal contexts, Romanized text dominates social media, and mixed-script usage is common in practice. This paper benchmarks 24 open-source LMs on Unicode, Romanized, and mixed-script Sinhala using perplexity evaluation across diverse text sources. Results reveal substantial script sensitivity, with median performance degrading more than 300-fold from Unicode to Romanized text. Critically, model size shows no correlation with script-handling competence: smaller models often outperform architectures 28 times larger. Unicode performance strongly predicts mixed-script robustness but not Romanized capability, demonstrating that single-script evaluation substantially underestimates real-world deployment challenges. These findings establish baseline LM capabilities for Sinhala and provide practical guidance for model selection in multi-script, low-resource environments.
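The abstract does not spell out the scoring protocol; a minimal sketch of per-script perplexity with a causal LM follows. The "gpt2" placeholder, the Sinhala sample sentences, and the truncation setup are illustrative assumptions, not the paper's configuration.

```python
# Minimal per-script perplexity sketch. "gpt2" is a stand-in for the
# 24 benchmarked LMs, and the sample sentences are illustrative.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Exponentiated mean token cross-entropy of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        # With labels == input_ids, the model returns the (shifted)
        # mean cross-entropy loss over the sequence.
        loss = model(input_ids=ids, labels=ids).loss
    return math.exp(loss.item())

samples = {
    "unicode": "මම පොත කියවමි.",         # Sinhala script
    "romanized": "mama potha kiyawami.",  # Romanized Sinhala
    "mixed": "mama පොත kiyawami.",        # mixed script
}
for script, text in samples.items():
    print(script, round(perplexity(text), 1))
```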
Manurag Khullar, Utkarsh Desai, Poorva Malviya, et al.
Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. Speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely quantifies or evaluates this orthographic variation in real-world applications. We investigate how romanization impacts the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real-world dataset of user-generated health queries spanning five Indian languages and Nepali. Our results reveal consistent performance degradation on romanized messages, with gaps reaching up to 24 points across languages and models; at our partner maternal health organization alone, this gap could cause nearly 2 million excess triage errors. We propose and evaluate an Uncertainty-based Selective Routing method to close this script gap. Our findings highlight a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably.
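The routing rule is not detailed in the abstract; one plausible reading is a confidence threshold that reroutes low-certainty romanized queries through a fallback path. A toy sketch under that assumption, with invented placeholder functions:

```python
# Hypothetical uncertainty-based selective routing sketch: accept
# confident triage predictions, reroute uncertain ones via a fallback.
# `classify_triage` and `transliterate_to_native` are invented
# placeholders for the real LLM call and fallback path.
from dataclasses import dataclass

@dataclass
class TriageResult:
    label: str         # e.g. "emergency" or "routine"
    confidence: float  # model probability of the predicted label

def classify_triage(message: str) -> TriageResult:
    # Dummy stand-in for the LLM triage call.
    urgent = any(w in message.lower() for w in ("bleeding", "khoon"))
    return TriageResult("emergency" if urgent else "routine",
                        0.95 if urgent else 0.60)

def transliterate_to_native(message: str) -> str:
    return message  # placeholder: romanized -> native-script rendering

def route(message: str, threshold: float = 0.85) -> TriageResult:
    """Accept confident predictions; otherwise retry on native script."""
    first = classify_triage(message)
    if first.confidence >= threshold:
        return first
    second = classify_triage(transliterate_to_native(message))
    return max(first, second, key=lambda r: r.confidence)

print(route("bahut khoon beh raha hai"))  # confident: emergency
```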
The use of derogatory terms in code-mixed languages such as Roman Urdu presents challenges for Natural Language Processing systems due to unstated grammar, inconsistent spelling, and a scarcity of labeled data. In this work, we propose a QLoRA-based fine-tuning framework to improve offensive language detection in Roman Urdu-English text. We translated the Roman Urdu-English code-mixed dataset into English using Google Translate to leverage English LLMs, while acknowledging that this translation reduces direct engagement with code-mixing features. Our focus is on classification performance using English-translated, low-resource inputs. We fine-tuned several transformer and large language models, including Meta LLaMA 3 8B, Mistral 7B v0.1, LLaMA 2 7B, ModernBERT, and RoBERTa, with QLoRA for memory-efficient adaptation. Models were trained and evaluated on a manually annotated Roman Urdu dataset for offensive vs. non-offensive content. Of all tested models, Meta LLaMA 3 8B attained the highest F1 score of 91.45, followed by Mistral 7B at 89.66, surpassing traditional transformer baselines. These results demonstrate the efficacy of QLoRA for fine-tuning high-performing models in low-resource settings such as code-mixed offensive language detection, and confirm the potential of LLMs for this task. This work advances a scalable approach to Roman Urdu moderation and paves the way for future LLM-based multilingual offensive language detection systems.
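For readers unfamiliar with the recipe, here is a hedged sketch of a QLoRA setup of the kind described, using the Hugging Face transformers and peft libraries; the checkpoint, hyperparameters, and target modules are illustrative assumptions, not the paper's exact configuration:

```python
# Sketch of the QLoRA recipe described above: a 4-bit quantized base
# model with LoRA adapters for binary offensive-language classification.
# Checkpoint, hyperparameters, and target modules are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)
model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # gated checkpoint; shown for illustration
    num_labels=2,                  # offensive vs. non-offensive
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapters train
```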
The diversity and complexity of Indic languages present unique challenges for natural language processing (NLP) tasks, particularly in the domain of question answering (QA). To address these challenges, this paper explores the application of State Space Models (SSMs) to build efficient and contextually aware QA systems tailored for Indic languages. SSMs are particularly suited for this task due to their ability to model long-term and short-term dependencies in sequential data, making them well-equipped to handle the rich morphology, complex syntax, and contextual intricacies characteristic of Indian languages. We evaluated multiple SSM architectures across diverse datasets representing various Indic languages and conducted a comparative analysis of their performance. Our results demonstrate that these models effectively capture linguistic subtleties, leading to significant improvements in question interpretation, context alignment, and answer generation. This work represents the first application of SSMs to question answering tasks in Indic languages, establishing a foundational benchmark for future research in this domain. We propose enhancements to existing SSM frameworks, optimizing their applicability to low-resource settings and multilingual scenarios prevalent in Indic languages.
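The abstract leaves the model family abstract; for context, the discrete-time linear recurrence that SSM layers build on can be sketched as below (matrices and dimensions are toy choices, not a tuned architecture):

```python
# Toy discretized linear state-space recurrence underlying SSM layers:
#   x_k = A x_{k-1} + B u_k,   y_k = C x_k
# Matrices and dimensions are illustrative, not a tuned architecture.
import numpy as np

rng = np.random.default_rng(0)
d_state, seq_len = 16, 10
A = 0.9 * np.eye(d_state)          # state transition (stable toy choice)
B = rng.normal(size=(d_state, 1))  # input projection
C = rng.normal(size=(1, d_state))  # output projection

u = rng.normal(size=seq_len)       # input sequence (e.g. token features)
x = np.zeros((d_state, 1))         # hidden state carries long-range context
outputs = []
for k in range(seq_len):
    x = A @ x + B * u[k]           # constant-time update per token
    outputs.append((C @ x).item())
print(outputs)
```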
Muhammad Ahmad, Muhammad Waqas, Ameer Hamza, et al.
Hope is a positive emotional state involving the expectation of favorable future outcomes, while hope speech refers to communication that promotes optimism, resilience, and support, particularly in adverse contexts. Although hope speech detection has gained attention in Natural Language Processing (NLP), existing research mainly focuses on high-resource languages and standardized scripts, often overlooking informal and underrepresented forms such as Roman Urdu. To the best of our knowledge, this is the first study to address hope speech detection in code-mixed Roman Urdu by introducing a carefully annotated dataset, thereby filling a critical gap in inclusive NLP research for low-resource, informal language varieties. This study makes four key contributions: (1) it introduces the first multi-class annotated dataset for Roman Urdu hope speech, comprising Generalized Hope, Realistic Hope, Unrealistic Hope, and Not Hope categories; (2) it explores the psychological foundations of hope and analyzes its linguistic patterns in code-mixed Roman Urdu to inform dataset development; (3) it proposes a custom attention-based transformer model optimized for the syntactic and semantic variability of Roman Urdu, evaluated using 5-fold cross-validation; and (4) it verifies the statistical significance of performance gains using a t-test. The proposed model, XLM-R, achieves the best performance with a cross-validation score of 0.78, outperforming the SVM baseline (0.75) and the BiLSTM (0.76), with gains of 4% and 2.63%, respectively.
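A minimal sketch of contribution (4), a paired t-test over shared cross-validation folds; the per-fold scores below are invented placeholders near the reported means, not the paper's numbers:

```python
# Sketch of the evaluation protocol: 5-fold cross-validation for two
# systems, then a paired t-test over per-fold scores. The fold scores
# are illustrative placeholders near the reported 0.78 vs. 0.75.
import numpy as np
from scipy.stats import ttest_rel

# Per-fold scores of the proposed model and the SVM baseline
# (one score per fold of the same 5-fold split).
xlmr_scores = np.array([0.79, 0.76, 0.78, 0.79, 0.78])
svm_scores = np.array([0.76, 0.74, 0.75, 0.76, 0.74])

# Paired test: folds are shared between systems, so scores are paired.
t_stat, p_value = ttest_rel(xlmr_scores, svm_scores)
print(f"mean gain = {np.mean(xlmr_scores - svm_scores):.3f}, "
      f"t = {t_stat:.2f}, p = {p_value:.4f}")  # significant if p < 0.05
```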
Denominal verbs, in spite of their name, can be derived from either a noun or a root. In languages that are not morphologically transparent, only semantic cues distinguish the two classes, i.e., whether the existence of the corresponding noun is entailed (Kiparsky 1982, 1997). In this work, we present a novel semantic test, the first attempt at distinguishing noun-derived from root-derived Instrumental Denominal Verbs (IDV) on a purely semantic basis, overcoming the flaws observed in previous syntactic tests. By explicitly asking Italian native speakers to name the instruments that can be used to perform the action denoted by the verb, we measured the entailment of existence through the number of instrument nouns produced and the frequency of production of the corresponding instrument noun. Our test also included parasynthetic verbs, whose behavior was influenced by the interaction between their derivation process and their meaning.
Atnafu Lambebo Tonja, Olga Kolesnikova, Alexander Gelbukh, et al.
Recent research in natural language processing (NLP) has achieved impressive performance in tasks such as machine translation (MT), news classification, and question-answering in high-resource languages. However, the performance of MT leaves much to be desired for low-resource languages. This is due to the smaller size of available parallel corpora in these languages, if such corpora are available at all. NLP in Ethiopian languages suffers from the same issues due to the unavailability of publicly accessible datasets for NLP tasks, including MT. To help the research community and foster research for Ethiopian languages, we introduce EthioMT -- a new parallel corpus for 15 languages. We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia. We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.
In this paper, we discuss a first attempt to build a language-learning game for Brazilian indigenous languages and the challenges around it. We present a design for the tool with gamification aspects. We then describe a process to automatically generate language exercises and questions from a dependency treebank and a lexical database for Tupian languages. We discuss the limitations of our prototype, highlighting ethical and practical implementation concerns. Finally, we conclude that new data-gathering processes should be established in partnership with indigenous communities and oriented toward educational purposes.
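A toy sketch of the exercise-generation step: parse one CoNLL-U sentence and blank a token of a chosen part of speech to form a cloze question. The two-token sample sentence is invented for illustration, not drawn from the Tupian resources:

```python
# Sketch of cloze-exercise generation from a CoNLL-U sentence, in the
# spirit of the treebank-driven process described above. The sample
# sentence is invented for illustration.
SAMPLE_CONLLU = """\
1\takaray\takaray\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\to'a\t'a\tVERB\t_\t_\t0\troot\t_\t_
"""

def parse_conllu(block: str):
    """Parse tab-separated CoNLL-U lines into token dicts."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        tokens.append({"form": cols[1], "lemma": cols[2], "upos": cols[3]})
    return tokens

def make_cloze(tokens, target_upos="NOUN"):
    """Blank the first token with the target POS; return (question, answer)."""
    words = [t["form"] for t in tokens]
    for i, t in enumerate(tokens):
        if t["upos"] == target_upos:
            answer = words[i]
            words[i] = "____"
            return " ".join(words), answer
    return None

question, answer = make_cloze(parse_conllu(SAMPLE_CONLLU))
print(question, "->", answer)  # ____ o'a -> akaray
```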
In Central Catalan, the third-person plural dative clitic pronoun is realized as els, els hi, li, or lis. This variation phenomenon has mostly been treated in relation to various pronominal combinations, but rarely in isolation or on the basis of usage data. We analyze longitudinally the use of these forms, and their relation to speakers' initial language, in samples from 69 informants from Manlleu and Mataró collected in the 1st and 4th years of compulsory secondary education (ESO) for the assessment of oral competence in an interactive context. Significant differences are observed in the use of the variants by initial Catalan speakers, Spanish speakers, and bilinguals. The results indicate that a language change may be in progress and that the sociolinguistic dimension associated with each form is possibly acquired at this stage. The variants would follow this order, from most to least marked: lis, els, li, els hi. The results encourage further studies of this and other variation phenomena in contemporary Catalan, studies that consider independent variables related to the initial language, such as the language used with parents or the language of identification, in order to contribute to (socio)linguistics and to the empirical grounding of linguistic analyses of anonymous texts.
Daniel Lundén, Lars Hummelgren, Jan Kudlicka, et al.
Universal probabilistic programming languages (PPLs) make it relatively easy to encode and automatically solve statistical inference problems. To solve inference problems, PPL implementations often apply Monte Carlo inference algorithms that rely on execution suspension. State-of-the-art solutions enable execution suspension either through (i) continuation-passing style (CPS) transformations or (ii) efficient, but comparatively complex, low-level solutions that are often not available in high-level languages. CPS transformations introduce overhead due to unnecessary closure allocations -- a problem the PPL community has generally overlooked. To reduce overhead, we develop a new efficient selective CPS approach for PPLs. Specifically, we design a novel static suspension analysis technique that determines parts of programs that require suspension, given a particular inference algorithm. The analysis allows selectively CPS transforming the program only where necessary. We formally prove the correctness of the analysis and implement the analysis and transformation in the Miking CorePPL compiler. We evaluate the implementation for a large number of Monte Carlo inference algorithms on real-world models from phylogenetics, epidemiology, and topic modeling. The evaluation results demonstrate significant improvements across all models and inference algorithms.
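To make the selective aspect concrete, here is a toy Python illustration (not the Miking CorePPL implementation): only code that can reach the suspension point is rewritten to pass a continuation, so pure helpers keep their direct style and allocate no closures:

```python
# Toy illustration of selective CPS. `observe_cps` is the only
# suspension point, so only code that can reach it is CPS-transformed;
# the pure helper stays in direct style (no closure allocation).

def square(x):             # pure: never suspends, left in direct style
    return x * x

def observe_cps(weight, k):
    # Suspends: hand the continuation back to the inference engine,
    # which may resume it later (e.g. after SMC resampling).
    return ("suspend", weight, k)

def model_cps(x, k):       # selectively CPS-transformed model
    y = square(x)                        # direct call: no closure
    return observe_cps(y, lambda: k(y))  # closure needed here only

# A minimal driver playing the role of the inference engine:
state = model_cps(3, lambda y: ("done", y))
while state[0] == "suspend":
    _, weight, resume = state
    state = resume()       # a real engine would weight/resample here
print(state)               # ('done', 9)
```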
Atnafu Lambebo Tonja, Hellina Hailu Nigatu, Olga Kolesnikova, et al.
This paper describes CIC NLP's submission to the AmericasNLP 2023 Shared Task on machine translation systems for indigenous languages of the Americas. We present the system descriptions for three methods. We used two multilingual models, M2M-100 and mBART50, and one bilingual (one-to-one) model, the Helsinki-NLP Spanish-English translation model, and experimented with different transfer-learning setups. We experimented with 11 languages of the Americas and report the setups we used as well as the results we achieved. Overall, the mBART setup was able to improve upon the baseline for three out of the eleven languages.
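For reference, the M2M-100 baseline can be driven as below; the checkpoint size and the Spanish-to-English example are assumptions, and the submitted systems were additionally fine-tuned toward indigenous target languages that M2M-100 does not cover out of the box:

```python
# Driving the M2M-100 multilingual baseline. The 418M checkpoint and
# the es->en example are illustrative; the submitted systems were
# further fine-tuned toward the indigenous target languages.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "es"  # Spanish is the source side in the shared task
encoded = tokenizer("La vida es bella.", return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("en"),  # target language token
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```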
Daniel Lundén, Gizem Çaylak, Fredrik Ronquist, et al.
Probabilistic Programming Languages (PPLs) allow users to encode statistical inference problems and automatically apply an inference algorithm to solve them. Popular inference algorithms for PPLs, such as sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC), are built around checkpoints -- relevant events for the inference algorithm during the execution of a probabilistic program. Deciding the location of checkpoints is, in current PPLs, not done optimally. To solve this problem, we present a static analysis technique that automatically determines checkpoints in programs, relieving PPL users of this task. The analysis identifies a set of checkpoints that execute in the same order in every program run -- they are aligned. We formalize alignment, prove the correctness of the analysis, and implement the analysis as part of the higher-order functional PPL Miking CorePPL. By utilizing the alignment analysis, we design two novel inference algorithm variants: aligned SMC and aligned lightweight MCMC. We show, through real-world experiments, that they significantly improve inference execution time and accuracy compared to standard PPL versions of SMC and MCMC.
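A toy bootstrap SMC loop illustrating what alignment buys: every particle reaches the same checkpoint in the same order, so the weights compared at resampling are commensurable. The model and observations are invented for illustration:

```python
# Toy SMC with aligned checkpoints: every particle hits the same
# observe point in the same order, so resampling stays synchronized.
# Model and observations are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_particles, observations = 1000, [0.9, 1.1, 1.0]

particles = rng.normal(size=n_particles)  # initial latent states
for obs in observations:                  # aligned checkpoints
    particles = particles + rng.normal(scale=0.1, size=n_particles)
    log_w = -0.5 * (obs - particles) ** 2  # Gaussian log-likelihood weight
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Resample at the aligned checkpoint: all particles sit at the
    # same program point, so their weights are comparable.
    particles = particles[rng.choice(n_particles, size=n_particles, p=w)]

print(particles.mean())  # posterior mean estimate
```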
Atnafu Lambebo Tonja, Tadesse Destaw Belay, Israel Abebe Azime, et al.
This survey delves into the current state of natural language processing (NLP) for four Ethiopian languages: Amharic, Afaan Oromo, Tigrinya, and Wolaytta. Through this paper, we identify key challenges and opportunities for NLP research in Ethiopia. Furthermore, we provide a centralized repository on GitHub that contains publicly available resources for various NLP tasks in these languages. This repository can be updated periodically with contributions from other researchers. Our objective is to identify research gaps and disseminate the information to NLP researchers interested in Ethiopian languages and encourage future research in this domain.
At the interdisciplinary intersection of history and linguistics, this contribution analyzes oral history interviews as a central resource for both disciplines. The case study examines the language of remembering in Italian and French life-story interviews from the archive Zwangsarbeit 1939-1945 (Forced Labor 1939-1945) and the LangAge corpus. In five steps, it implements the approach established in oral history research of developing hypotheses from the data, using corpus-linguistic methods: the occurrences of the personal pronouns we and I are compared, and keywords and frequent word combinations (n-grams) are identified. These quantitative analyses are deepened through exemplary analyses of frequent expressions such as 'I remember' or 'at a certain moment'. In doing so, the biographical approach of oral history and historical contextual knowledge are brought together with the results of the corpus-linguistic analysis. Differing transcription standards between oral history and linguistics prove problematic in this respect. In the future, the new research environment Oral-History.Digital will support such cross-collection yet source-oriented analyses of life-story interviews.
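As a pointer to the corpus-linguistic step described (keyword and n-gram extraction), a minimal n-gram frequency sketch; the tokenization is deliberately naive and the sample utterance is invented:

```python
# Minimal n-gram frequency count of the kind used in the study
# (tokenization is naive; the sample French line is invented).
from collections import Counter

def ngrams(tokens, n):
    """Yield all n-grams of a token list as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

utterance = "je me souviens je me souviens de ce moment".split()
bigrams = Counter(ngrams(utterance, 2))
print(bigrams.most_common(3))  # e.g. ('je', 'me') occurs twice
```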
Probabilistic programming languages (PPLs) allow users to encode arbitrary inference problems, and PPL implementations provide general-purpose automatic inference for these problems. However, constructing inference implementations that are efficient enough is challenging for many real-world problems. Often, this is due to PPLs not fully exploiting available parallelization and optimization opportunities. For example, handling probabilistic checkpoints in PPLs through continuation-passing style transformations or non-preemptive multitasking -- as is done in many popular PPLs -- often disallows compilation to low-level languages required for high-performance platforms such as GPUs. To solve the checkpoint problem, we introduce the concept of PPL control-flow graphs (PCFGs) -- a simple and efficient approach to checkpoints in low-level languages. We use this approach to implement RootPPL: a low-level PPL built on CUDA and C++ with OpenMP, providing highly efficient and massively parallel SMC inference. We also introduce a general method of compiling universal high-level PPLs to PCFGs and illustrate its application when compiling Miking CorePPL -- a high-level universal PPL -- to RootPPL. The approach is the first to compile a universal PPL to GPUs with SMC inference. We evaluate RootPPL and the CorePPL compiler through a set of real-world experiments in the domains of phylogenetics and epidemiology, demonstrating up to 6x speedups over state-of-the-art PPLs implementing SMC inference.
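A schematic Python rendering of the PCFG idea (the real target is CUDA/C++): the program is split into basic blocks at checkpoints, and each block returns its successor's index, so the runtime can pause between blocks without CPS or multitasking. Blocks and state are invented examples:

```python
# Schematic PCFG execution: the probabilistic program is split into
# basic blocks at checkpoints; each block mutates the particle state
# and returns the next block's index, so the runtime can pause between
# blocks (no CPS, no threads). Blocks and state are invented examples.
import random

def block0(state):
    state["x"] = random.gauss(0.0, 1.0)  # sample
    return 1                             # next block

def block1(state):
    state["weight"] *= 0.5               # checkpoint: weight update
    return 2 if state["x"] > 0 else -1   # -1 terminates

def block2(state):
    state["x"] += 1.0
    return -1

BLOCKS = [block0, block1, block2]

def run_particle():
    state, pc = {"weight": 1.0}, 0
    while pc != -1:
        pc = BLOCKS[pc](state)           # the runtime could stop here
    return state

print(run_particle())
```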
Transformer-based language models such as BERT have outperformed previous models on a large number of English benchmarks, but their evaluation is often limited to English or a small number of well-resourced languages. In this work, we evaluate monolingual, multilingual, and randomly initialized language models from the BERT family on a variety of Uralic languages, including Estonian, Finnish, Hungarian, Erzya, Moksha, Karelian, Livvi, Komi Permyak, Komi Zyrian, Northern Sámi, and Skolt Sámi. Where monolingual models are available (currently only et, fi, hu), they perform better on their native language, but in general they transfer worse than multilingual models or models of genetically unrelated languages that share the same character set. Remarkably, straightforward transfer of high-resource models, even without special efforts toward hyperparameter optimization, yields what appear to be state-of-the-art POS and NER tools for the minority Uralic languages where there is sufficient data for fine-tuning.
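The transfer recipe amounts to putting a token-classification head on an off-the-shelf multilingual encoder and fine-tuning on the minority-language data; a minimal sketch in which the model choice, label count, and hyperparameters are assumptions:

```python
# Sketch of straightforward transfer: a POS head on multilingual BERT,
# fine-tuned on a minority-language treebank. Model choice, label
# count, and hyperparameters are assumptions, not the paper's setup.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=17,  # e.g. the 17 Universal Dependencies POS tags
)
args = TrainingArguments(
    output_dir="pos-sme",  # hypothetical Northern Sámi POS run
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
# With tokenized, label-aligned treebank splits train_ds / dev_ds:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=dev_ds)
# trainer.train()
```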
Brian Yan, Siddharth Dalmia, David R. Mortensen, et al.
Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages. While speech annotations at the language-specific phoneme or surface levels are readily available, annotations at a universal phone level are relatively rare and difficult to produce. In this work, we present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings with learnable weights represented using weighted finite-state transducers, which we call differentiable allophone graphs. By training multilingually, we build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language. These phone-based systems with learned allophone graphs can be used by linguists to document new languages, build phone-based lexicons that capture rich pronunciation variations, and re-evaluate the allophone mappings of seen languages. We demonstrate the aforementioned benefits of our proposed framework with a system trained on 7 diverse languages.
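A toy PyTorch sketch of the core mechanism: learnable arc weights, masked to the licensed phone-to-phoneme pairs and normalized per phoneme, map phone posteriors to phoneme posteriors. Sizes and the mask are invented; the paper's actual graphs are weighted finite-state transducers:

```python
# Toy differentiable allophone mapping: learnable arc weights, masked
# to the allowed phone->phoneme pairs and normalized per phoneme, turn
# phone posteriors into phoneme posteriors. Sizes and mask are invented.
import torch

n_phones, n_phonemes = 5, 3
# mask[i, j] = 1 if phone i is a licensed allophone of phoneme j.
mask = torch.tensor([[1, 0, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1],
                     [0, 1, 1]], dtype=torch.float)
arc_logits = torch.nn.Parameter(torch.zeros(n_phones, n_phonemes))

def phoneme_posterior(phone_post):
    """Map a phone posterior (n_phones,) to a phoneme posterior."""
    weights = arc_logits.exp() * mask       # keep only licensed arcs
    weights = weights / weights.sum(dim=0)  # normalize per phoneme column
    return phone_post @ weights             # shape: (n_phonemes,)

phone_post = torch.softmax(torch.randn(n_phones), dim=0)
print(phoneme_posterior(phone_post))        # differentiable in arc_logits
```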
This paper focuses on cortadora relative clauses, or non-pronominal relative clauses, a special type of relativization attested in Rio de la Plata Spanish varieties in which the preposition is deleted or 'chopped' ('cortada'). This phenomenon is an example of preposition optionality and has previously been studied in Brazilian and European Portuguese, and in French, within the framework of generative grammar. The main goal of this paper is to show that this syntactic-discursive phenomenon is essentially the same in all three of these closely related Romance languages, following the works of Kato (2010), Valer (2008), and Tarallo (1983) for Portuguese, Bouchard (1981) for French, and Caviglia & Malcuori (2007) for Spanish. Only inherent prepositions (a, de, con, en, and por) can be deleted, and the position of the topic in Topic Phrases is the one relativised in cortadora relative clauses. The data used for the Spanish examples come from the COLEM-Argentina and Uruguay corpora (COLEM: Corpus oral de la lengua española en Montreal).