Results for "Ural-Altaic languages"

Showing 20 of ~218,052 results · from CrossRef, arXiv, DOAJ

arXiv Open Access 2026
Building a Strong Instruction Language Model for a Less-Resourced Language

Domen Vreš, Tjaša Arčon, Timotej Petrič et al.

Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to the Slovene language using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pretraining tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena. We show that the described model outperforms 12B Gemma 3 across all three scenarios and performs comparably to the much larger commercial GPT-4o in the Slovene LLM arena, achieving a win rate of over 60%.

en cs.CL, cs.LG
arXiv Open Access 2025
Multilingual State Space Models for Structured Question Answering in Indic Languages

Arpita Vats, Rahul Raja, Mrinal Mathur et al.

The diversity and complexity of Indic languages present unique challenges for natural language processing (NLP) tasks, particularly in the domain of question answering (QA). To address these challenges, this paper explores the application of State Space Models (SSMs) to build efficient and contextually aware QA systems tailored for Indic languages. SSMs are particularly suited for this task due to their ability to model long-term and short-term dependencies in sequential data, making them well-equipped to handle the rich morphology, complex syntax, and contextual intricacies characteristic of Indian languages. We evaluated multiple SSM architectures across diverse datasets representing various Indic languages and conducted a comparative analysis of their performance. Our results demonstrate that these models effectively capture linguistic subtleties, leading to significant improvements in question interpretation, context alignment, and answer generation. This work represents the first application of SSMs to question answering tasks in Indic languages, establishing a foundational benchmark for future research in this domain. We propose enhancements to existing SSM frameworks, optimizing their applicability to low-resource settings and multilingual scenarios prevalent in Indic languages.
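The linear recurrence at the heart of such state space models can be sketched in a few lines; the matrices, state size, and inputs below are invented toy values, not anything from the paper:

```python
# Hedged sketch of the core SSM recurrence: a linear state update that
# carries both long- and short-term context along a sequence.

def ssm_step(A, B, C, x, u):
    """One step of a discrete linear SSM: x' = A x + B u, y = C x'."""
    x_new = [sum(a * xi for a, xi in zip(row, x)) + b * u
             for row, b in zip(A, B)]
    y = sum(c * xi for c, xi in zip(C, x_new))
    return x_new, y

def ssm_scan(A, B, C, inputs):
    """Run the recurrence over a whole input sequence, returning all outputs."""
    x = [0.0] * len(B)
    outputs = []
    for u in inputs:
        x, y = ssm_step(A, B, C, x, u)
        outputs.append(y)
    return outputs

# Toy 2-dimensional state: the slowly decaying first mode retains
# long-range information, the second mode reacts only to recent inputs.
A = [[0.99, 0.0], [0.0, 0.5]]
B = [1.0, 1.0]
C = [0.5, 0.5]
ys = ssm_scan(A, B, C, [1.0] + [0.0] * 9)
```

Even nine steps after the single input impulse, the slow mode keeps most of the signal alive, which is the mechanism the abstract credits for handling long-distance dependencies.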

en cs.CL, cs.AI
arXiv Open Access 2025
FUSE: A Ridge and Random Forest-Based Metric for Evaluating MT in Indigenous Languages

Rahul Raja, Arpita Vats

This paper presents the winning submission of the RaaVa team to the AmericasNLP 2025 Shared Task 3 on Automatic Evaluation Metrics for Machine Translation (MT) into Indigenous Languages of America, where our system ranked first overall based on average Pearson correlation with the human annotations. We introduce the Feature-Union Scorer for Evaluation (FUSE), which integrates Ridge regression and Gradient Boosting to model translation quality. In addition to FUSE, we explore five alternative approaches leveraging different combinations of linguistic similarity features and learning paradigms. FUSE highlights the effectiveness of combining lexical, phonetic, semantic, and fuzzy token similarity with learning-based modeling to improve MT evaluation for morphologically rich and low-resource languages. MT into Indigenous languages poses unique challenges due to polysynthesis, complex morphology, and non-standardized orthography. Conventional automatic metrics such as BLEU, TER, and ChrF often fail to capture deeper aspects such as semantic adequacy and fluency. Our proposed framework incorporates multilingual sentence embeddings and phonological encodings to better align with human evaluation. We train supervised models on human-annotated development sets and evaluate on held-out test data. Results show that FUSE consistently achieves higher Pearson and Spearman correlations with human judgments, offering a robust and linguistically informed solution for MT evaluation in low-resource settings.
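As a rough illustration of the learning-based side of this approach, the sketch below fits a ridge regressor (by plain gradient descent) over a union of similarity features and scores it with Pearson correlation. The feature values, targets, and hyperparameters are all invented assumptions, not the authors' system:

```python
import math

def ridge_fit(X, y, lam=0.1, lr=0.05, steps=2000):
    """Fit w minimizing (1/n)||Xw - y||^2 + lam ||w||^2 by gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for row, target in zip(X, y):
            err = sum(wi * xi for wi, xi in zip(w, row)) - target
            for j in range(d):
                grad[j] += 2.0 * err * row[j]
        for j in range(d):
            w[j] -= lr * (grad[j] / n + 2.0 * lam * w[j])
    return w

def pearson(a, b):
    """Pearson correlation between two equal-length score lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    denom = math.sqrt(sum((x - ma) ** 2 for x in a)
                      * sum((z - mb) ** 2 for z in b))
    return cov / denom

# One invented row per segment: [lexical, phonetic, semantic, fuzzy-token]
# similarity scores, with invented human adequacy judgments as targets.
X = [[0.9, 0.8, 0.85, 0.9],
     [0.4, 0.5, 0.45, 0.4],
     [0.7, 0.6, 0.7, 0.65],
     [0.2, 0.3, 0.25, 0.2]]
y = [0.95, 0.45, 0.7, 0.2]
w = ridge_fit(X, y)
preds = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
```

On this toy data the fitted scores correlate almost perfectly with the targets, which is exactly the quantity (segment-level Pearson against human judgments) the shared task uses to rank metrics.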

en cs.CL
DOAJ Open Access 2025
Kirjalik soome keel: kas ehe rahvakeel või midagi hoopis muud? (Written Finnish: genuine vernacular, or something else entirely?)

Kaisa Häkkinen

The first Estonian book was printed 500 years ago; no copies of the book have survived. The history of publishing in Finnish dates to 1543, with Mikael Agricola’s Abc-book. Agricola went on to publish eight additional works. This development was driven by the Lutheran Reformation on both sides of the Gulf of Finland. This article discusses the reasons why the paths of development diverged so markedly between the sister languages Finnish and Estonian. Finland was part of Sweden, but remained geographically distinct. A Swedish-speaking population lived along the coast, but these were ordinary peasants like their Finnish neighbours rather than a social upper class. Estonia, in contrast, had a German-speaking minority that was permanently present throughout the country as a dominant force. The clergy was also German-speaking. Finland formed a single diocese in the Church of Sweden. From the early 14th century onward, most bishops of Turku were born in Finland and spoke Finnish as their mother tongue. Local parish communities likely preferred priests with whom they could communicate in their own language. Swedish was not described as an official or socially superior language by Sebastian Münster in his Cosmographia (1544). Although scholars writing in Finnish knew the language, the literary register was different from everyday speech. Almost all texts were word-for-word translations from other languages, such as Latin, German, or Swedish. In a sense, the old literary Finnish was colonized from within by Finns themselves: Sweden had not yet had time to develop a strong national identity or sense of superiority, often combined with linguistic colonialism. Swedish-speaking administrators thus did not interfere with how Finns used their own language, but rather encouraged Finns to use pure Finnish. In the 19th century, in the spirit of national awakening, the language was thoroughly reformed in terms of both vocabulary and grammar.
When Finland declared independence in 1917, the state of the language was strong and stable. Today, however, the situation is shifting once again, as English is exerting more and more influence on everyday language use. It appears that Finns are once more adopting a foreign model at the expense of their own language.

Other Finnic languages and dialects
DOAJ Open Access 2025
Indefiniitpronoomenite "keegi" ja "miski" käändevormide varieerumine suulises keeles (Variation in the case forms of the indefinite pronouns "keegi" and "miski" in spoken language)

Annika Kängsepp

The article examines the extent of variation in the case forms of the indefinite pronouns keegi ‘someone’ and miski ‘something’ with the suffix -gi/-ki, and the factors influencing this variation, on the basis of two corpora of spoken Estonian: the Estonian Public Broadcasting Radio Corpus and the Estonian Podcast Corpus. In total, I analyzed 975 case forms, of which 487 were forms of keegi and 488 forms of miski. The results show that for keegi, forms with -gi/-ki after the case ending and forms with it before the case ending or between two identical case endings occurred at nearly equal rates, 54.2% and 45.8% respectively; for miski, the corresponding proportions were 85.3% and 14.7%. Statistically, the position of -gi/-ki in the case forms was most strongly influenced by speech tempo for keegi and by type of speech for miski. According to a mixed-effects binomial logistic regression model, 65% of the variation in the case forms is explained by individual differences between speakers. *** "Variation in the case forms of the indefinite pronouns keegi ‘someone’ and miski ‘something’ in spoken Estonian" *** In the case forms of the indefinite pronouns keegi ‘someone’, miski ‘something’, kumbki ‘either’, and ükski ‘none’, the -gi/-ki can be placed after the case ending (e.g., kellelegi), before the case ending (e.g., kellegile), between two case endings (e.g., kellelegile), or before and after the case ending (e.g., kellegilegi) (Rull 1917, Saareste 1923). This variation has a strong dialectal background: forms with -gi/-ki after the case ending have historically been common only in Southern and Northeast Estonia (Saareste 1955: 16). In this article, I used data from Estonian Public Broadcasting’s Radio Corpus (Lippus et al. 2023a) and the Estonian Podcast Corpus (Lippus et al. 2023b) to provide an overview of the extent of variation and to describe the factors influencing this variation.
Results indicate that for keegi, -gi/-ki appears after (54.2%) and before or between two case endings (45.8%) at nearly equal frequencies, while for miski, the proportions are 85.3% and 14.7%, respectively. The primary factors influencing this variation were speech tempo for keegi and polarity for miski.

Philology. Linguistics, Finnic. Baltic-Finnic
DOAJ Open Access 2025
Sergey Pürbü'nün Çeçek Manzumesi Üzerine Notlar (Notes on Sergey Pürbü's Poem Çeçek)

İlker Tosun

Until the first half of the 20th century, Tuvan existed as a Turkic variety without an alphabet, surviving only through oral culture. Although Tuvan oral heritage was collected and published by various researchers from the 19th century onward, no work written in Tuvan existed. For this reason, the adoption of a Latin-based alphabet in 1930 is considered the beginning of modern Tuvan literature. The first generation of Tuvan intellectuals raised in the Soviet Union produced works in keeping with the ideological framework, yet these works also maintained close ties with the oral literary tradition in both form and theme. In this context, Sergey Bakizovich Pürbü stands out as one of the founding figures of modern Tuvan literature. His poem Çeçek is considered the first poem of Tuvan literature and exemplifies the fusion of oral tradition and modern literature. While the work tells the love story of Çeçek and Mergen, in the background it treats Tuvan society's struggle against the feudal order and Manchu rule. The social devastation of a people crushed by taxes and subjected to the tortures known as "tos erii" forms the ground for both individual and collective resistance in the poem. This study aims to discuss the interaction between literary modernization and oral culture by examining modern Tuvan literature, which remains little known in Turkish Turkology, through the poem Çeçek.

Ural-Altaic languages
arXiv Open Access 2024
EthioMT: Parallel Corpus for Low-resource Ethiopian Languages

Atnafu Lambebo Tonja, Olga Kolesnikova, Alexander Gelbukh et al.

Recent research in natural language processing (NLP) has achieved impressive performance in tasks such as machine translation (MT), news classification, and question-answering in high-resource languages. However, the performance of MT leaves much to be desired for low-resource languages. This is due to the smaller size of available parallel corpora in these languages, if such corpora are available at all. NLP in Ethiopian languages suffers from the same issues due to the unavailability of publicly accessible datasets for NLP tasks, including MT. To help the research community and foster research for Ethiopian languages, we introduce EthioMT -- a new parallel corpus for 15 languages. We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia. We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.

en cs.CL
arXiv Open Access 2024
Building a Language-Learning Game for Brazilian Indigenous Languages: A Case of Study

Gustavo Polleti

In this paper we discuss a first attempt to build a language-learning game for Brazilian indigenous languages and the challenges around it. We present a design for the tool with gamification aspects. We then describe a process to automatically generate language exercises and questions from a dependency treebank and a lexical database for Tupian languages. We discuss the limitations of our prototype, highlighting ethical and practical implementation concerns. Finally, we conclude that new data-gathering processes should be established in partnership with indigenous communities and oriented toward educational purposes.

en cs.CL
DOAJ Open Access 2024
Mahmud al-Kashgari's Divanü Lügati't-Türk I-II-III (2024). Azerbaijan National Academy of Sciences, Nasimi Institute of Linguistics, Baku: Azerbaijan

Cemile

Divanu Lügati't Türk (DLT), written by Kashgarli Mahmud in the 11th century, has over time been translated into various languages and Turkic dialects and printed. Khalid Said Khojayev is one of the important researchers working on DLT. The first translation of DLT from Arabic into Latin letters belongs to Khalid Said. Khalid Said's translation is important as the first translation of DLT in Latin script, both in Azerbaijani Turkic and in the Turkic world. Khalid Said began the translation of the work in 1935 and finished it in 1936, but could not publish it. The translation of DLT into Azerbaijani was unearthed after 86 years. By decision of the Presidium of the Azerbaijan National Academy of Sciences dated 29 January 2024, it was prepared for publication by the Nasimi Institute of Linguistics. This study introduces the work published by the Presidium of the Azerbaijan National Academy of Sciences; it emphasises the features of Khalid Said's translation of DLT and the edition and structure of the work.

Language and Literature, Ural-Altaic languages
arXiv Open Access 2023
Natural Language Processing in Ethiopian Languages: Current State, Challenges, and Opportunities

Atnafu Lambebo Tonja, Tadesse Destaw Belay, Israel Abebe Azime et al.

This survey delves into the current state of natural language processing (NLP) for four Ethiopian languages: Amharic, Afaan Oromo, Tigrinya, and Wolaytta. Through this paper, we identify key challenges and opportunities for NLP research in Ethiopia. Furthermore, we provide a centralized repository on GitHub that contains publicly available resources for various NLP tasks in these languages. This repository can be updated periodically with contributions from other researchers. Our objective is to identify research gaps and disseminate the information to NLP researchers interested in Ethiopian languages and encourage future research in this domain.

en cs.CL
arXiv Open Access 2023
Parallel Corpus for Indigenous Language Translation: Spanish-Mazatec and Spanish-Mixtec

Atnafu Lambebo Tonja, Christian Maldonado-Sifuentes, David Alejandro Mendoza Castillo et al.

In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec corpus for machine translation (MT) tasks, where Mazatec and Mixtec are two indigenous Mexican languages. We evaluated the usability of the collected corpus using three different approaches: transformer, transfer learning, and fine-tuning pre-trained multilingual MT models. Fine-tuning the Facebook M2M100-48 model outperformed the other approaches, with BLEU scores of 12.09 and 22.25 for Mazatec-Spanish and Spanish-Mazatec translations, respectively, and 16.75 and 22.15 for Mixtec-Spanish and Spanish-Mixtec translations, respectively. The findings show that the dataset size (9,799 sentences in Mazatec and 13,235 sentences in Mixtec) affects translation performance and that indigenous languages work better when used as target languages. The findings emphasize the importance of creating parallel corpora for indigenous languages and fine-tuning models for low-resource translation tasks. Future research will investigate zero-shot and few-shot learning approaches to further improve translation performance in low-resource settings. The dataset and scripts are available at https://github.com/atnafuatx/Machine-Translation-Resources

en cs.CL
arXiv Open Access 2022
@C -- augmented version of C programming language

Iosif Iulian Petrila

The augmented version of the C programming language is presented. The language was extended with a series of low-level and high-level facilities to broaden its usage spectrum across various computing systems, operations, and users. Ambiguities and inconsistencies have been resolved by managing problematic and undefined language elements through an interpretation and management similar to that used in other languages based on C syntax. The proposed augmentative elements of the @C approach preserve the spirit of the C language and its basic characteristics through compatibility with the standard version, while also rejuvenating C and bringing it up to the current state of the art in programming languages.

en cs.PL, cs.FL
DOAJ Open Access 2022
Relationships between language types and their traces

Zemfira Şahbazova, Məleykə Mirzəli

According to their philological or typological characteristics, the languages of the world are grouped as follows, as accepted by most linguists: 1. Root languages; 2. Agglutinative languages; 3. Inflected languages; 4. Incorporated languages. Although each group has its own characteristics, they are interconnected and integrated. In this article, attention was paid to the typological divisions of languages in order to show that these divisions are interconnected and integrated. In other words, a number of features of the formation of a sentence as a word or a lexical unit, as in incorporation, were shown in the Azerbaijani language. After the transformation of a sentence into a lexical unit, some transformation events that are considered acceptable in the language (the loss of many features, and the gaining of new features in place of those lost) were identified. Then a conditional structural classification was made according to the components of the sentence, and the division was given according to semantic nuances. Sentences based on materials from the general lexicon and dialect words of the Azerbaijani language were theoretically examined in the article. First, the typological division of languages was discussed, and it was clarified that no sharp boundaries can be drawn between these divisions. The traces of each language type visible in the others show that the connection of thinking with language is due to the sociological and social relations of the members of societies belonging to different language types. It seems to us that this topic requires more scientific research. In general, the relationship between language types and their integration into each other should be specifically investigated.

Language and Literature, Ural-Altaic languages
arXiv Open Access 2021
Differentiable Allophone Graphs for Language-Universal Speech Recognition

Brian Yan, Siddharth Dalmia, David R. Mortensen et al.

Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages. While speech annotations at the language-specific phoneme or surface levels are readily available, annotations at a universal phone level are relatively rare and difficult to produce. In this work, we present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings with learnable weights represented using weighted finite-state transducers, which we call differentiable allophone graphs. By training multilingually, we build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language. These phone-based systems with learned allophone graphs can be used by linguists to document new languages, build phone-based lexicons that capture rich pronunciation variations, and re-evaluate the allophone mappings of seen languages. We demonstrate the aforementioned benefits of our proposed framework with a system trained on 7 diverse languages.
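The probabilistic phone-to-phoneme mapping can be pictured as a softmax over learnable arc weights; the phone inventory and weight values below are invented for illustration and are not the paper's trained WFST:

```python
import math

def softmax(weights):
    """Normalize raw arc weights into a probability distribution."""
    m = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

# Learnable arc weights from the phoneme /t/ to candidate universal phones;
# in a differentiable allophone graph these would be trained end to end.
arc_weights = {"[t]": 2.0, "[th]": 1.0, "[d]": -1.0}
probs = dict(zip(arc_weights, softmax(list(arc_weights.values()))))
```

After normalization, each phoneme carries an interpretable distribution over its candidate surface phones, which is the per-language mapping the abstract describes linguists reading off the trained model.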

en cs.CL, cs.SD
arXiv Open Access 2020
Combining Pretrained High-Resource Embeddings and Subword Representations for Low-Resource Languages

Machel Reid, Edison Marrese-Taylor, Yutaka Matsuo

The contrast between the need for large amounts of data for current Natural Language Processing (NLP) techniques, and the lack thereof, is accentuated in the case of African languages, most of which are considered low-resource. To help circumvent this issue, we explore techniques exploiting the qualities of morphologically rich languages (MRLs), while leveraging pretrained word vectors in well-resourced languages. In our exploration, we show that a meta-embedding approach combining both pretrained and morphologically-informed word embeddings performs best in the downstream task of Xhosa-English translation.
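A minimal sketch of the averaging meta-embedding idea, assuming invented toy vectors for both the pretrained table and the character-n-gram (morphology-informed) table; this is not the authors' exact combination method:

```python
def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, as in subword embedding models."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def subword_vector(word, ngram_table, dim):
    """Average the vectors of the word's known n-grams (morphology-informed)."""
    grams = [g for g in char_ngrams(word) if g in ngram_table]
    if not grams:
        return [0.0] * dim
    return [sum(ngram_table[g][d] for g in grams) / len(grams)
            for d in range(dim)]

def meta_embed(word_vec, sub_vec):
    """Averaging meta-embedding; concatenation or a learned projection
    are common alternatives."""
    return [(a + b) / 2.0 for a, b in zip(word_vec, sub_vec)]

# Invented toy 2-d vectors: one pretrained entry and two n-gram entries
# (the Xhosa word and all numbers here are illustrative assumptions).
pretrained = {"umntu": [0.4, 0.6]}
ngram_table = {"<um": [0.2, 0.0], "ntu": [0.0, 0.2]}
vec = meta_embed(pretrained["umntu"],
                 subword_vector("umntu", ngram_table, dim=2))
```

The n-gram half lets morphologically related forms share information even when a full word is missing from the pretrained vocabulary, which is the property the abstract exploits for morphologically rich languages.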

en cs.CL, cs.LG
arXiv Open Access 2020
Correctness of Sequential Monte Carlo Inference for Probabilistic Programming Languages

Daniel Lundén, Johannes Borgström, David Broman

Probabilistic programming is an approach to reasoning under uncertainty by encoding inference problems as programs. In order to solve these inference problems, probabilistic programming languages (PPLs) employ different inference algorithms, such as sequential Monte Carlo (SMC), Markov chain Monte Carlo (MCMC), or variational methods. Existing research on such algorithms mainly concerns their implementation and efficiency, rather than the correctness of the algorithms themselves when applied in the context of expressive PPLs. To remedy this, we give a correctness proof for SMC methods in the context of an expressive PPL calculus, representative of popular PPLs such as WebPPL, Anglican, and Birch. Previous work has studied correctness of MCMC using an operational semantics, and correctness of SMC and MCMC in a denotational setting without term recursion. However, for SMC inference -- one of the most commonly used algorithms in PPLs as of today -- no formal correctness proof exists in an operational setting. In particular, an open question is whether the resample locations in a probabilistic program affect the correctness of SMC. We solve this fundamental problem, and make four novel contributions: (i) we extend an untyped PPL lambda calculus and operational semantics to include explicit resample terms, expressing synchronization points in SMC inference; (ii) we prove, for the first time, that subject to mild restrictions, any placement of the explicit resample terms is valid for a generic form of SMC inference; (iii) as a result of (ii), our calculus benefits from classic results from the SMC literature: a law of large numbers and an unbiased estimate of the model evidence; and (iv) we formalize the bootstrap particle filter for the calculus and discuss how our results can be further extended to other SMC algorithms.
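The resample step that the calculus makes explicit corresponds, in the simplest case, to multinomial resampling of weighted particles; the particles and weights below are toy values, not the paper's formal semantics:

```python
import random

def resample(particles, weights, rng):
    """Multinomial resampling: draw len(particles) new particles with
    probability proportional to their importance weights."""
    total = sum(weights)
    return rng.choices(particles,
                       weights=[w / total for w in weights],
                       k=len(particles))

# Three particles; the middle one explains the observation far better,
# so after resampling it should dominate the population.
rng = random.Random(0)
particles = ["a", "b", "c"]
weights = [0.01, 5.0, 0.01]
new_particles = resample(particles, weights, rng)
```

Where such synchronization points may be placed inside a program, and whether the placement affects correctness, is exactly the question the paper answers.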

DOAJ Open Access 2020
Focus position in SOV ~ SVO-varying languages – evidence from Enets, Nganasan, and Dolgan

Chris Lasse Däbritz

It is well known that the basic word order pattern of a language is closely intertwined with the syntactic realization of argument focus constituents. SVO languages exhibit a focus position at the sentence’s right periphery; SOV languages exhibit an immediately preverbal focus position. The study at hand examines both the basic word order patterns and the syntactic realization of focus in Enets, Nganasan, and Dolgan. The major outcome is that Nganasan and Dolgan are much more flexible with respect to their basic word order pattern and, in consequence, exhibit both an immediately preverbal focus position and a right-peripheral focus position, whilst Enets realizes argument focus constituents almost exclusively immediately preverbally.

Philology. Linguistics, Finnic. Baltic-Finnic
arXiv Open Access 2019
Multi-Element Long Distance Dependencies: Using SPk Languages to Explore the Characteristics of Long-Distance Dependencies

Abhijit Mahalunkar, John D. Kelleher

In order to successfully model Long Distance Dependencies (LDDs) it is necessary to understand the full range of the characteristics of the LDDs exhibited in a target dataset. In this paper, we use Strictly k-Piecewise languages to generate datasets with various properties. We then compute the characteristics of the LDDs in these datasets using mutual information and analyze the impact of factors such as (i) k, (ii) length of LDDs, (iii) vocabulary size, (iv) forbidden subsequences, and (v) dataset size. This analysis reveals that the number of interacting elements in a dependency is an important characteristic of LDDs. This leads us to the challenge of modelling multi-element long-distance dependencies. Our results suggest that attention mechanisms in neural networks may aid in modeling datasets with multi-element long-distance dependencies. However, we conclude that there is a need to develop more efficient attention mechanisms to address this issue.
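A Strictly k-Piecewise language can be enumerated directly from its forbidden subsequences; the sketch below does this for k = 2 over a toy alphabet (an illustration of the definition, not the authors' dataset generator):

```python
from itertools import product

def contains_subsequence(s, sub):
    """True if sub occurs in s as a (not necessarily contiguous) subsequence."""
    it = iter(s)
    return all(ch in it for ch in sub)  # each `in` consumes the iterator

def sp2_strings(alphabet, forbidden, max_len):
    """Enumerate the Strictly 2-Piecewise language that bans one length-2
    subsequence, up to max_len (brute force, small alphabets only)."""
    out = []
    for n in range(max_len + 1):
        for tup in product(alphabet, repeat=n):
            s = "".join(tup)
            if not contains_subsequence(s, forbidden):
                out.append(s)
    return out

# Over {a, b} with the subsequence "ab" forbidden, the surviving strings
# are exactly those of the form b...ba...a.
lang = sp2_strings("ab", "ab", max_len=3)
```

Because the ban applies to subsequences rather than substrings, a single forbidden pair constrains symbols arbitrarily far apart, which is what makes these languages a controllable testbed for long-distance dependencies.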

en cs.LG, cs.FL
DOAJ Open Access 2019
Editorial Foreword

Crina Leon

The second issue of volume 11 of The Romanian Journal for Baltic and Nordic Studies includes, on the one hand, topics of interest for scholars focusing on Medieval Studies and, on the other hand, further approaches to aspects related to dissent and conformism that were tackled during the tenth annual international conference on Nordic and Baltic studies in Romania. The conference, entitled Dissent versus Conformism in the Nordic, Baltic and Black Sea Areas, was held in Constanţa on June 6-8, 2019.

Finnic. Baltic-Finnic, Social Sciences

Page 4 of 10903