Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States
Jurgita Kapočiūtė-Dzikienė, Toms Bergmanis, Mārcis Pinnis
Although large language models (LLMs) have transformed our expectations of modern language technologies, concerns over data privacy often restrict the use of commercially available LLMs hosted outside of EU jurisdictions. This limits their application in governmental, defence, and other data-sensitive sectors. In this work, we evaluate the extent to which locally deployable open-weight LLMs support lesser-spoken languages such as Lithuanian, Latvian, and Estonian. We examine various size and precision variants of the top-performing multilingual open-weight models, Llama 3, Gemma 2, Phi, and NeMo, on machine translation, multiple-choice question answering, and free-form text generation. The results indicate that while certain models like Gemma 2 perform close to the top commercially available models, many LLMs struggle with these languages. Most surprisingly, however, we find that these models, while showing close to state-of-the-art translation performance, are still prone to lexical hallucinations with errors in at least 1 in 20 words for all open-weight multilingual LLMs.
5 citations
en
Computer Science
Predictability of Microsyntactic Units across Slavic Languages: A translation-based Study
M. Kunilovskaya, Iuliia Zaitova, Wei Xue
et al.
2 citations
en
Computer Science
Identifying Filled Pauses in Speech Across South and West Slavic Languages
Nikola Ljubešić, Ivan Porupski, Peter Rupnik
et al.
Images of Kyiv Region and Volhynia in Travelogue “Across Western Lands, Old and New” by V.L. Kign-Dedlov
S. S. Zhdanov
This study examines the representation of Malorossiya (Little Russia) and Volhynia spaces within the travel narrative “Across Western Lands, Old and New” by V.L. Kign-Dedlov (1856–1908). The research is conducted through a semiotic-imagological analysis to address gaps in contemporary Russian humanities regarding spatial liminality, particularly Ukrainian themes. The novelty lies in analyzing elements of spatial imagery specific to Kyiv region and Volhynia, which have received limited attention from literary scholars thus far. It demonstrates that representations of these regions are characterized by travesty-like forms and contentual conflictuality due to their multicultural nature encompassing Russian, German, Polish, Little Russian, and Jewish identities. Additionally, it identifies a theme of uncertainty and centrality both anthropologically and spatially. This ambiguity is especially evident in depictions of Volhynia as an intermediate topography between Malorossiya and Belarus. Furthermore, the study highlights how the author juxtaposes sacred historical landscapes with provincial and banal present-day realities, emphasizing nostalgia for past glories.
Slavic languages. Baltic languages. Albanian languages
Fine-Tuned Transformer-Based Weighted Soft Voting Ensemble for Persuasion Technique Classification in Slavic Languages
Mahshar Yahan, Sakib Sarker, M. Islam
This paper explores detecting persuasion techniques in Slavic languages using both single transformer models and weighted soft voting ensemble methods. We focused on identifying the presence of persuasion in Bulgarian, Polish, Slovene, and Russian text fragments. We have applied various preprocessing steps to improve model performance. Our experiments show that weighted soft voting ensembles consistently outperform single models in most languages, achieving F1-scores of 0.867 for Bulgarian, 0.902 for Polish, and 0.804 for Russian. For Slovene, the single SlovakBERT model performed best with an F1-score of 0.823, just ahead of the ensemble. These results demonstrate that combining monolingual and multilingual transformer models is effective for robust persuasion detection in low-resource Slavic languages.
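The weighted soft voting named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the example probabilities, and the weights are all hypothetical, and the abstract does not say how the weights were chosen. Each model contributes a class-probability vector, the vectors are averaged with per-model weights, and the class with the highest combined probability is predicted.

```python
from typing import List, Sequence


def weighted_soft_vote(probabilities: Sequence[Sequence[float]],
                       weights: Sequence[float]) -> List[float]:
    """Combine per-model class probabilities into one weighted average.

    probabilities[m][c] is model m's probability for class c.
    weights[m] is the (hypothetical) trust placed in model m.
    """
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(probabilities[0])
    combined = [0.0] * n_classes
    for probs, weight in zip(probabilities, weights):
        for c, p in enumerate(probs):
            combined[c] += weight * p / total
    return combined


def predict(probabilities: Sequence[Sequence[float]],
            weights: Sequence[float]) -> int:
    """Return the index of the class with the highest combined probability."""
    combined = weighted_soft_vote(probabilities, weights)
    return max(range(len(combined)), key=combined.__getitem__)


# Illustrative values only: three models scoring one text fragment
# for the classes [not persuasive, persuasive].
example_probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
example_weights = [3.0, 1.0, 1.0]
print(predict(example_probs, example_weights))
```

With equal weights this reduces to plain soft voting. Raising one model's weight lets a strong monolingual model dominate the multilingual ones, which is one way an ensemble of this kind could be tuned.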
Fine‐Tuned Transformers for Detection and Classification of Persuasion Techniques in Slavic Languages
E. Loginova
This paper details a system developed for the SlavicNLP 2025 Shared Task on the Detection and Classification of Persuasion Techniques in Texts for Slavic Languages. The shared task comprises two subtasks: binary detection of persuasive content within text fragments and multi-class, multi-label identification of specific persuasion techniques at the token level. Our primary approach for both subtasks involved fine-tuning pre-trained multilingual Transformer models. The resulting systems reached an F1 score of 0.92 in paragraph-level detection (ranked third on average). We present our system architecture, data handling, training procedures, and official results, alongside areas for future improvement.
Associative plurals in Slavic languages and beyond
Beccy Lewis
This paper examines the associative plural construction in languages that use the multiplicative plural marker to derive the associative plural (henceforth plural pattern languages) and establishes a novel typological generalization about such languages: all plural pattern languages are either articleless or have affixal articles. To account for this previously unnoticed property of plural pattern languages, a new analysis is presented in which the plural pattern involves incorporation of Num⁰ to the head of a functional projection AssociativeP. It is then argued that associative plurals in South Slavic, which involve plural possessives, also show the plural pattern (i.e., also involve incorporation of the Num head), showing that the current approach can unify cross-linguistically dissimilar associative plurals under one analysis.
Robust Detection of Persuasion Techniques in Slavic Languages via Multitask Debiasing and Walking Embeddings
Ewelina Księżniak, Krzysztof Węcel, Marcin Sawiński
Our approach to Subtask 1 integrates fine-tuned multilingual transformer models with two complementary robustness-oriented strategies: Walking Embeddings and Content-Debiasing. With the first, we tried to understand the change in embeddings when various manipulation techniques were applied. The latter leverages a supervised contrastive objective over semantically equivalent yet stylistically divergent text pairs, generated via GPT-4. We conduct extensive experiments, including 5-fold cross-validation and out-of-domain evaluation, and explore the impact of contrastive loss weighting.
Distance-based approach reveals convergence effects in word order among the languages of the Circum-Baltic linguistic area
Ilja A. Seržant, Berfin Aktas, Maria Ovsjannikova
et al.
We probe a new approach to linguistic areas. Instead of similarity of a feature across languages of the area, we focus on its adaptation to the area. Adaptation is a set of changes and/or retentions in a language towards, but not necessarily into, similarity with the other languages of the area. Technically, we estimate adaptation by comparing the distance between the focus language from the area and a geographically and genealogically closely related language outside of the area (its benchmark language) as tertium comparationis. If the focus language is closer to the area than its benchmark, we interpret it as evidence for adaptation towards the other languages of the area. Adaptation includes all possible scenarios of change and non-change. We test word order and find that all languages of the CB area show effects of adaptation, with Baltic Romani and both Baltic languages being in the center of the area.
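The comparison described in this abstract can be sketched as follows, under stated assumptions. The abstract does not specify the distance metric or the feature encoding, so the Euclidean distance, the word-order proportion vectors, and every name and number below are hypothetical. The sketch only shows the logic: a focus language counts as adapted when it lies closer to the other area languages than its benchmark language does.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors
    (an assumed metric, not taken from the abstract)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_distance_to_area(language: Sequence[float],
                          area_languages: Sequence[Sequence[float]]) -> float:
    """Average distance from one language to the other area languages."""
    return sum(euclidean(language, other) for other in area_languages) / len(area_languages)


def adaptation_score(focus: Sequence[float],
                     benchmark: Sequence[float],
                     area_languages: Sequence[Sequence[float]]) -> float:
    """Positive when the focus language is closer to the area than its benchmark."""
    return (mean_distance_to_area(benchmark, area_languages)
            - mean_distance_to_area(focus, area_languages))


# Hypothetical word-order proportions, e.g. [share of VO, share of OV].
focus_language = [0.70, 0.30]        # a language inside the area
benchmark_language = [0.40, 0.60]    # its close relative outside the area
other_area_languages = [[0.80, 0.20], [0.75, 0.25]]

score = adaptation_score(focus_language, benchmark_language, other_area_languages)
print("evidence for adaptation" if score > 0 else "no evidence for adaptation")
```

A score near zero would mean the focus language and its benchmark are about equally distant from the area, so the word-order feature would give no evidence either way.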
CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation
Nikola Ljubešić, Taja Kuzman
This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole spectrum of official languages in the South Slavic language space. The collection of these corpora comprises a total of 13 billion tokens of texts from 26 million documents. The comparability of the corpora is ensured by a comparable crawling setup and the usage of identical crawling and post-processing technology. All the corpora were linguistically annotated with the state-of-the-art CLASSLA-Stanza linguistic processing pipeline, and enriched with document-level genre information via the Transformer-based multilingual X-GENRE classifier, which further enhances comparability at the level of linguistic annotation and metadata enrichment. The genre-focused analysis of the resulting corpora shows a rather consistent distribution of genres throughout the seven corpora, with variations in the most prominent genre categories being well-explained by the economic strength of each language community. A comparison of the distribution of genre categories across the corpora indicates that web corpora from less developed countries primarily consist of news articles. Conversely, web corpora from economically more developed countries exhibit a smaller proportion of news content, with a greater presence of promotional and opinionated texts.
9 citations
en
Computer Science
Reflexivity patterns in West-Slavic languages
M. Ivanová
The aim of this paper is to analyse differences in reflexive coding in Slovak, Czech and Polish and to evaluate the factors responsible for using phonologically more or less complex reflexive markers. To address this issue, we looked at preferred reflexive coding strategies in Slovak, Czech, and Polish, relying on data extracted from the InterCorp multilingual corpus. The results are then verified by data from monolingual corpora of the investigated languages for one semantic group of verbs labelled as ‘Prevarication’ in FrameNet. The results show that the semantic frame underlying the meaning of lexical items cannot be the only possible explanation for the distribution of reflexive markers but there are also other semantic, syntactic and pragmatic factors playing a pivotal role in reflexive coding strategies, often unique for a given language.
Laughter Closed up in Horror: The Case of the Hungarian Ballad about a Walled-up Woman
Leszek Hensel
The paper is dedicated to a paradox. By approaching the topic of violence, I am actually trying to answer the question of whether and how it is possible to combine it with laughter. I was inspired to adopt this perspective by research on the different types of humorous recycling of one of the most popular works of Hungarian folk literature, the ballad Kőmives Kelemenné [Kelemen Mason’s Wife], which is a story of men’s violence against a woman: the walling-up of the master mason’s wife so that the husband and his team could build a durable castle and receive the agreed payment for their work. The article is based on a rich material base, including both short lexical jokes and longer satires and humorous sketches that have been produced in Hungary since the beginning of the 20th century. In addition to written texts, it refers to humorous drawings published in newspapers and magazines and to versions of the ballad that are present in the internet space. Based on theoretical works by, among others, S. Freud, V. Raskin, R. A. Martin, it seeks an answer to the question of the image of Hungarian society that the analysed materials reflect.
Humor zamknięty w horrorze. Przypadek węgierskiej ballady o zamurowanej kobiecie
The article is devoted to a certain paradox. Taking up the topic of violence, it in fact tries to answer the question of whether and how it can be combined with laughter. This perspective is prompted by research on the various kinds of humorous recycling of one of the most popular works of Hungarian folk literature, the ballad Kőmives Kelemenné [Kelemen the Mason’s Wife]. The ballad tells of the walling-up of a master mason’s wife so that the husband and his co-workers could build a durable castle and receive a high payment for their work. The article draws on a rich body of material, including both short lexical jokes and longer satires and humorous sketches produced in Hungary from the beginning of the 20th century to the present day. In addition to written texts, it refers to humorous drawings published in newspapers and magazines and to versions of the ballad present on the internet. Taking as its basis theoretical works by, among others, S. Freud, V. Raskin, and R. A. Martin, the author seeks an answer to the question of what image of Hungarian society the analysed materials reflect.
Ethnology. Social and cultural anthropology, Slavic languages. Baltic languages. Albanian languages
Sakha Language in a Northern Industrial Monotown: Linguistic Distribution in Speech Repertoire of Sakha People
I. A. Danilov
This study explores the functioning of the Yakut language in the conditions of the northern industrial monotown Mirny (Republic of Sakha (Yakutia)). The study identifies the features of linguistic distribution in the speech repertoire of ethnic Sakha based on data from a sociolinguistic survey and interviews (n=279). Descriptive statistical methods and content analysis are employed. The results reveal an asymmetric nature of Yakut-Russian bilingualism, with Russian predominating in public communication. While Sakha individuals exhibit a high level of language competence in the Yakut language, its application is limited. In familial and friendly communication, the Yakut language maintains a significant position; however, its demand is minimal in the professional-business sphere, especially in industry. Among individuals with higher education, the percentage of Russian speakers at work reaches 68.34% compared to 55.77% among those with only secondary vocational education. The Russian language dominates in management, service sectors, and law enforcement (75–100%). Only in fields such as healthcare, education, and culture does the Yakut language maintain a strong position. The native language is primarily perceived by Sakha as a symbolic marker of identity, yielding to Russian in social prestige and communicative power. The vitality prospects of the Yakut language in the city are assessed ambiguously and are linked to the effectiveness of language policy in key institutional contexts (education, media, government).
Slavic languages. Baltic languages. Albanian languages
SCIENTIFIC APPROACH OF PROF. A.G. SHIROKOVA AND COMPARATIVE STUDIES OF THE GRAMMAR OF SLAVIC LANGUAGES
L. Karpenko
The article deals with the contribution of Prof. A.G. Shirokova to the comparative study of Slavic languages. Her approach, based on the study of functional equivalence of languages, makes it possible to investigate similarities and differences in the ways of expressing the same meaningful categories and to reveal the national-linguistic specificity of their functioning. The features of grammatical conceptualization of the category of spatiality in Russian against the background of Bulgarian are shown. The emphasized expression of spatiality in the Russian language is manifested in the systematic organization of spatial prepositional and case forms, in the prevalence of everyday sentences with spatial determinants, in the productivity of spatial derivational models, and in the redundancy of spatial semantics units in speech.
Gulag Literature in Italy (1991-2022). A Few Notes on Editions and Re-editions of a 'Minimal Canon'
Maurizia Calusio
The paper deals with the publishing history in the post-Soviet era of three very prominent works in Gulag literature in Italian translation – Life and Fate by Vasilij Grossman, Kolyma Tales by Varlam Šalamov and The Gulag Archipelago by Aleksandr Solženicyn – in order to reconstruct their Italian reception in the years 1991-2022. The controversies surrounding the publication of these works in Italy are also discussed, and it is noted how the Italian press has paid particular attention to them, with major Italian writers reviewing the editions and contributing to their Italian reception. Finally, the lack of critical reflection by Italian Russianists on Gulag literature is discussed.
History of Eastern Europe, Slavic languages. Baltic languages. Albanian languages
Celebrating Third Anniversary of October Revolution in Altai Province in 1920
K. A. Tishkina
This article examines the process of organizing celebratory events for the third anniversary of the October Revolution in the Altai Province. The focus is on the preparatory work, including the planning of festive activities, the coordination of themes for propaganda work, and the distribution of red fabric for making flags, among other things. The relevance of the study lies not only in the interest in the formation of the Soviet holiday calendar but also in the everyday life of the population in Siberia under the policy of ‘prodrazverstka’. By drawing on a wide range of sources, the article identifies the peculiarities of celebrating the third anniversary of the October Revolution, taking into account differences between urban and rural areas. It is determined that in cities, celebratory events were mass-oriented and covered all segments of the population, including parades, performances, and concerts. In rural areas, there were often no events due to weak propaganda work. Visiting individual graves and communal burial sites of those who died fighting for Soviet power became an important element of the holiday. The article concludes that celebrating the anniversary of the October Revolution in Siberia was a key aspect of the propaganda policy in favor of the Soviet state.
Slavic languages. Baltic languages. Albanian languages
When Do the Leaves Fall and When Do We Reap? – Semantic Analysis of Folk Month Names in the Languages of Eastern Europe
Krešimir Sučević Međeral
The abundance of folk month names in different languages of Eastern Europe results in specific mental pictures in a speaker's mind. Translation usually fails to communicate the elaborate connotations implied, even in the case of closely related languages. The paper gives a comparative semantical analysis of folk month names in the Slavic languages, the Baltic languages, Finnish, Estonian, Hungarian, Romanian and Albanian, identifying the common features, the differences and possible reasons for both. The aim of the paper is to provide data for both linguistic and ethnological mapping of Eastern Europe, as an area that had been under less influence from the Latin-dominated culture and had in that way better preserved some original, less uniform linguistic features.
Round Table, Devoted to the Discussion of the Museum Exposition on the Novel The Brothers Karamazov in Staraya Russa, Held on May 24th, 2021 during the 36th International Readings in Staraya Russa “Dostoevsky and Contemporary Age”, titled “The Novel The Brothers Karamazov in the 21st Century: Interpretations, Resonances, Comparisons”
Tatiana A. Kasatkina, Yulia V. Yukhnovich
The publication contains the record of the round table discussion “The Novel The Brothers Karamazov in the 21st Century: Interpretations, Resonances, Comparisons”, focused on the discussion of the Museum exposition dedicated to the novel The Brothers Karamazov in Staraya Russa, which took place on May 24th, 2021, in the framework of the 36th International Dostoevsky Readings “Dostoevsky and Contemporary Age”.
Slavic languages. Baltic languages. Albanian languages
INDO-EUROPEAN MUSICAL IDIOM AND INDO-EUROPEAN ETHNOGENESIS
N. Nazarov
This article for the first time proposes a methodological bridge between comparative and historical linguistics, classical philology (on the one hand) and ethnomusicology (on the other hand). Thus, it is possible to verify the results obtained independently in various fields of humanities of the 20th century. The present article compares rhythmical patterns of songs with explicit ritual use from different ethnic groups speaking Indo-European languages (Slavic, Baltic, Greek, Albanian, Iranian, and others). In these songs, there are two main types of stable rhythm patterns (0101/0100 and 011/010) which correlate with two main types of Indo-European poetic metrical patterns independently reconstructed by Indo-European linguistics. It suggests that folk songs preserved a range of possible Indo-European rhythmic patterns, even though languages have changed prosodic properties over time. The present study follows the track of previous studies, which showed partial parallels in Baltic, Slavic-Balkan, and Indo-Aryan song patterns, which loans could not explain. Further analysis of rhythmic patterns revealed their connections with prosodic and syntactic properties of the Indo-European sentence. The relative rhythmic complexity of a possible Indo-European musical idiom may be correlated to the rhythmic complexity of textiles that may be connected to some Indo-European speaking ethnic groups. The parallels outside the Indo-European cultural range show possible Uralic and Caucasian components of the Indo-European musical idiom, which may indicate components of the ethnogenesis of Indo-European ethnic groups. Thus ethnomusicology may be regarded as an independent source for studying prehistory, along with linguistics, archeology, and genetics. Though proposed in the 20th century by B. Bartok, V. Goshovsky, G. Luko, this program of ethnomusicological research has not yet been implemented.
Puhrepuhr (Grandchild) in the Shahpur’s Inscription at Haji-Abad by Comparison with Mesopotamian Texts
B. Sam, M. Naghizadeh, Sara Mohammadi Avandi
et al.