Results for "Slavic languages. Baltic languages. Albanian languages"

S2 Open Access 2025

Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States

Jurgita Kapočiūtė-Dzikienė, Toms Bergmanis, Mārcis Pinnis

Although large language models (LLMs) have transformed our expectations of modern language technologies, concerns over data privacy often restrict the use of commercially available LLMs hosted outside of EU jurisdictions. This limits their application in governmental, defence, and other data-sensitive sectors. In this work, we evaluate the extent to which locally deployable open-weight LLMs support lesser-spoken languages such as Lithuanian, Latvian, and Estonian. We examine various size and precision variants of the top-performing multilingual open-weight models, Llama 3, Gemma 2, Phi, and NeMo, on machine translation, multiple-choice question answering, and free-form text generation. The results indicate that while certain models like Gemma 2 perform close to the top commercially available models, many LLMs struggle with these languages. Most surprisingly, however, we find that these models, while showing close to state-of-the-art translation performance, are still prone to lexical hallucinations with errors in at least 1 in 20 words for all open-weight multilingual LLMs.

5 citations en Computer Science

S2 Open Access 2025

Predictability of Microsyntactic Units across Slavic Languages: A translation-based Study

M. Kunilovskaya, Iuliia Zaitova, Wei Xue et al.

2 citations en Computer Science

S2 Open Access 2025

Identifying Filled Pauses in Speech Across South and West Slavic Languages

Nikola Ljubešić, Ivan Porupski, Peter Rupnik et al.

1 citation en

DOAJ Open Access 2025

Images of Kyiv Region and Volhynia in Travelogue “Across Western Lands, Old and New” by V.L. Kign-Dedlov

S. S. Zhdanov

This study examines the representation of Malorossiya (Little Russia) and Volhynia spaces within the travel narrative “Across Western Lands, Old and New” by V.L. Kign-Dedlov (1856–1908). The research is conducted through a semiotic-imagological analysis to address gaps in contemporary Russian humanities regarding spatial liminality, particularly Ukrainian themes. The novelty lies in analyzing elements of spatial imagery specific to Kyiv region and Volhynia, which have received limited attention from literary scholars thus far. It demonstrates that representations of these regions are characterized by travesty-like forms and contentual conflictuality due to their multicultural nature encompassing Russian, German, Polish, Little Russian, and Jewish identities. Additionally, it identifies a theme of uncertainty and centrality both anthropologically and spatially. This ambiguity is especially evident in depictions of Volhynia as an intermediate topography between Malorossiya and Belarus. Furthermore, the study highlights how the author juxtaposes sacred historical landscapes with provincial and banal present-day realities, emphasizing nostalgia for past glories.

Slavic languages. Baltic languages. Albanian languages

arXiv Open Access 2025

Neural Network Verification is a Programming Language Challenge

Lucas C. Cordeiro, Matthew L. Daggitt, Julien Girard-Satabin et al.

Neural network verification is a new and rapidly developing field of research. So far, the main priority has been establishing efficient verification algorithms and tools, while proper support from the programming language perspective has been considered secondary or unimportant. Yet, there is mounting evidence that insights from the programming language community may make a difference in the future development of this domain. In this paper, we formulate neural network verification challenges as programming language challenges and suggest possible future solutions.

en cs.PL, cs.LG

arXiv Open Access 2025

Language Generation: Complexity Barriers and Implications for Learning

Marcelo Arenas, Pablo Barceló, Luis Cofré et al.

Kleinberg and Mullainathan showed that language generation in the limit is always possible at the level of computability: given enough positive examples, a learner can eventually generate data indistinguishable from a target language. However, such existence results do not address feasibility. We study the sample complexity of language generation in the limit for several canonical classes of formal languages. Our results show that infeasibility already appears for context-free and regular languages, and persists even for strict subclasses such as locally threshold testable languages, as well as for incomparable classes such as non-erasing pattern languages, a well-studied class in the theory of language identification. Overall, our results establish a clear gap between the theoretical possibility of language generation in the limit and its computational feasibility.

en cs.CL, cs.AI

arXiv Open Access 2025

JGU Mainz's Submission to the WMT25 Shared Task on LLMs with Limited Resources for Slavic Languages: MT and QA

Hossain Shaikh Saadi, Minh Duc Bui, Mario Sanz-Guerrero et al.

This paper presents the JGU Mainz submission to the WMT25 Shared Task on LLMs with Limited Resources for Slavic Languages: Machine Translation and Question Answering, focusing on Ukrainian, Upper Sorbian, and Lower Sorbian. For each language, we jointly fine-tune a Qwen2.5-3B-Instruct model for both tasks with parameter-efficient finetuning. Our pipeline integrates additional translation and multiple-choice question answering (QA) data. For Ukrainian QA, we further use retrieval-augmented generation. We also apply ensembling for QA in Upper and Lower Sorbian. Experiments show that our models outperform the baseline on both tasks.

en cs.CL

arXiv Open Access 2025

Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language

Jesus Alvarez C, Daua D. Karajeanes, Ashley Celeste Prado et al.

The digital exclusion of endangered languages remains a critical challenge in NLP, limiting both linguistic research and revitalization efforts. This study introduces the first computational investigation of Comanche, an Uto-Aztecan language on the verge of extinction, demonstrating how minimal-cost, community-informed NLP interventions can support language preservation. We present a manually curated dataset of 412 phrases, a synthetic data generation pipeline, and an empirical evaluation of GPT-4o and GPT-4o-mini for language identification. Our experiments reveal that while LLMs struggle with Comanche in zero-shot settings, few-shot prompting significantly improves performance, achieving near-perfect accuracy with just five examples. Our findings highlight the potential of targeted NLP methodologies in low-resource contexts and emphasize that visibility is the first step toward inclusion. By establishing a foundation for Comanche in NLP, we advocate for computational approaches that prioritize accessibility, cultural sensitivity, and community engagement.

en cs.CL, cs.LG

S2 Open Access 2025

Fine-Tuned Transformer-Based Weighted Soft Voting Ensemble for Persuasion Technique Classification in Slavic Languages

Mahshar Yahan, Sakib Sarker, M. Islam

This paper explores detecting persuasion techniques in Slavic languages using both single transformer models and weighted soft voting ensemble methods. We focused on identifying the presence of persuasion in Bulgarian, Polish, Slovene, and Russian text fragments. We have applied various preprocessing steps to improve model performance. Our experiments show that weighted soft voting ensembles consistently outperform single models in most languages, achieving F1-scores of 0.867 for Bulgarian, 0.902 for Polish, and 0.804 for Russian. For Slovene, the single SlovakBERT model performed best with an F1-score of 0.823, just ahead of the ensemble. These results demonstrate that combining monolingual and multilingual transformer models is effective for robust persuasion detection in low-resource Slavic languages.

en

S2 Open Access 2025

Fine‐Tuned Transformers for Detection and Classification of Persuasion Techniques in Slavic Languages

E. Loginova

This paper details a system developed for the SlavicNLP 2025 Shared Task on the Detection and Classification of Persuasion Techniques in Texts for Slavic Languages. The shared task comprises two subtasks: binary detection of persuasive content within text fragments and multi-class, multi-label identification of specific persuasion techniques at the token level. Our primary approach for both subtasks involved fine-tuning pre-trained multilingual Transformer models. The resulting systems reached an F1 score of 0.92 in paragraph-level detection (ranked third on average). We present our system architecture, data handling, training procedures, and official results, alongside areas for future improvement.

en

S2 Open Access 2025

Associative plurals in Slavic languages and beyond

Beccy Lewis

This paper examines the associative plural construction in languages that use the multiplicative plural marker to derive the associative plural (henceforth plural pattern languages) and establishes a novel typological generalization about such languages: all plural pattern languages are either articleless or have affixal articles. To account for this previously unnoticed property of plural pattern languages, a new analysis is presented in which the plural pattern involves incorporation of Num⁰ to the head of a functional projection AssociativeP. It is then argued that associative plurals in South Slavic, which involve plural possessives, also show the plural pattern (i.e., also involve incorporation of the Num head), showing that the current approach can unify cross-linguistically dissimilar associative plurals under one analysis.

en

S2 Open Access 2025

Robust Detection of Persuasion Techniques in Slavic Languages via Multitask Debiasing and Walking Embeddings

Ewelina Księżniak, Krzysztof Węcel, Marcin Sawiński

Our approach to Subtask 1 integrates fine-tuned multilingual transformer models with two complementary robustness-oriented strategies: Walking Embeddings and Content-Debiasing. With the first, we tried to understand the change in embeddings when various manipulation techniques were applied. The latter leverages a supervised contrastive objective over semantically equivalent yet stylistically divergent text pairs, generated via GPT-4. We conduct extensive experiments, including 5-fold cross-validation and out-of-domain evaluation, and explore the impact of contrastive loss weighting.

en

S2 Open Access 2025

Distance-based approach reveals convergence effects in word order among the languages of the Circum-Baltic linguistic area

Ilja A. Seržant, Berfin Aktas, Maria Ovsjannikova et al.

We probe a new approach to linguistic areas. Instead of similarity of a feature across languages of the area, we focus on its adaptation to the area. Adaptation is a set of changes and/or retentions in a language towards, but not necessarily into, similarity with the other languages of the area. Technically, we estimate adaptation by comparing the distance between the focus language from the area and a geographically and genealogically closely related language outside of the area (its benchmark language) as tertium comparationis. If the focus language is closer to the area than its benchmark, we interpret it as evidence for adaptation towards the other languages of the area. Adaptation includes all possible scenarios of change and non-change. We test word order and find that all languages of the CB area show effects of adaptation, with Baltic Romani and both Baltic languages being in the center of the area.

en

S2 Open Access 2024

CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation

Nikola Ljubešić, Taja Kuzman

This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole spectrum of official languages in the South Slavic language space. The collection of these corpora comprises a total of 13 billion tokens of texts from 26 million documents. The comparability of the corpora is ensured by a comparable crawling setup and the usage of identical crawling and post-processing technology. All the corpora were linguistically annotated with the state-of-the-art CLASSLA-Stanza linguistic processing pipeline, and enriched with document-level genre information via the Transformer-based multilingual X-GENRE classifier, which further enhances comparability at the level of linguistic annotation and metadata enrichment. The genre-focused analysis of the resulting corpora shows a rather consistent distribution of genres throughout the seven corpora, with variations in the most prominent genre categories being well-explained by the economic strength of each language community. A comparison of the distribution of genre categories across the corpora indicates that web corpora from less developed countries primarily consist of news articles. Conversely, web corpora from economically more developed countries exhibit a smaller proportion of news content, with a greater presence of promotional and opinionated texts.

9 citations en Computer Science

S2 Open Access 2024

Reflexivity patterns in West-Slavic languages

M. Ivanová

The aim of this paper is to analyse differences in reflexive coding in Slovak, Czech and Polish and to evaluate the factors responsible for using phonologically more or less complex reflexive markers. To address this issue, we looked at preferred reflexive coding strategies in Slovak, Czech, and Polish, relying on data extracted from the InterCorp multilingual corpus. The results are then verified by data from monolingual corpora of the investigated languages for one semantic group of verbs labelled as ‘Prevarication’ in FrameNet. The results show that the semantic frame underlying the meaning of lexical items cannot be the only explanation for the distribution of reflexive markers; other semantic, syntactic and pragmatic factors, often unique to a given language, also play a pivotal role in reflexive coding strategies.

1 citation en

DOAJ Open Access 2024

Laughter Closed up in Horror: The Case of the Hungarian Ballad about a Walled-up Woman

Leszek Hensel

The paper is dedicated to a paradox. By approaching the topic of violence, I am actually trying to answer the question of whether and how it is possible to combine it with laughter. I was inspired to adopt this perspective by research on the different types of humorous recycling of one of the most popular works of Hungarian folk literature, the ballad Kőmives Kelemenné [Kelemen Mason’s Wife], which is a story of men’s violence against a woman: the walling-up of the master mason’s wife so that the husband and his team could build a durable castle and receive the agreed payment for their work. The article is based on a rich material base, including both short lexical jokes and longer satires and humorous sketches that have been produced in Hungary since the beginning of the 20th century. In addition to written texts, it refers to humorous drawings published in newspapers and magazines and to versions of the ballad that are present in the internet space. Based on theoretical works by, among others, S. Freud, V. Raskin, R. A. Martin, it seeks an answer to the question of the image of Hungarian society that the analysed materials reflect.

Ethnology. Social and cultural anthropology, Slavic languages. Baltic languages. Albanian languages

DOAJ Open Access 2024

Sakha Language in a Northern Industrial Monotown: Linguistic Distribution in Speech Repertoire of Sakha People

I. A. Danilov

This study explores the functioning of the Yakut language in the conditions of the northern industrial monotown Mirny (Republic of Sakha (Yakutia)). The study identifies the features of linguistic distribution in the speech repertoire of ethnic Sakha based on data from a sociolinguistic survey and interviews (n=279). Descriptive statistical methods and content analysis are employed. The results reveal an asymmetric nature of Yakut-Russian bilingualism with Russian language predominance in public communication. While Sakha individuals exhibit a high level of language competence in the Yakut language, its application is limited. In familial and friendly communication, the Yakut language maintains significant positions; however, its demand is minimal in the professional-business sphere, especially in industry. Among individuals with higher education, the percentage of Russian speakers at work reaches 68.34% compared to 55.77% among those with only secondary vocational education. Russian language dominates in management, service sectors, and law enforcement (75–100%). Only in fields such as healthcare, education, and culture does the Yakut language maintain strong positions. The native language is primarily perceived by Sakha as a symbolic marker of identity, yielding to Russian in social prestige and communicative power. The vitality prospects of the Yakut language in the city are assessed ambiguously and are linked to the effectiveness of language policy in key institutional contexts (education, media, government).

Slavic languages. Baltic languages. Albanian languages

arXiv Open Access 2024

Anti-Context-Free languages

Carles Cardó

Context-free languages can be characterized in several ways. This article studies projective linearisations of languages of simple dependency trees, i.e., dependency trees in which a node can govern at most one node with a given syntactic function. We prove that the projective linearisations of local languages of simple dependency trees coincide with the context-free languages. Simple dependency trees suggest alternative dual notions of locality and projectivity, which permits defining a dual language for each context-free language. We call this new class of languages anti-context-free. These languages are related to some linguistic constructions exhibiting the so-called cross-serial dependencies that were historically important for the development of computational linguistics. We propose that this duality could be a relevant linguistic phenomenon.

en cs.FL

S2 Open Access 2024

Scientific Approach of Prof. A.G. Shirokova and Comparative Studies of the Grammar of Slavic Languages

L. Karpenko

The article deals with the contribution to the comparative study of Slavic languages of Prof. A.G. Shirokova. Her approach, based on the study of functional equivalence of languages, makes it possible to investigate similarities and differences in the ways of expressing the same meaningful categories, to reveal the national-linguistic specificity of their functioning. The features of grammatical conceptualization of the category of spatiality in Russian against the background of Bulgarian are shown. The emphasized expression of spatiality in Russian language is manifested in the systematic organization of spatial prepositional and case forms, in the prevalence of everyday sentences with spatial determinants, in the productivity of spatial derivational models, in the redundancy of spatial semantics units in speech.

en

DOAJ Open Access 2023

Gulag Literature in Italy (1991-2022). A Few Notes on Editions and Re-editions of a 'Minimal Canon'

Maurizia Calusio

The paper deals with the publishing history in the post-Soviet era of three very prominent works in Gulag literature in Italian translation – Life and Fate by Vasilij Grossman, Kolyma Tales by Varlam Šalamov and The Gulag Archipelago by Aleksandr Solženicyn – in order to reconstruct their Italian reception in the years 1991-2022. The controversies surrounding the publication of these works in Italy are also discussed, and it is noted how the Italian press has paid particular attention to them, with major Italian writers reviewing the editions and contributing to their Italian reception. Finally, the lack of critical reflection by Italian Russianists on Gulag literature is discussed.

History of Eastern Europe, Slavic languages. Baltic languages. Albanian languages
