Results for "Other Finnic languages and dialects"

arXiv Open Access 2026

Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect

Minh Duc Bui, Manuel Mager, Peter Herbert Kann et al.

Meenzerisch, the dialect spoken in the German city of Mainz, is also the traditional language of the Mainz carnival, a yearly celebration well known throughout Germany. However, Meenzerisch is on the verge of dying out, a fate it shares with many other German dialects. Natural language processing (NLP) has the potential to help with the preservation and revival efforts of languages and dialects. However, so far no NLP research has looked at Meenzerisch. This work presents the first research in the field of NLP that is explicitly focused on the dialect of Mainz. We introduce a digital dictionary, an NLP-ready dataset derived from an existing resource (Schramm, 1966), to support researchers in modeling and benchmarking the language. It contains 2,351 words in the dialect paired with their meanings described in Standard German. We then use this dataset to answer the following research questions: (1) Can state-of-the-art large language models (LLMs) generate definitions for dialect words? (2) Can LLMs generate words in Meenzerisch, given their definitions? Our experiments show that LLMs can do neither: the best model for definitions reaches only 6.27% accuracy and the best word generation model's accuracy is 1.51%. We then conduct two additional experiments in order to see if accuracy is improved by few-shot learning and by extracting rules from the training set, which are then passed to the LLM. While those approaches are able to improve the results, accuracy remains below 10%. This highlights that additional resources and an intensification of research efforts focused on German dialects are desperately needed.

en cs.CL

arXiv Open Access 2026

Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties

Akriti Dhasmana, Aarohi Srivastava, David Chiang

We conduct an empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech across a wide range of Indic dialects and language varieties. Our results indicate that although ASR performance is generally improved with reduced phylogenetic distance between languages, this factor alone does not fully explain performance in dialectal settings. Often, fine-tuning on smaller amounts of dialectal data yields performance comparable to fine-tuning on larger amounts of phylogenetically-related, high-resource standardized languages. We also present a case study on Garhwali, a low-resource Pahari language variety, and evaluate multiple contemporary ASR models. Finally, we analyze transcription errors to examine bias toward pre-training languages, providing additional insight into challenges faced by ASR systems on dialectal and non-standardized speech.

en cs.CL

DOAJ Open Access 2026

Suur-Soome varjus. Hinnangutest Ilmari Mannineni teadustööle

Ildikó Lehtinen

In the shadow of Greater Finland. The framework of Ilmari Manninen’s scholarly work in Finland

The objective of this article is to examine Ilmari Manninen’s position within Finnish ethnology on the basis of archival sources. The material consists mainly of Manninen’s correspondence, and the methodological approach is a close re-reading of his scholarly work. Manninen spent six years in Estonia, where he adapted well to the university and museum life in Tartu. The opening of the Estonian National Museum was a milestone and a testament to the success of the working team. Manninen’s return to Finnish ethnology was not self-evident. Although he maintained contact with Finland through A. M. Tallgren and Kaarle Krohn, he was regarded as an outsider by the Finnish National Museum and the University of Helsinki. By applying the historical research concept of the community of experience, this article situates Manninen within the field of Finnish ethnography. Manninen’s life’s work was evaluated in the context of his application for a professorship. His work in Estonia and his contribution as Head of the Ethnographic Department at the National Museum of Finland seem to have been acknowledged. However, his scholarly research on Estonian folk culture does not appear to have received the same level of recognition. Manninen’s research focused on Estonia and Karelia. Albert Hämäläinen’s research, on the other hand, concentrated on Western and Central Finland. This raises the question of whether it may have been a matter of a confrontation between East and West.

Other Finnic languages and dialects

S2 Open Access 2026

Sugulussõnavara mitmekesisus ja levikumustrid eesti regilauludes

Helina Harend

The diversity and distribution patterns of kinship vocabulary in Estonian runosongs

This article examines the core kinship vocabulary found in Estonian runosongs, focusing on the distribution of stems and stem variants of four key kinship terms: mother, father, sister, and brother. The kinship vocabulary in Estonian runosongs displays remarkable diversity, particularly in the terms denoting female and male parents. Among the analyzed terms, those referring to mother are the most frequent. In contrast, terms for sister and brother exhibit limited variation, represented by only two or three distinct stems. The regional distribution pattern reveals clear differences between the North and South Estonian language areas. The coastal regions of western and northern Estonia stand out – the terms eit (mother) and taat (father) occur more frequently in runosongs from these areas. The analysis highlights a consistent differentiation between North and South Estonian across all the studied terms. However, in the case of eit/ema and taat/isa, this division does not align with the current main dialect boundary but lies further north. The distribution of õde/sõsar (sister) and vend/veli (brother), on the other hand, corresponds more closely to the historical North and South Estonian dialect boundary observed in contemporary dialects. Kinship terms with a broader Finnic distribution have been better preserved in South Estonian runosongs.

en

arXiv Open Access 2025

Dialect Normalization using Large Language Models and Morphological Rules

Antonios Dimakis, John Pavlopoulos, Antonios Anastasopoulos

Natural language understanding systems struggle with low-resource languages, including many dialects of high-resource ones. Dialect-to-standard normalization attempts to tackle this issue by transforming dialectal text so that it can be used by standard-language tools downstream. In this study, we tackle this task by introducing a new normalization method that combines rule-based linguistically informed transformations and large language models (LLMs) with targeted few-shot prompting, without requiring any parallel data. We implement our method for Greek dialects and apply it on a dataset of regional proverbs, evaluating the outputs using human annotators. We then use this dataset to conduct downstream experiments, finding that previous results regarding these proverbs relied solely on superficial linguistic information, including orthographic artifacts, while new observations can still be made through the remaining semantics.

en cs.CL

arXiv Open Access 2025

Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study

Eeham Khan, Firas Saidani, Owen Van Esbroeck et al.

Despite the widespread adoption of Large Language Models (LLMs), their strongest capabilities remain largely confined to a small number of high-resource languages for which there is abundant training data. Recently, continual pre-training (CPT) has emerged as a means to fine-tune these models to low-resource regional dialects. In this paper, we study the use of CPT for dialect learning under tight data and compute budgets. Using low-rank adaptation (LoRA) and compute-efficient continual pre-training, we adapt three LLMs to the Québec French dialect using a very small dataset and benchmark them on the COLE suite. Our experiments demonstrate an improvement on the minority dialect benchmarks with minimal regression on the prestige language benchmarks with around 1% of model parameters updated. Analysis of the results demonstrates that gains are highly contingent on corpus composition. These findings indicate that CPT with parameter-efficient fine-tuning (PEFT) can narrow the dialect gap by providing cost-effective and sustainable language resource creation, expanding high-quality LLM access to minority linguistic communities. To support reproducibility and broaden access, we release the first Québec French LLMs on Hugging Face.

en cs.CL, cs.AI

arXiv Open Access 2025

A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions

Nasser A Alsadhan

Arabic is one of the oldest languages still in use today. As a result, several Arabic-speaking regions have developed dialects that are unique to them. Dialect and emotion recognition have various uses in Arabic text analysis, such as determining an online customer's origin based on their comments. Furthermore, intelligent chatbots that are aware of a user's emotions can respond appropriately to the user. Current research in emotion detection in the Arabic language lacks awareness of how emotions are exhibited in different dialects, which motivates the work found in this study. This research addresses the problems of dialect and emotion classification in Arabic. Specifically, this is achieved by building a novel framework that can identify and predict Arabic dialects and emotions from a given text. The framework consists of three modules: a text-preprocessing module, a classification module, and a clustering module with the novel capability of building new dialect-aware emotion lexicons. The proposed framework generated a new emotional lexicon for different dialects. It achieved an accuracy of 88.9% in classifying Arabic dialects, which outperforms the state-of-the-art results by 6.45 percentage points. Furthermore, the framework achieved 89.1% and 79% accuracy in detecting emotions in the Egyptian and Gulf dialects, respectively.

en cs.CL, cs.LG

arXiv Open Access 2025

Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis

Omar A. Essameldin, Ali O. Elbeih, Wael H. Gomaa et al.

The Arabic language is among the most popular languages in the world with a huge variety of dialects spoken in 22 countries. In this study, we address the problem of classifying 18 Arabic dialects of the QADI dataset of Arabic tweets. RNN models, Transformer models, and large language models (LLMs) via prompt engineering are created and tested. Among these, MARBERTv2 performed best with 65% accuracy and 64% F1-score. Through the use of state-of-the-art preprocessing techniques and the latest NLP models, this paper identifies the most significant linguistic issues in Arabic dialect identification. The results corroborate applications like personalized chatbots that respond in users' dialects, social media monitoring, and greater accessibility for Arabic communities.

en cs.CL, cs.AI

DOAJ Open Access 2025

Andmepõhine vaade tärkavale eesti kirjakogukonnale 1800–1940. Rahvusbibliograafia, andmete rikastamine ja rahvakoolid

Peeter Tinits

A data-intensive view on the emerging Estonian written language community, 1800-1940: National bibliography, data enrichment, and public schools

The Estonian National Bibliography aims to provide a comprehensive overview of all printed works published in Estonia, by Estonians, or about Estonians. Bibliographic data science has suggested that national bibliographies can be used to study cultural history using the tools of digital humanities. In this article, I explore the historical details surrounding the emergence of the Estonian written language community, with a focus on the mid- to late 19th century. I introduce the Estonian National Bibliography as a data source and trace the emergence of the Estonian written language community within its records. Specifically, I look at changes in the status of the Estonian language within its language area, the backgrounds of the contributors to book publishing, and the possible influence of public schools. I enrich these analyses by incorporating other biographical, demographic, geographic, and historical datasets. Over time, books published in the Estonian language area began to rely more on Estonian, building on an earlier German-language tradition. Despite the early importance of multilingual contributors, the main contributors soon began working mostly in Estonian. From the early 19th century through the National Awakening, the average age of new contributors decreased by around 10 years, reflecting the steady influx of younger educated people. The analysis of the contributors’ birth backgrounds reveals that dialectal diversity among community members gradually increased, with some local variation. For a few decades around the time of the National Awakening, being born in the Governorate of Livonia doubled one’s chances of joining the written language community. This can be partly explained by differences in educational policy that led the Governorate of Estonia to fall behind in expanding its public school network.
This article demonstrates how national bibliographic data can be used to study cultural history. Combining it with other data sources enables researchers to address questions from different disciplines – for example, assessing the impact of historical education policies or investigating shifts in dialectal diversity within a language community.

Other Finnic languages and dialects

DOAJ Open Access 2025

obadus ja plinder

Lembit Vaba

obadus and plinder

Quite a blow. According to Julius Mägiste, obadus ’metal loop’ is a Russian loanword; cf. Russian óбод ’hoop, ring; arched fastening’ and dialectal обóдь ’ring, hook, clasp, bow’. The article notes that the meaning ’blow, hit’ developed in the semantic field of obadus within the Estonian linguistic context, where many fastening devices have acquired a secondary sense ’blow, hit’. Examples include haak ’hook, part of a latch or fastening’ ⇒ ’a boxing hook delivered from below or the side with a bent arm’; kiil ’iron or wooden wedge used for joining or sealing objects or their parts’ ⇒ ’blow, hit’; kõrvakiil ’a slap, a hit to the ear with the hand’; põõn ’a cross-board used to join and reinforce wooden items’ ⇒ ’blow, hit’.

In a bind. The article argues that the colloquial plinder ’difficult situation, bind, predicament, deprivation’ is a loan from the German dialect word Splinter ’(wood) splinter, sliver’, whose more widespread Standard German form is Splitter (without the nasal in the stem). From plinder, the shortened, irregular dialectal form plinn is likely to have developed. The semantics of lack or deprivation are reflected in the German adjectives splitternack, resp. splinternackt, splinternackend, etc. ’naked, bare(-wood)’, literally ’splinter-bare’. Baltic German also attests the adjective splinterkahl ’completely bare, stripped bare’. In Estonian, plinder has not been recorded with the meaning ’wood splinter, sliver, etc.’, which invites a consideration of expressions in which unpleasant situations someone may get into (or find themselves in) are described using imagery involving an ork ’spike’, (puu)pind ’(wood) splinter’, or (puu)pulgad ’(wooden) sticks’.

Other Finnic languages and dialects

DOAJ Open Access 2025

Võrdlevalt eesti vanemast leksikograafiast

Madis Jürviste

A comparative view on early Estonian lexicography

The earliest Estonian dictionaries, published in the 17th and 18th centuries, have typically been studied in a narrow context, with a focus on individual sources or the authors in isolation. However, this period merits a comparative cross-source analysis, describing the lexicographical landscape from a broader perspective. This constitutes the main objective of this article. It provides a concise review of the existing literature on this period in Estonian lexicography, followed by a comparative analysis of the macrostructure and microstructure of these dictionaries. The sources analyzed in this study range from Heinrich Stahl’s Vocabula (1637), the first known printed Estonian-German dictionary, up to August Wilhelm Hupel’s Wörterbuch (1780, 2nd ed. 1818). In addition, the article covers Johannes Gutslaff’s Nomenclator (1648), Heinrich Göseken’s Farrago vocabulorum (1660), Salomo Heinrich Vestring’s Lexicon (first half of the 18th century, originally unpublished in print), and Anton Thor Helle’s Vocabularium (1732). The analysis reveals an evolution from a rather non-systematic presentation of linguistic information in the 17th-century sources to the much stricter approach of the 18th-century dictionaries. Most of these dictionaries were preceded by grammars within the same volumes, with the dictionary entries (primary word forms, declensions) more or less linked to these grammatical descriptions, which were heavily influenced by the German language. The content of these early dictionaries still warrants further extensive research: recent studies utilizing large language models indicate that this new technology can play a crucial role in analyzing the content of historical dictionaries (Jürviste et al., in review). In-depth research in this field is essential for a better understanding of the historical development of Estonian lexical strata.

Other Finnic languages and dialects

DOAJ Open Access 2025

Suhtumisest siugudesse Eesti kesk- ja varauusaja ajalooallikates

Inna Põltsam-Jürjo

Snakes appear relatively infrequently in medieval and early modern Estonian written and visual sources, and when they do, they are typically portrayed in a negative light. In theological writings and religious art, snakes symbolized Satan. Normative and narrative sources of the era often described Estonians through references to snakes, emphasizing their paganism, barbarism, and superstition. However, these sources contain little information about actual reptiles. Attitudes toward snakes – and knowledge about them – varied depending on a person’s background: urban and rural dwellers, Germans and Estonians, and individuals of different social classes all had different perceptions. Estonia’s old, pre-Christian folk traditions clearly reflect a positive, respectful attitude towards snakes; they were often seen as beneficial creatures. In contrast, the elite typically interpreted animals through a textual lens, relying on scholarly writings rather than direct observation or empirical data. As a result, snakes were understood largely in symbolic terms, reflecting the strong influence of Christian teachings on human perceptions of animals. During the medieval and early modern periods, medicine was, for the most part, the only field in which snakes held economic significance. The Enlightenment’s advances in medical and natural sciences also shaped attitudes toward snakes. In scholarly texts, practical, observation-based descriptions of reptiles as biological species gradually gained prominence, standing alongside previously dominant theological interpretations.

Other Finnic languages and dialects

DOAJ Open Access 2025

Vestluskaaslase algatatud parandused telesarja „Õnne 13” telefonikõnedes

Andriela Rääbis

Other-initiated repair in the phone calls of the TV series Õnne 13

This paper analyzes phone calls from the Estonian television series Õnne 13 (“Happiness Street 13”) using the methodological framework of conversation analysis. The dataset consists of 60 everyday phone calls from various seasons (1994–2024). A defining characteristic of these calls is their one-sidedness, with one participant simulating a dialogue. By applying the core analytical technique of conversation analysis – the next-turn proof procedure – it is possible to reconstruct the missing turns of the interlocutor based on how the speaker interprets the previous turn. In film discourse, the model includes not only the ratified participants but also the television audience. Therefore, turns in film dialogue must be designed with both the fictional addressee and the audience in mind. Since the viewer cannot hear the other participant, it is the visible character’s task to convey their turns to the audience. One technique used for this purpose is the initiation of repair. In the analyzed conversations, 41 repair initiations were identified: 14 initiated by the caller and 27 by the recipient. Most repair sequences occur at the beginning of the conversation. Repair primarily addresses problems of acceptance (17 instances) and understanding (11 instances). The most common strategies for repair initiation are repetition of the trouble source and open-class initiators (e.g., ah? ‘what?’, misasja? ‘what?’). The relationship between problems and strategies in the analyzed telephone conversations differs somewhat from that in authentic conversations. While open-class initiators typically address hearing problems in real-life interaction, here they more often indicate acceptance problems. Most repair initiations are prosodically marked – expressed in a louder voice, with a tone of surprise, fear, disappointment, or laughter, often stretched and/or strongly emphasized.
Marked prosody is also applied to types of repair initiations that are typically unmarked in authentic conversation.

Other Finnic languages and dialects

DOAJ Open Access 2025

Punktike ja palatalisatsioon. Lähivaade XIX sajandi esimese poole kirjaviisiuuendusele

Külli Prillop, Külli Habicht, Kristiina Praakli

Palatalization is not indicated in standard Estonian orthography. In the 1820s, Otto Wilhelm Masing proposed marking palatalization with a small dot placed beneath the vowel preceding the palatalized consonant. This article examines the extent to which Masing’s proposal was adopted and explores the reasons why the palatalization mark ultimately failed to become part of standard written Estonian. Masing also introduced the letter õ to distinguish between the sounds ö /ø/ and õ /ɤ/ in writing. That innovation was successful, even though – much like palatalization – õ varied regionally in pronunciation and the new character made printing more costly. Only a few printers and authors experimented with the palatalization mark, and all but two soon abandoned it. The only ones who remained committed to the idea were Heinrich Rosenplänter and Masing himself. However, they were unable to provide clear rules for when the palatalization mark should be used and were inconsistent in applying it. As a result, neither any printed texts nor the spoken language of the time offered a reliable model or standard for marking palatalization. Although the palatalization mark never caught on in general writing, it was adopted to some extent by authors of primers, who used it in the earliest stages of reading instruction.

Other Finnic languages and dialects

DOAJ Open Access 2025

Inimese ja looduse suhte mõtestamine esseistikas. Funktsionaalne vaade eesti esseistikale

Asko Lõhmus

The problem of how knowledge becomes understanding of complex phenomena, such as social-ecological systems, is of both theoretical and applied interest. This article is built on the premise that published essays serve as an archive of these cognitive processes. An important reason is that essay-writing is practised across different cultural systems (including scientists), providing potential grounds for cultural exchange. I conduct a functional analysis of 108 essays, selected as a stratified sample from 36 Estonian authors representing three groups (literature; science; other) since the 20th century. The main criteria were that (i) each essay explores the relationship between humans (societies) and nature, along with the interactions between both systems; and (ii) the author has at least three such essays (if more, three were chosen to reflect a diversity of thought). The presence of texts by all three groups of authors reveals three periods of heightened discourse on human-nature relationships: 1967–1980, 1996–2003, and from 2017 onwards. Structurally, ¾ of the essays represent isolated lines of thought (only 10% form coherent programs), where the authors elaborate on their own ideas while referencing mainly distant (foreign) authors. Direct reference to other Estonian authors or building on one’s own prior work is rare. Thus, essays mostly function as “idea laboratories”, with limited trans-disciplinary co-creation of understanding. Furthermore, while readily incorporating new knowledge and ideas, the essays reveal a relatively stable mixture of three main ethical attitudes towards nature throughout their history. By distinguishing the key components of social-ecological complexity (Fig. 2), a “frame text” was defined as its most complete (or, secondarily, the most concise) representation.
Jaan Kaplinski’s “Ecology and Economics” (1972/1996) and Hando Runnel’s “Caretakers and Saviours” (1988) can be seen as frame texts among Estonian essays on human-nature relationships.

Other Finnic languages and dialects

DOAJ Open Access 2025

„Ma olen kõnelenud!”. Varauusaegse Tartu ülikooli (1632–1710) akadeemiliste oratsioonide korpusest

Rahel Toomik

“I have spoken!” The corpus of academic orations at the University of Tartu during the early modern period (1632–1710)

This article introduces the genre of orations, or academic speeches, at the University of Tartu during the early modern period (1632–1710) and examines their role as a central literary, pedagogical, social, and performative practice in university life. The Tartu oration corpus consists of approximately 230 printed texts, mostly in Latin, authored by both students and other members of the academic community. These speeches were composed on a wide range of topics, and for various purposes and occasions. Alongside a description of the corpus based on bibliographic metadata and a comparative analysis of its paratextual features (title pages, dedications, and congratulatory poems), the article focuses on identifying the key characteristics of the oration genre and distinguishing orations from other academic genres, particularly disputations, which have often received greater scholarly attention and overshadowed orations in historical research. The article also explores the value of orations as sources for intellectual history and considers why academic speeches and the oratory tradition have at times been overlooked or dismissed. It provides an overview of existing research on the Tartu oration corpus, offers new perspectives for understanding the genre, highlights accessibility issues related to bibliographic data, and reflects on how and why distant reading techniques used in the digital humanities could be used to further investigate and elevate the corpus.

Other Finnic languages and dialects

DOAJ Open Access 2025

Omastamine või lihtsalt tõlge? Seto laulud eesti keeles

Andreas Kalkun

Appropriation or mere translation? Seto songs in Estonian

Translation and adaptation are common phenomena in folklore. However, when translation is undertaken by folklorists in positions of power and driven by ideological motives, the matter becomes problematic. The relationship between Estonian folkloristics and Seto folklore has been both variable and complex. For historical reasons, the Setos themselves have not had a voice in the processes of publishing or translating folklore. This article examines three cases of translation that demonstrate how even seemingly “innocent” acts of translation can amount to cultural appropriation. Although the cases originate from different periods and reflect differing ideological or scholarly agendas, they reveal similar mechanisms of appropriation. Friedrich Reinhold Kreutzwald drew upon Seto folklore to substantiate his theory of the development of Estonian song and to prove the authenticity of the epic Kalevipoeg. Seto song culture provided Kreutzwald with an ideal opportunity to lend empirical support to his theories. He regarded the alteration and translation of Seto songs as an editorial process intended to uncover their original form. Jakob Hurt translated Seto lyric-epic songs for his popular anthology of folk songs intended “for all Estonians”. Once rendered in Estonian, these songs began a new life in school textbooks and popular publications as Estonian folk songs. Soviet folkloristics, in turn, employed invasive methods to produce ideologically acceptable folklore. During the Stalinist period, folklorists under pressure commissioned Seto women to compose politically suitable songs. A selection of these was published in school textbooks as examples of Soviet Estonian folklore – initially alongside the Seto-language originals, and later solely in Estonian.

Other Finnic languages and dialects

arXiv Open Access 2024

Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic

Fakhraddin Alwajih, Gagan Bhatia, Muhammad Abdul-Mageed

Recent advancements have significantly enhanced the capabilities of Multimodal Large Language Models (MLLMs) in generating and understanding image-to-text content. Despite these successes, progress is predominantly limited to English due to the scarcity of high-quality multimodal resources in other languages. This limitation impedes the development of competitive models in languages such as Arabic. To alleviate this situation, we introduce an efficient Arabic multimodal assistant, dubbed Dallah, that utilizes an advanced language model based on LLaMA-2 to facilitate multimodal interactions. Dallah demonstrates state-of-the-art performance in Arabic MLLMs. Through fine-tuning on six Arabic dialects, Dallah showcases its capability to handle complex dialectal interactions incorporating both textual and visual elements. The model excels in two benchmark tests: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, Dallah has the potential to pave the way for further development of dialect-aware Arabic MLLMs.

en cs.CL, cs.AI

arXiv Open Access 2024

Cross-Dialect Information Retrieval: Information Access in Low-Resource and High-Variance Languages

Robert Litschko, Oliver Kraus, Verena Blaschke et al.

A large amount of local and culture-specific knowledge (e.g., people, traditions, food) can only be found in documents written in dialects. While there has been extensive research conducted on cross-lingual information retrieval (CLIR), the field of cross-dialect retrieval (CDIR) has received limited attention. Dialect retrieval poses unique challenges due to the limited availability of resources to train retrieval models and the high variability in non-standardized languages. We study these challenges on the example of German dialects and introduce the first German dialect retrieval dataset, dubbed WikiDIR, which consists of seven German dialects extracted from Wikipedia. Using WikiDIR, we demonstrate the weakness of lexical methods in dealing with high lexical variation in dialects. We further show that the commonly used zero-shot cross-lingual transfer approach with multilingual encoders does not transfer well to extremely low-resource setups, motivating the need for resource-lean and dialect-specific retrieval models. We finally demonstrate that (document) translation is an effective way to reduce the dialect gap in CDIR.

en cs.CL, cs.IR

arXiv Open Access 2024

Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT

Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari

We explore cross-dialect text-to-speech (CD-TTS), a task to synthesize learned speakers' voices in non-native dialects, especially in pitch-accent languages. CD-TTS is important for developing voice agents that naturally communicate with people across regions. We present a novel TTS model comprising three sub-modules to perform competitively at this task. We first train a backbone TTS model to synthesize dialect speech from a text conditioned on phoneme-level accent latent variables (ALVs) extracted from speech by a reference encoder. Then, we train an ALV predictor to predict ALVs tailored to a target dialect from input text leveraging our novel multi-dialect phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the effectiveness of our model by comparing it with a baseline derived from conventional dialect TTS methods. The results show that our model improves the dialectal naturalness of synthetic speech in CD-TTS.

en cs.SD, cs.CL
