Soome-ugri usuelu eesti etnograafide pilgu läbi
Piret Koosa
Finno-Ugric religious life through the eyes of Estonian ethnographers
From the 1960s to the 1990s, the Estonian National Museum organized numerous field expeditions to other Finno-Ugric peoples within the Soviet Union. The primary aim of these trips was to collect objects of material heritage. During the expeditions, ethnographers produced photographs, drawings, and occasionally video recordings, as well as written ethnographic descriptions and field diaries. These materials provide valuable insights into the lives of Finno-Ugric peoples during this period. They also enable us to better understand Estonian ethnographers’ perspectives on their smaller kindred peoples, including their motivations for collecting, as well as the similarities and differences they perceived between themselves and the locals. One interesting topic that ethnographers occasionally address concerns local spiritual and religious practices and beliefs. Most of the peoples visited by the ethnographers had religious backgrounds – either Russian Orthodox or traditional animist – that differed from those of most Estonians. Another layer was added by the officially atheist Soviet framework. In this paper, I examine the fragmentary references to religion found in the ethnographers’ accounts, focusing on the contexts in which these comments appear and on how the researchers’ positionality is reflected in their observations.
Other Finnic languages and dialects
Turn Complexity of Context-free Languages, Pushdown Automata and One-Counter Automata
Giovanni Pighizzini
A turn in a computation of a pushdown automaton is a switch from a phase in which the height of the pushdown store increases to a phase in which it decreases. Given a pushdown or one-counter automaton, we consider, for each string in its language, the minimum number of turns made in accepting computations. We prove that it cannot be decided whether this number is bounded by a constant. Furthermore, we obtain a non-recursive trade-off between pushdown and one-counter automata accepting in a finite number of turns and finite-turn pushdown automata, which are defined by requiring that the constant bound be satisfied by each accepting computation. We prove that there are languages accepted in a sublinear but not constant number of turns, with respect to the input length. Furthermore, there exists an infinite proper hierarchy of complexity classes, with the number of turns bounded by different sublinear functions. In addition, there is a language requiring a number of turns which is not constant but grows slower than each of the functions defining the above hierarchy.
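The definition of a turn given above can be illustrated with a minimal sketch. It assumes a computation is summarized as the sequence of pushdown-store heights after each step; the function name `count_turns` and the sample height sequences are hypothetical and are not taken from the paper.

```python
def count_turns(heights):
    """Count switches from an increasing phase to a decreasing phase.

    `heights` is the pushdown-store height after each step of one
    computation (an assumed encoding, chosen for illustration only).
    """
    turns = 0
    increasing = False  # True while the most recent height change was a push
    for prev, curr in zip(heights, heights[1:]):
        if curr > prev:
            increasing = True
        elif curr < prev:
            if increasing:
                turns += 1  # an increasing phase just ended: one turn
            increasing = False
        # equal heights leave the current phase unchanged
    return turns


# Pushing three symbols and then popping them makes a single turn.
one_turn = count_turns([0, 1, 2, 3, 2, 1, 0])

# Alternating push and pop makes one turn per push/pop pair.
three_turns = count_turns([0, 1, 0, 1, 0, 1, 0])
```

Under this encoding, a computation that pushes one symbol per `a` and pops one per `b` on an input of the form a^n b^n makes exactly one turn, whereas the alternating sequence makes a number of turns that grows with its length.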
Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects
Kalvin Chang, Yiwen Shao, Jiahong Li
et al.
Despite having hundreds of millions of speakers, Chinese dialects lag behind Mandarin in speech and language technologies. Most varieties are primarily spoken, making dialect-to-Mandarin speech-LLMs (large language models) more practical than dialect LLMs. Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such a cross-dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)-only data, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. Our speech encoder further demonstrates state-of-the-art ASR performance on Chinese dialects. Together, our Chinese dialect benchmark, semantically aligned speech representations, and speech-to-speech retrieval evaluation lay the groundwork for future Chinese dialect speech-LLMs. We release the benchmark at https://github.com/kalvinchang/yubao.
„Pizarro” Eestis. August von Kotzebue teoste inglise adaptsioonid Eesti raamatukogudes ja nende tulmelugu
Kairit Kaur
Pizarro in Estonia: English adaptations of August von Kotzebue’s works in Estonian libraries and their provenance
Around 1800, the German dramatist August von Kotzebue was immensely popular in the Anglophone world. Since his fame as a playwright began in Tallinn and he was married to three noblewomen from Estonia, this article examines whether any of the numerous English adaptations of his works reached Estonia. Based on the collections of Estonian research libraries, it appears that only a few such texts have survived: Kotzebue’s memoirs describing his deportation to Siberia in 1800, and two adaptations of his play Die Spanier in Peru, one by Richard Brinsley Sheridan and another by Thomas Dutton. By analyzing marks of provenance and accession book entries, the study attempts to identify the owners of these volumes in order to determine the readership of English-language books in Estonia. It seems that the censorship policies of the Russian Empire around 1800 prevented the import of English editions at the height of Kotzebue’s fame in England (1798–1800). The earliest evidence of an adaptation in Estonia dates from 1819, the year of Kotzebue’s assassination, in Tartu. The copy of Sheridan’s Pizarro held at the Tallinn University Academic Library is probably a counterfeit edition of an English print (1799), very likely produced in Germany. It belonged either to the teacher and folklorist Alexander Heinrich Neus or to Moritz Gedner, the censor of foreign newspapers in Riga, and entered the collection in 1867 or 1868. Dutton’s adaptation belonged to Friedrich Wilhelm Becker, a teacher at the Tallinn Gymnasium, and was donated in 1848. The interest in English appears in both cases to coincide with an interest in Swedish. One copy of the Siberian memoirs, translated by Benjamin Beresford (1802), belonged to Oskar Kallas, Estonian ambassador to the United Kingdom in 1922–1934, and is preserved in the Estonian Literary Museum. Another copy of this translation was acquired by the Tallinn University Academic Library in 2006.
Other Finnic languages and dialects
Regilaul kui varieeruv süsteem. Suulise kultuuri toimemehhanismide modelleerimise võimalused
Mari Väina
This article aims to highlight the value and potential of combining a large, well-organized collection of Finnic runosongs and computational research for obtaining new information on various aspects of the tradition and, more broadly, also for data-driven modelling of the general mechanisms underlying oral poetry. The article summarizes developments over the past five years as well as prospects for future research.
Folkloric creativity differs in several respects from modern, authorship-based creativity. Rooted in oral cultures, folklore has evolved as a tool for long-term information storage. Hence, variability is an essential feature of folklore, enabling both the preservation of relevant information and adaptation to changing circumstances. Variability and the dynamics of stability and innovation have been at the core of folkloristics throughout its history, but recent advancements in computational power and text analysis methodologies, as well as several funded research projects, have led to significant progress in the field. Processing large collections of texts has improved our understanding of biases within the material, revealed the layered nature of variation, provided new insights into historical communication patterns and regional peculiarities of runosongs, opened fresh avenues for transdisciplinary research on the region’s past, and enabled data-driven exploration of the systematics of folkloric creativity. Understanding variability in runosongs offers insights into human creativity, memory, and the broader dynamics of oral tradition.
Other Finnic languages and dialects
Ühe kriisi kaks nägu. Koroonapandeemiaga toimetuleku metafoorne mõtestamine kriisijuhtide pilgu läbi
Mai Raet, Ene Vainik
While many studies have examined metaphors in COVID-19 discourse, few have focused on how healthcare leaders use metaphors to reflect on the crisis from a post-pandemic perspective. This study analyzes the conceptual metaphors used in reflective discourse by Estonian healthcare leaders to better understand the challenges of medical crisis management and preparedness.
The material is drawn from the 2023 Estonian Health Board report COVID-19 Pandemic Descriptive Analysis and Lessons Learned, which includes a medical overview by virologist Irja Lutsar and interviews with 20 crisis-management experts conducted by Kantar Emor. Grounded in Cognitive Metaphor Theory (CMT), the study compares metaphor use in scientific and everyday discourse, treating them as two complementary perspectives on the same crisis. It explores how these metaphors offer insights into how healthcare leaders conceptualized and made sense of the COVID-19 pandemic.
The analysis shows that the medical overview maintains a neutral, statistics-oriented tone and relies on conventional metaphors, primarily framing the pandemic as a natural force, with only occasional and superficial use of war metaphors. By contrast, the interviews reveal more personal and vivid imagery, often drawing on war and school metaphors. Metaphors of struggle and burden highlight the increased workload and a sense of heroism among healthcare workers, while the virus itself is sometimes personified.
By considering these complementary perspectives together, the study advances understanding of healthcare leadership discourse and the role of metaphor in shaping post-pandemic policy reflection and communication.
Other Finnic languages and dialects
Juhan Liivi luule suhted kirjandusliku traditsiooniga. Varane looming
Tanar Kirs
This article explores the relationship between Juhan Liiv’s poetry and the literary traditions of both Estonian and world poetry in his time. It focuses on Liiv’s early work, during which his distinctive poetic style emerged. While previous research in Estonian literary studies has identified certain aesthetic influences on Liiv, the prevailing view has held that his writing style developed largely in isolation from literary tradition. I compiled all previous observations on Liiv’s influences, connected and expanded upon these findings, and concluded that: (1) Liiv’s development as a poet took place within the context of German literature, particularly under the influence of Friedrich Schiller; (2) he refined his poetic form in the manner of Heinrich Heine’s poetry; and (3) he sought his lyrical voice in the footsteps of Johann Wolfgang von Goethe. Thus, Juhan Liiv’s lyricism – and with it, the beginnings of the first renewal in Estonian poetry – developed in close contact with world poetry and literary thought.
Other Finnic languages and dialects
Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance
Reihaneh Amooie, Wietse de Vries, Yun Hao
et al.
Automatic Speech Recognition (ASR) performance for low-resource languages is still far behind that of higher-resource languages such as English, due to a lack of sufficient labeled data. State-of-the-art methods deploy self-supervised transfer learning, where a model pre-trained on large amounts of data is fine-tuned using little labeled data in a target low-resource language. In this paper, we present and examine a method for fine-tuning an SSL-based model in order to improve the performance for Frisian and its regional dialects (Clay Frisian, Wood Frisian, and South Frisian). We show that Frisian ASR performance can be improved by using multilingual (Frisian, Dutch, English and German) fine-tuning data and an auxiliary language identification task. In addition, our findings show that performance on dialectal speech suffers substantially, and, importantly, that this effect is moderated by the elicitation approach used to collect the dialectal data. Our findings also suggest that relying solely on standard language data for ASR evaluation may underestimate real-world performance, particularly in languages with substantial dialectal variation.
CrossTL: A Universal Programming Language Translator with Unified Intermediate Representation
Nripesh Niketan, Vaatsalya Shrivastva
We present CrossTL, a universal programming language translator enabling bidirectional translation between multiple languages through a unified intermediate representation called CrossGL. Traditional approaches require separate translators for each language pair, leading to exponential complexity growth. CrossTL uses a single universal IR to facilitate translations between CUDA, HIP, Metal, DirectX HLSL, OpenGL GLSL, Vulkan SPIR-V, Rust, and Mojo, with Slang support in development. Our system consists of: language-specific lexers/parsers converting source code to ASTs, bidirectional CrossGL translation modules implementing ToCrossGLConverter classes for importing code and CodeGen classes for target generation, and comprehensive backend implementations handling full translation pipelines. We demonstrate effectiveness through comprehensive evaluation across programming domains, achieving successful compilation and execution across all supported backends. The universal IR design enables adding new languages with minimal effort, requiring only language-specific frontend/backend components. Our contributions include: (1) a unified IR capturing semantics of multiple programming paradigms, (2) a modular architecture enabling extensibility, (3) a comprehensive framework supporting GPU compute, graphics programming, and systems languages, and (4) empirical validation demonstrating practical viability of universal code translation. CrossTL represents a significant step toward language-agnostic programming, enabling write-once, deploy-everywhere development.
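The contrast drawn above between one translator per language pair and a single shared intermediate representation can be made concrete with a small count. This is only an illustrative sketch: the function names are hypothetical, the language list is copied from the abstract, and the assumption that each direction of a pair needs its own translator is ours, not a claim about how CrossTL is implemented.

```python
def pairwise_translators(n):
    """Translators needed if every ordered (source, target) pair gets its own."""
    return n * (n - 1)


def hub_converters(n):
    """Converters needed with a shared IR: one importer and one generator per language."""
    return 2 * n


# Languages named in the abstract as currently supported.
languages = [
    "CUDA", "HIP", "Metal", "DirectX HLSL",
    "OpenGL GLSL", "Vulkan SPIR-V", "Rust", "Mojo",
]

n = len(languages)
direct = pairwise_translators(n)   # one translator per ordered pair
via_ir = hub_converters(n)         # one frontend plus one backend per language
print(f"{n} languages: {direct} direct translators vs {via_ir} IR converters")
```

Under these assumptions, adding a ninth language costs two new components with a shared IR, but sixteen new translators in the pairwise scheme.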
Graph Rewriting Language as a Platform for Quantum Diagrammatic Calculi
Kayo Tei, Haruto Mishina, Naoki Yamamoto
et al.
Systematic discovery of optimization paths in quantum circuit simplification remains a challenge. Today, ZX-calculus, a computing model for quantum circuit transformation, is attracting attention for its highly abstract graph-based approach. Whereas existing tools such as PyZX and Quantomatic offer domain-specific support for quantum circuit optimization, visualization and theorem-proving, we present a complementary approach using LMNtal, a general-purpose hierarchical graph rewriting language, to establish a diagrammatic transformation and verification platform with model checking. Our methodology shows three advantages: (1) manipulation of ZX-diagrams through native graph transformation rules, enabling direct implementation of basic rules; (2) quantified pattern matching via QLMNtal extensions, greatly simplifying rule specification; and (3) interactive visualization and validation of optimization paths through state space exploration. Through case studies, we demonstrate how our framework helps understand optimization paths and design new algorithms and strategies. This suggests that the declarative language LMNtal and its toolchain could serve as a new platform to investigate quantum circuit transformation from a different perspective.
Assessing Thai Dialect Performance in LLMs with Automatic Benchmarks and Human Evaluation
Peerat Limkonchotiwat, Kanruethai Masuk, Surapon Nonesung
et al.
Large language models show promising results in various NLP tasks. Despite these successes, the robustness and consistency of LLMs in underrepresented languages remain largely unexplored, especially concerning local dialects. Existing benchmarks also focus on main dialects, neglecting LLMs' ability on local dialect texts. In this paper, we introduce a Thai local dialect benchmark covering Northern (Lanna), Northeastern (Isan), and Southern (Dambro) Thai, evaluating LLMs on five NLP tasks: summarization, question answering, translation, conversation, and food-related tasks. Furthermore, we propose a human evaluation guideline and metric for Thai local dialects to assess generation fluency and dialect-specific accuracy. Results show that LLM performance declines significantly in local Thai dialects compared to standard Thai, with only proprietary models like GPT-4o and Gemini2 demonstrating some fluency.
Revisiting Common Assumptions about Arabic Dialects in NLP
Amr Keleg, Sharon Goldwater, Walid Magdy
Arabic has diverse dialects, where one dialect can be substantially different from the others. In the NLP literature, some assumptions about these dialects are widely adopted (e.g., “Arabic dialects can be grouped into distinguishable regional dialects”) and are manifested in different computational tasks such as Arabic Dialect Identification (ADI). However, these assumptions are not quantitatively verified. We identify four of these assumptions and examine them by extending and analyzing a multi-label dataset, where the validity of each sentence in 11 different country-level dialects is manually assessed by speakers of these dialects. Our analysis indicates that the four assumptions oversimplify reality, and some of them are not always accurate. This in turn might be hindering further progress in different Arabic NLP tasks.
Sotsiaal-kultuuriline vaade eesti ja sloveeni alkoholiteemalistele vanasõnadele
Saša Babič, Piret Voolaid
This article explores Estonian and Slovenian proverbs related to alcohol and drinking with the aim of interpreting these proverbs in their broader sociocultural context and analysing the controversies embedded in proverbs on this topic. Considering that alcohol is not consumed in the same form everywhere, the article examines Slovenian material as representative of the geographical region of Southern Europe and the Slavic language group, and Estonian material as representative of the geographical region of Northern Europe and the Finno-Ugric language group. The units encompassed by the research contain the following words: wine, beer, spirits (vodka), drunk, and drinking. We analyse how proverbs – as culturally metaphorical units often considered the cornerstone of national identity and a compass of ethnic morals – fit into this ethnic pretext. We focus on three different aspects: those regarded as specific to national culture; gender and family (gender inequality, drinking alcohol as a symbol of masculinity); and the ambivalence surrounding drinking.
Other Finnic languages and dialects
Huumor ja/kui vägivald? Folkloristlik-lingvistiline lähenemine
Liisi Laineste, Władysław Chłopicki
"Humour and/as violence? A folkloristic-linguistic approach". Humour has long been viewed as a subversive phenomenon that points to the incongruities and shortcomings noticed and ridiculed by members of society. The intentions of humour creators are always difficult to pinpoint. Whether with good reason or not, most humour has a target, and the depiction of this target is inevitably politically incorrect, insulting, or even downright blasphemous. Jokes violate harmonious coexistence, collective identity, and communication norms, and that seems to lie at the core of humour. We claim that the superiority and incongruity approaches to humour (traditionally, three major approaches to humour are recognized: superiority, incongruity, and relief) relate to the idea that humour contains some form of clash, either between individuals or meanings. This clash is an element that connects many existing humour theories and is increasingly useful in explaining contemporary forms of humour. Our theoretical contribution analyzes ideas about humour proposed by humour scholars over centuries, particularly those proposed by folklorists and linguists. Humour violates logic, expectations, or linguistic conventions in order to “kidnap” its recipient where they do not intend to go. This is especially true in the anonymous space of the internet, where users enjoy little responsibility for what they say and can usually deny any harmful intention – the ambiguity of humour allows for that. The bottom line is that humour can be used to convey a variety of messages and achieve different goals, some prosocial and others more aggressive. Humour in itself, however, is inherently neither friendly nor aggressive, and its meaning always emerges in context.
Other Finnic languages and dialects
Riim eesti poeetilises kultuuris. Teooriad ja praktikad
Maria-Kristiina Lotman, Rebekka Lotman
In the early periods of Estonian literary poetry, rhyme was almost an integral aspect of verse. However, since the late 1950s, its significance in written poetry has gradually diminished, relegating it to a marginal position by the end of the 20th century. Contemporary acclaimed poetry collections of the 21st century predominantly feature free verse, while rhyme remains prominent in popular poetry styles like pop music lyrics, stage poetry, social media poetry, and hip-hop. Consequently, the theoretical discourse on rhyme has also resurfaced as a relevant issue. This article has two objectives. The first objective is to provide an overview of the language-based issues related to Estonian rhyme, previous approaches to rhyme, and the development of Estonian rhyming poetry from the 17th century to the present day. The second objective is to establish a new methodology for studying the formal structure of Estonian rhyme, analyzing it at the phonemic, prosodic, and word boundary levels. This methodology is applied to the statistical analysis of end-rhyme poetry by authors from various periods, ranging from the second half of the 19th century to contemporary popular poetry. The research reveals two different operational directions in poetry – on the one hand, the aspiration towards perfect rhyme, accurate in both phonemic and durational structure, and on the other hand, resistance to automation and the use of “worn-out” rhymes, employing all possible freedoms in rhyming techniques. It is noteworthy that in the popular poetry of the new era, strategies found in the poetry of the 19th century are being revived. Rap poetry, in particular, is rich in rhyming techniques, incorporating all previously established diversification strategies, from full rhymes to various modernist experiments.
Other Finnic languages and dialects
Mõõnav sugu sealpool head ja kurja. Friedrich Nietzsche vitaalsed afektid Jaan Oksa loomingu kõverpeeglis
Leo Luks
This article aims to reconstruct the philosophical message found in Jaan Oks’ fragmentary works “Females” (Emased, 1908), “Flesh” (Ihu, 1908), and “Nameless Beast” (Nimetu elajas, 1909). The author argues that within these works, amidst alternating and colliding voices, a louder and more distinct one emerges – a prophetic-philosophical narrator who uses unfolding sketches to provide broad metaphysical generalizations and assessments, primarily concerning human nature and gender differences. In addition to reconstructing the philosophical positions, these are compared with Friedrich Nietzsche’s views on the same subjects, exploring the extent of Nietzsche’s influence on Oks’ works.
In the first step of the interpretation, the article examines how Oks’ texts define human nature. Human beings are reduced to natural creatures lacking a soul, with sexual drive at the core of their essence. A comparison reveals significant similarities between Oks’ and Nietzsche’s perspectives on human nature. Both also declare the invalidity of Christian moral teachings. However, Oks’ texts do not lead to a celebration of the natural human sexual drive; instead, an aversion persists. The article continues with a reconstruction of the philosophical generalizations explaining this situation.
In the second step, the article explores the relationship of Oks’ texts with secularization. It becomes apparent that the divine, transcendent sphere cannot be simply erased from Oks’ works, despite declarations to the contrary. For Oks, God remains a problem addressed through blasphemous fantasies in the examined works. Additionally, the hypothetical godless life is not enjoyable for Oks’ philosophical voice; existential meaninglessness dominates. The mood contained in Oks’ works is compared with Nietzsche’s concept of the “death of God”, finding significant similarities.
In the third step, the article examines the relationships between the two sexes. Here, too, substantial overlaps are identified between Oks’ and Nietzsche’s perspectives. Both assert an insurmountable tension and conflict between the sexes, emphasizing women’s greater affinity to nature. In Oks’ works, men are attributed the desire to distance themselves from their nature – sexual drive – and retreat into solitude. From this distance, the female sex is insulted and degraded so vehemently and at such great length that a parallel can be drawn to a blasphemous relationship with God.
The final section examines the pursuit of detachment from life, of overcoming humanity, attributed to the poet in Oks’ works, comparing it to Nietzsche’s concept of art. Despite several similarities, a significant difference emerges: Nietzsche understands art’s main purpose vitalistically, as a means to overcome decadence, the defining movement of his era. This requires the overcoming of humanity as it exists. Oks’ conception, on the other hand, remains programmatically decadent and is directed towards the decline and end of humankind.
Other Finnic languages and dialects
Millise lauliku lapsepõli? Laulu „Kui ma olin väiksekene” allikatest ja autoritest
Taive Särg
The childhood of which songster? On the origins and authors of a beloved song
One of the most iconic and seminal musical works of Estonian culture, the song Lauliku lapsepõli (“Childhood of the Songster”), often referred to by its opening words, “Kui ma olin väiksekene” (“When I Was Little”), was composed on the basis of Estonian folk songs in collaboration between Aino Tamm (1864–1945), one of the first professionally trained Estonian singers, and Miina Härma (1864–1941), the first Estonian female composer. The song has been performed in various formats – including choir, solo, and social singing – published in numerous songbooks, and even reintroduced into oral tradition.
Drawing on the collections of the Estonian Folklore Archives, earlier publications, and prior research, this article identifies the origins of the folk song lyrics and melodies used in Härma and Tamm’s composition more precisely than previous studies, revealing a blend of folk melodies – specifically, shepherds’ calling tunes – from Tarvastu parish and the lyrics of a folk song from Põlva parish. The choir and solo arrangements (the latter with piano accompaniment) reflect the influence of European modern art music. The study also considers the broader role of imitation and recreation in the intersection of Estonian oral and written traditions within the song’s development.
The lyrics and structure of Tamm and Härma’s Lauliku lapsepõli are based on Ado Grenzstein’s song cycle of the same name, written in the traditional regilaul (runosong) style. Grenzstein’s work was first published in his Kooli laulmise raamat (“School Singing Book”, 1878) and subsequently included in several songbooks and reprints.
Grenzstein’s version featured a melody adapted from a traditional regilaul tune from Tõstamaa parish and four original melodies in regilaul style, paired with translated and adapted regilaul lyrics. The same text can be found in the Võro language in the first volume of Vana kannel (“Old Harp” or “Old Psaltery”), a collection of songs from Põlva parish, published and edited by Jakob Hurt. In this publication, the lyrics are a part of a longer regilaul titled Mille mull palʹlʹo sõnno suuh? (“Why Do I Have So Many Words in My Mouth?”), compiled by Hurt from two different manuscripts (1854–1875).
For Lauliku lapsepõli, Tamm and Härma composed a melody based on shepherds’ calling tunes (karjasehelletused) featuring the refrain “alle-a, alle-la”, which Tamm remembered from her childhood in Tarvastu. Härma arranged these melodies into a solo song with piano accompaniment, likely in 1891, which was first performed in 1892 and published in 1923. She later created a multi-voice choir arrangement, printed in 1895 and first performed in 1896. The melody with the original lyrics and refrain was published in 1934, though the original manuscript has not been found. Two overlapping transcriptions of the melody (without the lyrics) and the lyrics (with Grenzstein’s melody) of Lauliku lapsepõli are preserved in a manuscript of folk melodies by Miina Härma in the Estonian Folklore Archives. This manuscript was compiled in 1895–1896 when Härma was collecting folk melodies for the Finnish Literature Society. However, the arrangement of Lauliku lapsepõli predates Härma’s folk melody manuscript, suggesting that the song was composed before the fieldwork, based on personal communication with Tamm.
A textual and melodic analysis demonstrates how, within the sphere of written culture, Lauliku lapsepõli has undergone gradual transformations comparable to those usually seen in oral tradition. The article argues that imitation, recreation, and reimagining were common creative practices during the formative period of Estonian written culture, suggesting that the human urge for recreation is perhaps universal, transcending cultural systems.
Other Finnic languages and dialects
DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages
Fahim Faisal, Orevaoghene Ahia, Aarohi Srivastava
et al.
Language technologies should be judged on their usefulness in real-world use cases. An often overlooked aspect in natural language processing (NLP) research and evaluation is language variation in the form of non-standard dialects or language varieties (hereafter, varieties). Most NLP benchmarks are limited to standard language varieties. To fill this gap, we propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties, which aggregates an extensive set of task-varied variety datasets (10 text-level tasks covering 281 varieties). This allows for a comprehensive evaluation of NLP system performance on different language varieties. We provide substantial evidence of performance disparities between standard and non-standard language varieties, and we also identify language clusters with large performance divergence across tasks. We believe DIALECTBENCH provides a comprehensive view of the current state of NLP for language varieties and takes one step towards advancing it further. Code/data: https://github.com/ffaisal93/DialectBench
Exploring Bengali Religious Dialect Biases in Large Language Models with Evaluation Perspectives
Azmine Toushik Wasi, Raima Islam, Mst Rafia Islam
et al.
While Large Language Models (LLMs) have had a massive technological impact in the past decade, allowing for human-enabled applications, they can produce output that contains stereotypes and biases, especially when using low-resource languages. This can be of great ethical concern when dealing with sensitive topics such as religion. As a means toward making LLMs fairer, we explore bias from a religious perspective in Bengali, focusing specifically on two main religious dialects: Hindu and Muslim-majority dialects. Here, we perform different experiments and audits, presenting a comparative analysis of different sentences using three commonly used LLMs (ChatGPT, Gemini, and Microsoft Copilot) pertaining to the Hindu and Muslim dialects of specific words, and showcasing which ones catch the social biases and which do not. Furthermore, we analyze our findings and relate them to potential reasons and evaluation perspectives, considering their global impact, with over 300 million speakers worldwide. With this work, we hope to establish the rigor for creating more fairness in LLMs, as these are widely used as creative writing agents.
Phonotactic Complexity across Dialects
Ryan Soh-Eun Shim, Kalvin Chang, David R. Mortensen
Received wisdom in linguistic typology holds that if the structure of a language becomes more complex in one dimension, it will simplify in another, building on the assumption that all languages are equally complex (Joseph and Newmeyer, 2012). We study this claim on a micro-level, using a tightly controlled sample of Dutch dialects (across 366 collection sites) and Min dialects (across 60 sites), which enables a fairer comparison across varieties. Even at the dialect level, we find empirical evidence for a tradeoff between word length and a computational measure of phonotactic complexity from an LSTM-based phone-level language model, a result previously documented only at the language level. A generalized additive model (GAM) shows that dialects with low phonotactic complexity concentrate around the capital regions, which we hypothesize is consistent with earlier proposals that language varieties of greater or more diverse populations show reduced phonotactic complexity. We also experiment with incorporating the auxiliary task of predicting syllable constituency, but do not find an increase in the negative correlation observed.