Contested wars and ongoing colonization in the Nenets tundra
This article examines how long-term colonial domination shapes attitudes toward war, violence, and the state among Nenets reindeer herders in the European tundra and the Polar Urals. Drawing on historical sources, contemporary media, open-source data on wartime casualties and mobilization, and long-term ethnographic fieldwork among private reindeer-herding families (2000–2017), I trace changes in relations between Nenets and lutsa – a term referring to Russians and, more broadly, to settlement-dwelling, reindeerless others. Adopting a longue durée perspective, the article follows shifts from early fur-tribute extraction and violent encounters, through Orthodox missionization and Soviet collectivization, to the Second World War and post-Soviet resource extraction. Particular attention is given to an episode of armed resistance – the 1943 Polar Urals uprising, locally known as mandalada – which contributed to the emergence of so-called “Independents”, that is, private reindeer herders who remained outside collective farms for much of the Soviet period. I explore how this past is remembered, silenced, ethically evaluated, or reinterpreted in present-day family narratives. Theoretically, the article draws on Jean and John Comaroff’s notion of colonialism as a “politics of perception and experience” and on Webb Keane’s concept of “ethical affordances” to show how colonial domination works not only through political and economic subordination but also by reshaping what counts as a “proper” life, a respectable person, or a shameful choice, and which forms of violence, compliance, or withdrawal come to appear reasonable or unthinkable. In the final section, I relate this longue durée history to the present, arguing that Nenets have suffered disproportionately high per-capita losses in Russia’s war against Ukraine. 
I suggest that this cannot be reduced to “brainwashing” or pure coercion, but must be understood within a tense field of economic and informational pressure, local expectations, ideals of masculinity, and historically produced desires for recognition as equal citizens in a Russian-dominated society. I also show how different social groups among the Nenets position themselves in distinct ways vis-à-vis the authoritarian state and Russians, and how the notion of the “real person” (nyeney nyenetsya) functions as a protective framework.
The article deals with the parallel preposited adnominal non-partitive genitive in Baltic, Balto-Finnic and Scandinavian languages from the point of view of diachrony, areal linguistics and language typology, where the typological perspective is based mostly on Greenberg's language universals. Data from Scandinavian historical syntax (Braunmüller's studies in word order) reveal a far more complex picture of the historical roots of the preposited genitive in the Nordic languages than one can observe from the smooth surface of today's Swedish, Danish and Norwegian, which correspond so neatly with the Baltic and Balto-Finnic languages with respect to the phenomenon in question.
Chaoju Tang, Vincent J. van Heuven, Wilbert Heeringa
et al.
In this article, we tested some specific claims made in the literature on relative distances among European languages and among Chinese dialects, suggesting that some language varieties within the Sinitic family traditionally called dialects are, in fact, more linguistically distant from one another than some European varieties that are traditionally called languages. More generally, we examined whether distances among varieties within and across European language families were larger than those within and across Sinitic language varieties. To this end, we computed lexico-phonetic as well as syntactic distance measures for comparable language materials in six Germanic, five Romance and six Slavic languages, as well as for six Mandarin and nine non-Mandarin (‘southern’) Chinese varieties. Lexico-phonetic distances were expressed as the length-normalized MPI-weighted Levenshtein distances computed on the 100 most frequently used nouns in the 32 language varieties. Syntactic distance was implemented as (the complement of) the Pearson correlation coefficient found for the PoS trigram frequencies established for a parallel corpus of the same four texts translated into each of the 32 languages. The lexico-phonetic distances proved to be relatively large and of approximately equal magnitude in the Germanic, Slavic and non-Mandarin Chinese language varieties. By contrast, the lexico-phonetic distances among the Romance varieties and among the Mandarin varieties were considerably smaller, though of similar magnitude to each other. Cantonese (Guangzhou dialect) was lexico-phonetically as distant from Standard Mandarin (Beijing dialect) as European language pairs such as Portuguese–Italian, Portuguese–Romanian and Dutch–German.
Syntactically, however, the differences among the Sinitic varieties were about ten times smaller than the differences among the European languages, both within and across the families—which provides some justification for the Chinese tradition of calling the Sinitic varieties dialects of the same language.
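The two distance measures named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: unit edit costs stand in for the weighted costs mentioned above, normalization divides by the length of the longer string (one common choice, assumed here), and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def levenshtein(a, b):
    """Edit distance with unit costs (an assumption; the abstract uses weighted costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Length-normalized distance: edit distance divided by the longer length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(vx * vy)


def syntactic_distance(tags_a, tags_b):
    """Complement of the Pearson correlation between PoS trigram frequencies."""
    count_a = Counter(zip(tags_a, tags_a[1:], tags_a[2:]))
    count_b = Counter(zip(tags_b, tags_b[1:], tags_b[2:]))
    trigrams = sorted(set(count_a) | set(count_b))
    return 1.0 - pearson([count_a[t] for t in trigrams],
                         [count_b[t] for t in trigrams])
```

With these definitions, two texts with identical trigram frequency profiles have a syntactic distance of approximately 0, and the distance grows as the profiles become less correlated.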
In the laboratory of utopian world reform: The Tallinn teacher Edgar von Wahl and the international auxiliary language movement
The creator of one of the most successful planned languages of the first half of the 20th century, Occidental (today known as Interlingue), was Edgar von Wahl (or de Wahl), a teacher from Tallinn. His success story was far from inevitable, but the factors that made it possible have received little attention in the existing literature. This article seeks to fill this gap by analyzing these factors as well as Wahl’s contribution to the international auxiliary language movement. The article mainly discusses the activities of three organizations central to the spread of the international auxiliary language idea: the Delegation for the Adoption of an International Auxiliary Language, the League of Nations, and the International Auxiliary Language Association. An analysis of the relationship between Wahl’s choices in interlinguistics and these organizations reveals that by the first half of the 20th century, the idea of an international auxiliary language had become remarkably popular all over the world and had attracted the attention of several institutions beyond the circles of dedicated enthusiasts. This widespread recognition was the primary precondition for Wahl’s success. However, his personal strategies and decades-long experience in the constructed language community were also crucial. He had been engaged with constructed languages since essentially the beginning of their heyday, that is, from the early years of Volapük and Esperanto. During this time, he had gained valuable experience with the technical intricacies of language creation through continuous dialogue with like-minded individuals, by contributing to various projects, and by navigating the institutional dynamics and competition within these circles. In the 1920s, Wahl successfully applied these experiences to promote his own language project, Occidental, and to mobilize new adherents to the movement around it. 
In doing so, he – unlike many other experimenters – managed to draw significant attention to his project and person and ultimately made a valuable contribution to the development of the international auxiliary language idea.
This article explores the representation of the sense of smell in Carolina Pihelgas’s novel “Looking at the Night” (Vaadates ööd, 2022). Among the senses, smell is particularly difficult to capture in language, often requiring the use of metaphors, comparisons, and figurative expressions. In Pihelgas’s work, smells are not employed merely to describe the environment but also to mediate the characters’ inner lives, memories, traumas, and intergenerational experiences.
The novel consists of three parts, recounting the stories of grandmother, mother, and daughter. Smells weave these narratives together, signifying both personal and historical experiences: they evoke trauma, the ambivalence of the mother-daughter relationship, the preservation of memory, as well as broader political and social conditions. Pihelgas often attributes smells to phenomena that literally lack them (“the smell of a life drifting meaninglessly”, “the scent of anxiety”), thereby rendering emotions corporeal and poetic. The analysis identifies two types of olfactory descriptions: source-based smells (e.g. of rooms, nature, objects), which create a realistic atmosphere and connect characters to their surroundings; and abstract or poetic smells, which convey feelings, memories, and affective experiences.
The article situates Pihelgas’s poetics within the framework of sensory studies, suggesting that smells contribute to the novel’s emotional tension and its atmosphere of gravity. The poetics of smell functions as a bridge between body, memory, and history, linking intimate, personal experience with cultural and intergenerational trauma.
Until the mid-18th century, written Latvian and Estonian served as ‘means of colonization and Christianization’ (Lukas 2020: 42), since they were written by educated Germans who had immigrated to the country. The first steps towards a distinctly Latvian writing system in the Latvian-speaking region of historic Livonia came through the influence of the Moravian Church. The Moravian Brethren, active worldwide, promoted egalitarian practices based on a hierarchical equality between languages, which set in motion a process of decolonization. Through the parallel use of Latvian and German, the inclusive forms of address (‘brother’ and ‘sister’), the recognition of feminine forms, and the explicit marking of linguistic and ethnic affiliation in official Herrnhut historiography, an ‘imagined decolonized space’ was created within the Moravian community in Livonia. This in turn contributed to the Latvians’ search for a shared cultural identity. The text types introduced by the Moravians were locally adapted and served as a foundation for new forms of creativity, which, among other effects, enabled the development of Latvian as a monolingual national language. The prevailing paradigm of external description gradually gave way to the self-description of Latvian culture.
David Beauchemin, Yan Tremblay, Mohamed Amine Youssef
et al.
The tasks of idiom understanding and dialect understanding are both well-established benchmarks in natural language processing. In this paper, we propose combining them, and using regional idioms as a test of dialect understanding. Towards this end, we propose three new benchmark datasets: two for the Quebec dialect of French, namely QFrCoRE, which contains 4,633 instances of idiomatic phrases, and QFrCoRT, which comprises 171 regional instances of idiomatic words; and one for Metropolitan French expressions, MFrCoE, which comprises 4,938 phrases. We explain how to construct these corpora, so that our methodology can be replicated for other dialects. Our experiments with 111 LLMs reveal a critical disparity in dialectal competence: while models perform well on Metropolitan French, 65.77% of them perform significantly worse on Quebec idioms, with only 9.0% favoring the regional dialect. These results confirm that our benchmarks are a reliable tool for quantifying the dialect gap and that prestige-language proficiency does not guarantee regional dialect understanding.
Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins (< 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.
Xaver Maria Krückl, Verena Blaschke, Barbara Plank
Reliable slot and intent detection (SID) is crucial in natural language understanding for applications like digital assistants. Encoder-only transformer models fine-tuned on high-resource languages generally perform well on SID. However, they struggle with dialectal data, where no standardized form exists and training data is scarce and costly to produce. We explore zero-shot transfer learning for SID, focusing on multiple Bavarian dialects, for which we release a new dataset for the Munich dialect. We evaluate models trained on auxiliary tasks in Bavarian, and compare joint multi-task learning with intermediate-task training. We also compare three types of auxiliary tasks: token-level syntactic tasks, named entity recognition (NER), and language modelling. We find that the included auxiliary tasks have a more positive effect on slot filling than intent classification (with NER having the most positive effect), and that intermediate-task training yields more consistent performance gains. Our best-performing approach improves intent classification performance on Bavarian dialects by 5.1 and slot filling F1 by 8.4 percentage points.
Jiugeng Sun, Rita Sevastjanova, Sina Ahmadi
et al.
Dialects suffer from the scarcity of computational textual resources as they exist predominantly in spoken rather than written form and exhibit remarkable geographical diversity. Collecting dialect data and subsequently integrating it into current language technologies present significant obstacles. Gamification has been proven to facilitate remote data collection processes with great ease and on a substantially wider scale. This paper introduces Dia-Lingle, a gamified interface designed to improve and facilitate dialectal data collection tasks such as corpus expansion and dialect labelling. The platform features two key components: the first challenges users to rewrite sentences in their dialects, identifies the dialects through a classifier and solicits feedback, and the second asks users to match sentences to their geographical locations. Dia-Lingle combines active learning with gamified difficulty levels, strategically encouraging prolonged user engagement while efficiently enriching the dialect corpus. A usability evaluation shows high levels of user satisfaction with the interface. We provide the link to Dia-Lingle: https://dia-lingle.ivia.ch/, and demo video: https://youtu.be/0QyJsB8ym64.
Research on cross-dialectal transfer from a standard to a non-standard dialect variety has typically focused on text data. However, dialects are primarily spoken, and non-standard spellings cause issues in text processing. We compare standard-to-dialect transfer in three settings: text models, speech models, and cascaded systems where speech first gets automatically transcribed and then further processed by a text model. We focus on German dialects in the context of written and spoken intent classification -- releasing the first dialectal audio intent classification dataset -- with supporting experiments on topic classification. The speech-only setup provides the best results on the dialect data while the text-only setup works best on the standard data. While the cascaded systems lag behind the text-only models for German, they perform relatively well on the dialectal data if the transcription system generates normalized, standard-like output.
Although botany was not taught as an independent subject at the Swedish University of Tartu, plants were addressed within the context of medical studies. Professors of medicine were required to teach natural sciences on the basis of Johannes Magirus’ (?–1596) textbook “Physiologia peripatetica ex Aristotele”.
At other Swedish universities of the 17th century, at Uppsala and Turku, the first botanical treatises were written by professors of medicine, who also spearheaded the establishment of the first botanical gardens. In Tartu there was no botanical garden during that period, although founding one was discussed to support medical studies. However, a disputation on plants – titled “De plantis” – emerged from the Swedish University of Tartu in 1647. The praeses of the disputation, Professor of Natural Sciences Johannes Erici Stregnensis (Stiernstråle, 1607–1686), and the respondent, student Andreas Arvidi (ca 1620–1673), both came from Strängnäs, Sweden, where botany was taught at the local gymnasium. A year later, the same student engaged in a debate on medicine presided over by Professor of Medicine Sebastian Wirdig. This serves as evidence of the classical connection between botany and medicine.
Maximilian Spliethöver, Sai Nikhil Menon, Henning Wachsmuth
Dialects introduce syntactic and lexical variations in language that occur in regional or social groups. Most NLP methods are not sensitive to such variations. This may lead to unfair behavior of the methods, conveying negative bias towards dialect speakers. While previous work has studied dialect-related fairness for aspects like hate speech, other aspects of biased language, such as lewdness, remain fully unexplored. To fill this gap, we investigate performance disparities between dialects in the detection of five aspects of biased language and how to mitigate them. To alleviate bias, we present a multitask learning approach that models dialect language as an auxiliary task to incorporate syntactic and lexical variations. In our experiments with African-American English dialect, we provide empirical evidence that complementing common learning approaches with dialect modeling improves their fairness. Furthermore, the results suggest that multitask learning achieves state-of-the-art performance and helps to detect properties of biased language more reliably.
"1960 Estonian Song Festival as a source of unity and division
between Estonian communities home and abroad."
Khrushchev’s policy of building cultural relations with the West opened up a possibility for communication between Estonians home and abroad in the late 1950s. From the Soviet perspective, this contact was aimed at weakening the anti-Soviet attitudes of refugee organizations and splitting the community that fled to the West. As a consequence of the non-recognition of the annexation of the Baltic states, a portion of the community was determined not to visit their homeland until it was free. Another part, however, did not see much wrong with a trip to their native land. The disagreements and internal divisions of the community inspired the Soviet officials to influence the refugees in different ways. Making use of people’s homesickness and their wish to see friends and family, a number of people were selected for positive inclusion through cultural contact and other means in order to encourage other refugees to follow their example and change their behaviour and attitudes. The 1960 Song Festival seemed like a perfect opportunity to invite people to visit their homeland. The Festival was also meant to coax the acclaimed conductor and composer Eduard Tubin to visit Estonia.
However, the Song Festival failed to activate communication to the desired degree and Tubin also refused to participate. The manipulation of Tubin, which focused on his professional motivation and connections with former colleagues and acquaintances, finally bore fruit in December 1961. The Ministry of State Security of the Estonian SSR considered this a breakthrough in changing the attitudes of the exiles, as the “emigrant leaders” reacted to Tubin’s visit with calls to refrain from visiting Estonia, which in turn sowed resentment among those who were looking forward to visiting their homeland and deepened internal contradictions. The debate that erupted around attending the Song Festival in Tallinn as well as Tubin’s trip to Estonia brought into view the fragmentation of the Estonian diaspora and the existential concerns of people living in exile, which the Soviet authorities deliberately amplified in order to use them for their benefit through various influence operations.
This article is a sequel to “Struggle over the ages I: Reading Kreutzwald’s “Kalevipoeg” as opposite in meaning to Goethe’s “Faust”” (Parksepp 2023). The pair of articles aims to exemplify how Harold Bloom’s and Paul de Man’s complementary views on rhetoric and tropes help to reveal the differences and similarities between Goethe’s and Kreutzwald’s masterpieces. The objective of the first article was to describe how, according to the anxiety of influence theory, the limiting tropes of irony, metonymy and metaphor can be used to depict the change in meaning from “Faust” to “Kalevipoeg”, whereas this article shows how the Bloomian recurring tropes of synecdoche, hyperbole and metalepsis define ways in which the Estonian national epic recreates the meaning of Goethe’s grand tragedy. Underneath all the stylistic and genre conventions lies a story of two opposite protagonists with a similar fate. Both Faust and Kalevipoeg commit crimes, although the full tragedy of Gretchen is reduced to a synecdoche: an episode where Kalevipoeg meets an island maiden whose death he causes by accident. Both protagonists get help from the devil or hell, but Mephistopheles’ and Faust’s imaginative paper money finds a hyperbolical counterpart in the later work: Kalevipoeg defeats the devil and brings real treasures out from the underworld. However, a Bloomian metaleptic reversal of history occurs in the final episodes of both works. The salvation of Faust raises questions; one could argue that it is an example of Aristotle’s deus ex machina method, where the fate of Faust’s soul does not follow from the plot. By comparison, the final scene of “Kalevipoeg” seems more logical. Although the deceased national hero is briefly cheered in heaven, he is still sent to hell as a gatekeeper. Even on a white horse his disabled body is a gruesome reminder of his blood guilt. 
His wrongdoings and evil deeds are not forgotten as in “Faust”, where the protagonist is saved from hell due to his eternal drive. Even the prophetic final lines of “Kalevipoeg” include a warning taboo: with the return of the mighty Estonian hero, evil will also be set free. Therefore, on a rhetorical level, it can be argued that the end of “Kalevipoeg” is more refined and fulfils the Bloomian trope-reversing trope of metalepsis. Rhetorically, the ending of “Kalevipoeg” can be read as an original to the closing scene of “Faust”, where the interrogation of good and evil (the hero and the devil bound together at hell’s gate) is replaced with an idea of good always prevailing over evil (an all-powerful heaven with Mater Gloriosa just waiting for Faust). Of course, this is only one point of view, but it shows how Bloomian ideas can be used methodically to analyse great works of literature. It also raises the objectively unanswerable question of why great works of literature survive over the ages.
Daniel Lundén, Lars Hummelgren, Jan Kudlicka
et al.
Universal probabilistic programming languages (PPLs) make it relatively easy to encode and automatically solve statistical inference problems. To solve inference problems, PPL implementations often apply Monte Carlo inference algorithms that rely on execution suspension. State-of-the-art solutions enable execution suspension either through (i) continuation-passing style (CPS) transformations or (ii) efficient, but comparatively complex, low-level solutions that are often not available in high-level languages. CPS transformations introduce overhead due to unnecessary closure allocations -- a problem the PPL community has generally overlooked. To reduce overhead, we develop a new efficient selective CPS approach for PPLs. Specifically, we design a novel static suspension analysis technique that determines parts of programs that require suspension, given a particular inference algorithm. The analysis allows selectively CPS transforming the program only where necessary. We formally prove the correctness of the analysis and implement the analysis and transformation in the Miking CorePPL compiler. We evaluate the implementation for a large number of Monte Carlo inference algorithms on real-world models from phylogenetics, epidemiology, and topic modeling. The evaluation results demonstrate significant improvements across all models and inference algorithms.
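The idea of selective CPS can be illustrated with a small Python sketch. It is not the Miking CorePPL implementation; `Suspend`, `square`, `model_cps`, and `run` are hypothetical names, and the weight is a made-up stand-in for a log-likelihood. Only the function that contains a suspension point takes a continuation, while the helper that can never suspend stays in direct style and allocates no closure.

```python
class Suspend:
    """Returned to the inference engine when a program pauses at a suspension point."""

    def __init__(self, weight, resume):
        self.weight = weight  # hypothetical log-weight recorded at the suspension point
        self.resume = resume  # zero-argument continuation that finishes the program


def square(x):
    # Never reaches a suspension point, so a suspension analysis could leave it
    # in direct style: no continuation parameter, no closure allocation.
    return x * x


def model_cps(x, k):
    # Contains a suspension point, so it is written in continuation-passing style.
    y = square(x) + 1.0

    def after_suspension():
        # The rest of the program, captured as a closure.
        return k(y * 2.0)

    return Suspend(weight=-abs(y), resume=after_suspension)


def run(x):
    # A toy "engine": run to the suspension point, inspect the weight, then resume.
    step = model_cps(x, lambda result: result)
    return step.weight, step.resume()
```

In a full CPS transformation, `square` would also receive a continuation argument, which is the kind of unnecessary closure allocation the abstract refers to.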
David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen
et al.
Despite the progress recorded in the last few years in multilingual natural language processing, evaluation is typically limited to a small set of languages with available datasets, which excludes a large number of low-resource languages. In this paper, we created SIB-200, a large-scale open-sourced benchmark dataset for topic classification in 200 languages and dialects, to address the lack of evaluation datasets for Natural Language Understanding (NLU). For many of the languages covered in SIB-200, this is the first publicly available evaluation dataset for NLU. The dataset is based on the Flores-200 machine translation corpus. We annotated the English portion of the dataset and extended the sentence-level annotation to the remaining 203 languages covered in the corpus. Despite the simplicity of this task, our evaluation in the fully supervised setting, the cross-lingual transfer setting and the large language model prompting setting shows that there is still a large gap between the performance of high-resource and low-resource languages when multilingual evaluation is scaled to numerous world languages. We found that languages unseen during the pre-training of multilingual language models, under-represented language families (like Nilotic and Atlantic-Congo), and languages from the regions of Africa, Americas, Oceania and South East Asia, often have the lowest performance on our topic classification dataset. We hope our dataset will encourage a more inclusive evaluation of multilingual language models on a more diverse set of languages. https://github.com/dadelani/sib-200
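The annotation step described above, labelling the English sentences and extending the labels to the other languages of a sentence-aligned corpus, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' pipeline: the data layout, the function name `propagate_labels`, and the toy sentences and labels are all hypothetical.

```python
def propagate_labels(english_labels, parallel_corpus):
    """Copy sentence-level topic labels from English to every aligned language.

    english_labels: {sentence_id: topic}
    parallel_corpus: {language_code: {sentence_id: text}}
    Returns {language_code: [(text, topic), ...]} ordered by sentence id.
    """
    labelled = {}
    for lang, sentences in parallel_corpus.items():
        labelled[lang] = [
            (text, english_labels[sid])
            for sid, text in sorted(sentences.items())
            if sid in english_labels  # skip sentences that were not annotated
        ]
    return labelled


# Toy data (invented for illustration only).
toy_labels = {0: "science", 1: "sports"}
toy_corpus = {
    "eng": {0: "The probe reached orbit.", 1: "The team won the match."},
    "fra": {0: "La sonde a atteint l'orbite.", 1: "L'équipe a gagné le match."},
}
toy_dataset = propagate_labels(toy_labels, toy_corpus)
```

Because the corpus is sentence-aligned, a single annotation pass over one language yields labelled data for every other language without further manual work.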
Atnafu Lambebo Tonja, Hellina Hailu Nigatu, Olga Kolesnikova
et al.
This paper describes CIC NLP's submission to the AmericasNLP 2023 Shared Task on machine translation systems for indigenous languages of the Americas. We present the system descriptions for three methods. We used two multilingual models, namely M2M-100 and mBART50, and one bilingual (one-to-one) model, the Helsinki NLP Spanish-English translation model, and experimented with different transfer learning setups. We experimented with 11 languages of the Americas and report the setups we used as well as the results we achieved. Overall, the mBART setup was able to improve upon the baseline for three out of the eleven languages.
The North Sámi (NS) language comprises four primary dialectal variants that are related but differ in their phonology, morphology, and vocabulary. The unique geopolitical location of NS speakers means that in many cases they are bilingual in Sámi as well as in the dominant state language: Norwegian, Swedish, or Finnish. This enables us to study the NS variants both with respect to the spoken state language and their acoustic characteristics. In this paper, we investigate an extensive set of acoustic features, including MFCCs and prosodic features, as well as state-of-the-art self-supervised representations, namely, XLS-R, WavLM, and HuBERT, for the automatic detection of the four NS variants. In addition, we examine how the majority state language is reflected in the dialects. Our results show that NS dialects are influenced by the state language and that the four dialects are separable, reaching high classification accuracy, especially with the XLS-R model.