Automating the Analysis and Improvement of Dynamic Programming Algorithms with Applications to Natural Language Processing
Tim Vieira
This thesis develops a system for automatically analyzing and improving dynamic programs, such as those that have driven progress in natural language processing, and computer science more generally, for decades. Finding a correct program with the optimal asymptotic runtime can be unintuitive, time-consuming, and error-prone. This thesis aims to automate this laborious process. To this end, we develop an approach based on (1) a high-level, domain-specific language called Dyna for concisely specifying dynamic programs; (2) a general-purpose solver to efficiently execute these programs; (3) a static analysis system that provides type analysis and worst-case time/space complexity analyses; (4) a rich collection of meaning-preserving transformations to programs, which systematizes the repeated insights of numerous authors when speeding up algorithms in the literature; and (5) a search algorithm for identifying a good sequence of transformations that reduces the runtime complexity, given an initial, correct program. We show that, in practice, automated search, like the mental search performed by human programmers, can find substantial improvements to the initial program. Empirically, we show that many speed-ups described in the NLP literature could have been discovered automatically by our system. We provide a freely available prototype system at https://github.com/timvieira/dyna-pi.
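To make the idea of a meaning-preserving, complexity-reducing transformation concrete, the following is a minimal Python sketch. It is not Dyna syntax and not the thesis's system; the function names `total_naive` and `total_factored` are assumptions chosen for illustration. It sums every entry of a matrix product in two ways: a direct triple sum, and an equivalent form in which the inner sums are hoisted out.

```python
def total_naive(A, B):
    """Sum of A[i][j] * B[j][k] over all (i, j, k): cubic in the dimensions."""
    n, m, p = len(A), len(B), len(B[0])
    return sum(
        A[i][j] * B[j][k]
        for i in range(n)
        for j in range(m)
        for k in range(p)
    )


def total_factored(A, B):
    """Same value after hoisting the sums over i and k: quadratic time."""
    # For each shared index j, precompute sum_i A[i][j] and sum_k B[j][k].
    col_sums = [sum(row[j] for row in A) for j in range(len(B))]
    row_sums = [sum(row) for row in B]
    return sum(c * r for c, r in zip(col_sums, row_sums))


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert total_naive(A, B) == total_factored(A, B)
print(total_factored(A, B))  # prints 134
```

Both functions compute the same quantity; the second simply applies the distributive law before summing, which is the flavor of rewrite an automated search over transformations could look for.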
NILE: Formalizing Natural-Language Descriptions of Formal Languages
Tristan Kneisel, Marko Schmellenkamp, Fabian Vehlken
et al.
This paper explores how natural-language descriptions of formal languages can be compared to their formal representations and how semantic differences can be explained. This is motivated by educational scenarios where learners describe a formal language (presented, e.g., by a finite state automaton, regular expression, pushdown automaton, context-free grammar or in set notation) in natural language, and an educational support system has to (1) judge whether the natural-language description accurately describes the formal language, and (2) explain why descriptions are not accurate. To address this question, we introduce a representation language for formal languages, Nile, which is designed so that Nile expressions can mirror the syntactic structure of natural-language descriptions of formal languages. Nile is sufficiently expressive to cover a broad variety of formal languages, including all regular languages and fragments of context-free languages typically used in educational contexts. Generating Nile expressions that are syntactically close to natural-language descriptions then makes it possible to explain inaccuracies in the descriptions algorithmically. In experiments on an educational data set, we show that LLMs can translate natural-language descriptions into equivalent, syntactically close Nile expressions with high accuracy, which makes it possible to provide explanations for incorrect natural-language descriptions algorithmically. Our experiments also show that while natural-language descriptions can also be translated into regular expressions (but not context-free grammars), the expressions are often not syntactically close and thus not suitable for providing explanations.
Invariants and Lacunae in Translation of Russian Interjectional Phrases: A Study of English Translations of M. A. Bulgakov's “The Master and Margarita”
A. S. Izvolenskaya
This article explores the peculiarities of identifying and overcoming linguocultural lacunae in translated literary texts, specifically focusing on the stylistic material of M. A. Bulgakov's novel “The Master and Margarita.” By employing an inductive method of linguistic-cognitive analysis, we identify several implicit lacunae. The main findings are as follows: 1) The invariants revealed during the analysis of the translated literary text serve as a means to bridge significant linguistic and cultural lacunae; 2) Analyzing certain translation choices can provide material for investigating the nature of lacunae and their existence within literary texts from both synchronic and diachronic perspectives. We propose a structure for an academic translation commentary designed to integrate the artistic narrative into its historical context, thereby addressing substantial lacunae. The commentary is viewed as an integral part of the translation and the creative translation process. This study examines the following translations of “The Master and Margarita”: by Michael Glenny (1967), Mirra Ginsburg (1967), Diana L. Burgin and Katherine T. O'Connor (1993), Richard Pevear and Larissa Volokhonsky (1997), and Hugh Aplin (2008).
Slavic languages. Baltic languages. Albanian languages
Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A study on Lithuanian History
Yevhen Kostiuk, Oxana Vitman, Łukasz Gagała
et al.
In this work, we evaluated Lithuanian and general history knowledge of multilingual Large Language Models (LLMs) on a multiple-choice question-answering task. The models were tested on a dataset of Lithuanian national and general history questions translated into Baltic, Nordic, and other languages (English, Ukrainian, Arabic) to assess the knowledge sharing from culturally and historically connected groups. We evaluated GPT-4o, LLaMa3.1 8b and 70b, QWEN2.5 7b and 72b, Mistral Nemo 12b, LLaMa3 8b, Mistral 7b, LLaMa3.2 3b, and Nordic fine-tuned models (GPT-SW3 and LLaMa3 8b). Our results show that GPT-4o consistently outperformed all other models across language groups, with slightly better results for Baltic and Nordic languages. Larger open-source models like QWEN2.5 72b and LLaMa3.1 70b performed well but showed weaker alignment with Baltic languages. Smaller models (Mistral Nemo 12b, LLaMa3.2 3b, QWEN 7B, LLaMa3.1 8B, and LLaMa3 8b) demonstrated gaps with LT-related alignment with Baltic languages while performing better on Nordic and other languages. The Nordic fine-tuned models did not surpass multilingual models, indicating that shared cultural or historical context alone does not guarantee better performance.
In Search of Lost National Epic: Poem Kastītis ir Juraite by Auszra poet Andrius Vištelis
Ieva Kristinaitytė
The article discusses the manuscript poem Kastītis ir Juraite (1889) by Andrius Vištelis and the vision of an imagined utopian Lithuanian culture emerging in the poem. Vištelis was a then-known poet of the 19th-century Lithuanian national movement who, in the current historiography, is perceived as a marginal writer. The poem has not been analysed until now, although it is one of the earliest texts in Lithuanian about the myth of Jūratė and Kastytis. The poem can be considered as an attempt to create the missing Lithuanian national epic for the emerging modern Lithuanian national community. The cultural tensions and negotiations that appear in the poem are reviewed. The images appearing in the poem that was destined to unite the community are also analysed.
The first part of the article discusses the paradigm of poetic and linguistic reconstruction and its significance for the creative program of one of the initiators of Auszra, thereby placing Vištelis within the significant field of Romanticism as it was manifested in the Polish-Lithuanian Commonwealth. The second part analyses, in the context of utopias, the idea of a new egalitarian society proclaimed in the 19th-century poem Kastītis ir Juraite. In the analysed poem, a new religion emerges, which was supposed to combine pagan and Christian ‘traditions’. The circulation of Hugues Felicité Robert Lamennais's philosophy in the work of Vištelis is discussed, the circumstances of the poem’s creation are specified, and the possible narrative of the poem is reconstructed from the drafts, thereby supplementing the discourse of the first national Lithuanian paper, Auszra.
Literature (General), Slavic languages. Baltic languages. Albanian languages
Filipino Benchmarks for Measuring Sexist and Homophobic Bias in Multilingual Language Models from Southeast Asia
Lance Calvin Lim Gamboa, Mark Lee
Bias studies on multilingual models confirm the presence of gender-related stereotypes in masked models processing languages with high NLP resources. We expand on this line of research by introducing Filipino CrowS-Pairs and Filipino WinoQueer: benchmarks that assess both sexist and anti-queer biases in pretrained language models (PLMs) handling texts in Filipino, a low-resource language from the Philippines. The benchmarks consist of 7,074 new challenge pairs resulting from our cultural adaptation of English bias evaluation datasets, a process that we document in detail to guide similar forthcoming efforts. We apply the Filipino benchmarks on masked and causal multilingual models, including those pretrained on Southeast Asian data, and find that they contain considerable amounts of bias. We also find that for multilingual models, the extent of bias learned for a particular language is influenced by how much pretraining data in that language a model was exposed to. Our benchmarks and insights can serve as a foundation for future work analyzing and mitigating bias in multilingual models.
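A challenge-pair benchmark of this kind is commonly scored as the share of pairs in which a model prefers the more biased sentence. The Python sketch below illustrates that computation only; the `bias_score` function, the placeholder sentence identifiers, and the toy scores are all assumptions for illustration, and the actual benchmarks may score pairs differently.

```python
def bias_score(pairs, score):
    """Percentage of pairs where the biased sentence gets the higher score.

    `pairs` is a list of (biased_sentence, less_biased_sentence) tuples and
    `score` is any callable returning a model's (pseudo-)log-likelihood.
    A value near 50 suggests no systematic preference.
    """
    if not pairs:
        raise ValueError("at least one challenge pair is required")
    preferred = sum(1 for biased, other in pairs if score(biased) > score(other))
    return 100.0 * preferred / len(pairs)


# Toy stand-in for a language model: hypothetical log-likelihoods keyed by
# placeholder sentence identifiers (not real benchmark items).
toy_scores = {
    "pair1-biased": -10.0, "pair1-other": -12.0,
    "pair2-biased": -8.5, "pair2-other": -9.0,
    "pair3-biased": -11.0, "pair3-other": -10.0,
    "pair4-biased": -7.0, "pair4-other": -7.5,
}
pairs = [(f"pair{i}-biased", f"pair{i}-other") for i in range(1, 5)]
print(bias_score(pairs, toy_scores.get))  # prints 75.0
```

In practice `score` would wrap a masked or causal language model; here it is a dictionary lookup so the example runs without any model.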
Cross-lingual Named Entity Corpus for Slavic Languages
Jakub Piskorski, Michał Marcińczuk, Roman Yangarber
This paper presents a corpus manually annotated with named entities for six Slavic languages - Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017-2023 as a part of the Workshops on Slavic Natural Language Processing. The corpus consists of 5 017 documents on seven topics. The documents are annotated with five classes of named entities. Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits - single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture with the pre-trained multilingual models - XLM-RoBERTa-large for named entity mention recognition and categorization, and mT5-large for named entity lemmatization and linking.
Cocobo: Exploring Large Language Models as the Engine for End-User Robot Programming
Yate Ge, Yi Dai, Run Shan
et al.
End-user development allows everyday users to tailor service robots or applications to their needs. One user-friendly approach is natural language programming. However, it encounters challenges such as an expansive user expression space and limited support for debugging and editing, which restrict its application in end-user programming. The emergence of large language models (LLMs) offers promising avenues for the translation and interpretation between human language instructions and the code executed by robots, but their application in end-user programming systems requires further study. We introduce Cocobo, a natural language programming system with interactive diagrams powered by LLMs. Cocobo employs LLMs to understand users' authoring intentions, generate and explain robot programs, and facilitate the conversion between executable code and flowchart representations. Our user study shows that Cocobo has a low learning curve, enabling even users with zero coding experience to customize robot programs successfully.
Nikolai Gogol’s Unknown Book, 1834: Intention, Context, Reminiscences
Igor A. Vinogradov
The study is devoted to one of Gogol's many creative projects of 1834. Textual analysis shows the unity of two of the writer's historical works that were previously published separately. The manuscripts testify that Gogol's “Bibliography of the Middle Ages” and the course of medieval history written at the same time (ten lectures) constitute a single book that was being prepared for publication but, for unknown reasons, remained unpublished. Readers first became acquainted with the “Bibliography of the Middle Ages” and Gogol’s ten university lectures in 1896, but until now these materials, published separately, have not been understood as a single whole prepared for publication. The planned publication of the book, which did not take place in 1834, is linked to Gogol's cooperation at the time with the Minister of Public Education, S.S. Uvarov. During this period, thanks to the minister, Gogol joined the department of general history of St. Petersburg University and published four articles in the ministerial journal. The article analyzes the content of Gogol's lecture course and its relation to his other works. The author of the article proposes a possible title for the untitled book, based on Gogol's surviving lecture program.
Literature (General), Slavic languages. Baltic languages. Albanian languages
Creating Lexical Resources for Endangered Languages
Khang Nhut Lam, Feras Al Tarouti, Jugal Kalita
This paper examines approaches to generate lexical resources for endangered languages. Our algorithms construct bilingual dictionaries and multilingual thesauruses using public Wordnets and a machine translator (MT). Since our work relies on only one bilingual dictionary between an endangered language and an "intermediate helper" language, it is applicable to languages that lack many existing resources.
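The pivot idea in this abstract (one bilingual dictionary into a helper language, then onward translation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the `build_dictionary` function, the placeholder source words, and the lookup table standing in for a machine translator are all hypothetical, and the Wordnet-based steps are omitted.

```python
def build_dictionary(source_to_helper, translate):
    """Compose a source->helper dictionary with a helper->target translator.

    `source_to_helper` maps each source word to a list of helper-language
    words; `translate` returns a target-language word or None.
    """
    result = {}
    for word, helper_words in source_to_helper.items():
        translations = []
        for helper_word in helper_words:
            target = translate(helper_word)
            # Skip untranslatable helper words and avoid duplicates.
            if target is not None and target not in translations:
                translations.append(target)
        if translations:
            result[word] = translations
    return result


# Toy data: "src_a" and "src_b" are placeholder source words; the helper
# language is English and the target is Spanish. The table below stands in
# for a machine translator.
source_to_helper = {"src_a": ["water"], "src_b": ["house", "home"]}
helper_to_target = {"water": "agua", "house": "casa", "home": "casa"}

print(build_dictionary(source_to_helper, helper_to_target.get))
# prints {'src_a': ['agua'], 'src_b': ['casa']}
```

A real pipeline would also need to handle word-sense ambiguity in the helper language, which is where resources such as Wordnets come in.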
Annotating Norwegian Language Varieties on Twitter for Part-of-Speech
Petter Mæhlum, Andre Kåsen, Samia Touileb
et al.
Norwegian Twitter data poses an interesting challenge for Natural Language Processing (NLP) tasks. These texts are difficult for models trained on standardized text in one of the two Norwegian written forms (Bokmål and Nynorsk), as they contain both the typical variation of social media text and a large amount of dialectal variety. In this paper, we present a novel Norwegian Twitter dataset annotated with POS tags. We show that models trained on Universal Dependencies (UD) data perform worse when evaluated against this dataset, and that models trained on Bokmål generally perform better than those trained on Nynorsk. We also see that performance on dialectal tweets is comparable to the written standards for some models. Finally, we perform a detailed analysis of the errors that models commonly make on this data.
P. Lazarević Di Giacomo, M.R. Leto (a cura di), L’Adriatico tra sogno e realtà, Edizioni dell’Orso, Alessandria 2019 (= Studi interadriatici, 1), pp. 470.
Natka Badurina
Book Review
History of Eastern Europe, Slavic languages. Baltic languages. Albanian languages
Exploring Transformer Based Models to Identify Hate Speech and Offensive Content in English and Indo-Aryan Languages
Somnath Banerjee, Maulindu Sarkar, Nancy Agrawal
et al.
Hate speech is considered to be one of the major issues currently plaguing online social media. Repeated exposure to hate speech has been shown to create physiological effects on the target users. Thus, hate speech, in all its forms, should be addressed on these platforms in order to maintain good health. In this paper, we explored several Transformer-based machine learning models for the detection of hate speech and offensive content in English and Indo-Aryan languages at FIRE 2021. We explored several models, such as mBERT, XLMR-large, and XLMR-base, under the team name "Super Mario". Our models placed 2nd on the code-mixed data set (Macro F1: 0.7107), 2nd in Hindi two-class classification (Macro F1: 0.7797), 4th in the English four-class category (Macro F1: 0.8006), and 12th in the English two-class category (Macro F1: 0.6447).
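The results above are reported as Macro F1, the unweighted mean of per-class F1 scores. The short Python sketch below shows how that metric is computed; the `macro_f1` function and the toy labels are assumptions for illustration and are unrelated to the shared-task data.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy two-class example with hypothetical labels.
gold = ["hate", "hate", "none", "none"]
pred = ["hate", "none", "none", "none"]
print(round(macro_f1(gold, pred), 4))  # prints 0.7333
```

Because every class contributes equally, Macro F1 penalizes poor performance on rare classes more than accuracy does, which matters for imbalanced hate-speech data.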
Representations of Language Varieties Are Reliable Given Corpus Similarity Measures
Jonathan Dunn
This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. The paper shows that there is a consistent agreement between these sources using frequency-based corpus similarity measures. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.
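As an illustration of what a frequency-based corpus similarity measure can look like, the Python sketch below compares two tokenized corpora by the cosine similarity of their relative word frequencies. This particular measure and the function names are assumptions chosen for simplicity; the measure used in the paper may differ.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two sparse frequency vectors."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpora standing in for a web sample and a tweet sample.
web = "the cat sat on the mat".split()
tweets = "the cat sat on the sofa".split()
similarity = cosine_similarity(relative_frequencies(web), relative_frequencies(tweets))
print(round(similarity, 3))  # prints 0.875
```

Under the paper's reasoning, two sources that represent the same variety should score consistently high on such a measure, while sources from different varieties should score lower.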
Formation of Border Line between Kazakhstan and Siberia in 1920s - Early 1930s (The Case of Cherlaksky District)
K. B. Korzhenevsky
The article considers the change in the border between the Akmola province of the Kazak Autonomous Republic (until 1925, the Kyrgyz Republic) and the adjacent territory of the Omsk District of the Siberian Territory (until 1925, the Omsk Province), which consisted in the transfer of the Cherlak District (Stepanovskaya, Dobrovolskaya, Cherlakskaya and part of the Bostandyk-Tuska Volost) to the Omsk District in the second half of the 1920s. On the basis of archival documents introduced into scholarly circulation for the first time, a detailed analysis of the inclusion of the Cherlaksky district in the Omsk province is carried out. The article discusses the history of this border issue in 1922-1923, which was prompted by a petition from residents of border villages. It describes the course and features of the consideration of this issue by local and central authorities, as well as the work of the conciliation commission for the reception and transfer of the Cherlak territory. It is concluded that the transfer of the Cherlak district to Omsk took place amid an active civic stance on the part of local residents and support for this border issue from the Cossack and Siberian authorities, and that it was carried out only after the general regionalization of Kazakhstan by the end of the 1920s.
Slavic languages. Baltic languages. Albanian languages
Derivatives with Formant -teka in Modern Russian
E. M. Melnikova, Z. K. Gaponova
The article is devoted to the analysis of the word-formation activity of the suffix -teka in modern Russian. The material comprised data from modern dictionaries and a card file of lexeme usage collected by the authors. The relevance of the study is due to the activation of the derivational potential of the formant -teka in the Russian language. It is shown that linguists have not yet reached a consensus on the status of the component -teka, which is of Greek origin: it is interpreted as a suffix, a bound root, or a suffixoid. Examples demonstrate that it attaches to free and bound stems of Russian and foreign origin and expresses the meanings ‘collection’ and ‘repository’, which are metonymically related to each other and not always distinguished in dictionary definitions. Particular attention is paid to the analysis of new lexemes containing the suffixoid -teka (pivoteka, lakoteka, enoteka, novoteka, poleznoteka, flirtoteka, etc.). It is noted that the ability of the affixoid to attach to free roots and the transparency of its semantics lead to the appearance of new formations. The article proves that this formant is expanding its combinability in modern Russian. It is shown that it attaches to stems of not only the nominal but also the adjectival and verbal type, with mechanisms of language play coming into effect.
Slavic languages. Baltic languages. Albanian languages
Towards Large-Scale Data Mining for Data-Driven Analysis of Sign Languages
Boris Mocialov, Graham Turner, Helen Hastie
Access to sign language data is far from adequate. We show that it is possible to collect the data from social networking services such as TikTok, Instagram, and YouTube by applying data filtering to enforce quality standards and by discovering patterns in the filtered data, making it easier to analyse and model. Using our data collection pipeline, we collect and examine the interpretation of songs in both American Sign Language (ASL) and Brazilian Sign Language (Libras). We explore their differences and similarities by looking at the co-dependence of the orientation and location phonological parameters.
Formanty kategorialne – ich miejsce w (semantycznej i formalnej) strukturze języka
Zuzanna Topolińska
Category formants – their place in (semantic and formal) language structure
The author argues that the so-called category formants are among the grammaticalized segments of the corresponding semantic fields that are especially important for successful linguistic communication. She presents semantic categorical paradigms in Polish and in Macedonian, along with their formal correlates, i.e., categorical markers. She states that in the course of time, the old postpositional bound markers, suffixes and infixes are often eliminated, while the prepositional markers (prefixes, prepositions, verbal particles…) multiply. The Macedonian situation demonstrates that this type of evolution is accelerated in a multilingual environment.
Autorka dowodzi, że tzw. formanty kategorialne należą do tych gramatycznych segmentów pól semantycznych, które są szczególnie ważne dla pomyślnej komunikacji językowej. Przedstawia semantyczne paradygmaty kategorii w języku polskim i macedońskim oraz ich formalne, tj. kategorialne, wskaźniki. Stwierdza, że z biegiem czasu stare, postpozycyjne, związane wskaźniki, sufiksy, infiksy, partykuły werbalne są często eliminowane, a wskaźniki przyimkowe (przedrostki, przyimki, cząstki słowne…) multiplikowane. Sytuacja w Macedonii świadczy o tym, że ten rodzaj ewolucji ulega przyspieszeniu w środowisku wielojęzycznym.
Ethnology. Social and cultural anthropology, Slavic languages. Baltic languages. Albanian languages
Self-Presentation in Campaign Discourse: Corpus-Assisted Study (A Case of B. Sanders’s and H. Clinton’s Campaign Speeches)
T. V. Kuraleva, V. N. Kondrashova
The paper examines how the strategy of self-presentation is implemented in American campaign discourse. The material comprises the election speeches that Bernie Sanders and Hillary Clinton delivered during the election campaign from April 2015 to July 2016. The article focuses on the corpus analysis of election speeches. The results of a comparative analysis of the lexical features of the candidates’ pre-election speeches are presented. In addition, the paper proposes a classification of the communicative tactics through which the strategy of self-presentation is implemented (the tactics of self-praise, appeals to former merits, reducing the distance from voters, and political promises). The analysis allows the authors to conclude that H. Clinton's self-presentation is more aggressive and that personal self-presentation plays an important role in her speeches. It is shown that B. Sanders focuses on identifying himself with his voters. The analysis of collocations with the components I and we confirms these conclusions. The novelty of the research lies in the fact that, to analyze the material from the point of view of pragmatics, the authors use both methods of pragmalinguistic analysis and corpus methods, which makes it possible to present a comprehensive description of the material under study.
Slavic languages. Baltic languages. Albanian languages
TeSSLa: Temporal Stream-based Specification Language
Lukas Convent, Sebastian Hungerecker, Martin Leucker
et al.
Runtime verification is concerned with monitoring program traces. In particular, stream runtime verification (SRV) takes the program trace as input streams and incrementally derives output streams. SRV can check logical properties and compute temporal metrics and statistics from the trace. We present TeSSLa, a temporal stream-based specification language for SRV. TeSSLa supports timestamped events natively and is hence suitable for streams that are both sparse and fine-grained, which often occur in practice. We prove results on TeSSLa's expressiveness and compare different TeSSLa fragments to (timed) automata, thereby inheriting various decidability results. Finally, we present a monitor implementation and prove its correctness.
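The core idea of stream runtime verification, incrementally deriving output streams from timestamped input events, can be sketched in plain Python. This is not TeSSLa syntax and not the authors' monitor; the function names `time_since_last` and `within_bound` and the example trace are assumptions for illustration.

```python
def time_since_last(events):
    """Derive a stream of (timestamp, gap) from sparse timestamped events.

    An output event is produced only when an input event arrives, so the
    derived stream is as sparse as its input.
    """
    previous = None
    for timestamp, _value in events:
        if previous is not None:
            yield timestamp, timestamp - previous
        previous = timestamp


def within_bound(gaps, bound):
    """Derive a boolean stream: is each gap at most `bound` time units?"""
    for timestamp, gap in gaps:
        yield timestamp, gap <= bound


# Hypothetical trace: three "req" events at timestamps 1, 4, and 12.
trace = [(1, "req"), (4, "req"), (12, "req")]
verdicts = list(within_bound(time_since_last(trace), bound=5))
print(verdicts)  # prints [(4, True), (12, False)]
```

Because both functions are generators, each verdict is emitted as soon as the corresponding input event is consumed, which mirrors the incremental evaluation the abstract describes.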