Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss
Galim Turumtaev
Neural language models often struggle with low-resource languages due to the limited availability of training data, making tokens from these languages rare in the training set. This paper addresses a specific challenge during training: rare tokens are disproportionately affected by marginalization, which prevents them from learning effectively. We propose a thresholding technique that reduces the impact of this marginalization, allowing rare tokens to benefit from more meaningful alignment. Through experiments with a character-level language model, we demonstrate that this method significantly improves performance on low-resource language validation data. This work is the first to show how negative sampling can be applied to improve the representation of rare tokens by limiting the harmful influence of excessive marginalization, offering a new approach to enhancing language model performance for underrepresented languages.
Illocutionary Force of Adverbial Clauses: Evidence from German
A. V. Averina
This study posits that adverbial clauses in German can be categorized based on their degree of illocutionary force into three groups: (1) incompatible with modal indicators; (2) allowing the use of modal words denoting epistemic modality and inferentiality, but unable to incorporate modal particles within their structure; and (3) permitting the use of any modal indicators, including modal particles. The objective of this paper is to establish a relationship between the semantics of adverbial clauses and the manifestation of their illocutionary force. Data from the DECOW 16A and DWDS corpora are utilized. Through a descriptive methodology, the structural and semantic characteristics of adverbial clauses are elucidated. Componential analysis facilitates the construction of semantic models for complex sentences containing adverbial clauses. Contextual analysis is employed to uncover the influence of contextual surroundings on the illocutionary force of subordinate clauses, while transformational analysis is used to determine the degree of semantic cohesion among the components of complex sentences. It is demonstrated that the semantics of relationships between parts of a complex sentence predetermines the illocutionary independence or dependence of the adverbial clause and may influence its structural synsemantics / autosemantics.
Slavic languages. Baltic languages. Albanian languages
Wojna w dystopijnych powieściach Władimira Sorokina w świetle rozważań o przekroczeniu postmodernizmu
Anna Stryjakowska
The aim of the article is to determine whether the image of war in Vladimir Sorokin's novels Теллурия, Manaraga and Доктор Гарин goes beyond the boundaries of postmodernism. The methodological basis of the analysis is provided by concepts of post-postmodernist literature. The analysis leads to the conclusion that, from the point of view of political engagement and the capacity of language to represent reality, the image of war in Sorokin's dystopias remains secondary to the dynamics of the contemporary world, because the writer focuses on an ironic transposition of general trends and war discourse. A comparative reflection on the works nevertheless reveals the writer's shift from dispersed stories that build distance from the reader towards establishing a bond through the creation of an engaging protagonist acting under conditions of permanent war.
Slavic languages. Baltic languages. Albanian languages, History (General) and history of Europe
BASIC METHODS OF MOTIVATING STUDENTS OF NON-LINGUISTIC SPECIALTIES WHEN TEACHING ENGLISH IN HIGHER EDUCATIONAL INSTITUTIONS
N.A. Liubimova, Ch.E. Kulmatova, G.A. Pryanikova et al.
The article introduces various ways to shape students’ motivation in the process of teaching a foreign language for non-linguistic specialties at higher education establishments. Various methods have been proposed to increase students’ educational motivation: country studies material, videos, extracurricular activities, and multimedia. Country studies contribute to developing communication and speaking skills and enrich knowledge about the culture, geography, and history of the country of the studied language. The use of video material makes classes more interesting, illustrative, and dynamic. The proper organization of extracurricular activities has a great positive impact on the educational process. The active use of multimedia presentations, simulators, online platforms, services, applications, and artificial intelligence encourages the modern process of education to take advantage of the latest technologies. The practical significance of this article is that the authors, the academicians of the Department of Foreign Languages of Kyrgyz-Russian Slavic University, drawing on their own experience in teaching a foreign language at a non-linguistic university, provide practical recommendations to solve the problem of motivation, which may be of interest to teachers at schools and other higher education institutions.
Idefix-Closed Languages and Their Application in Contextual Grammars
Marvin Ködding, Bianca Truthe
In this paper, we continue the research on the power of contextual grammars with selection languages from subfamilies of the family of regular languages. We investigate infix-, prefix-, and suffix-closed languages (referred to as idefix-closed languages) and compare such language families to some other subregular families of languages (finite, monoidal, nilpotent, combinational, (symmetric) definite, ordered, non-counting, power-separating, commutative, circular, union-free, star, and comet languages). Further, we compare the families of the hierarchies obtained for external and internal contextual grammars with the language families defined by these new types for the selection. In this way, we extend the existing hierarchies by new language families. Moreover, we solve an open problem regarding internal contextual grammars with suffix-closed selection languages.
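As a rough illustration of the closure properties named in this abstract, the sketch below checks whether a finite set of words is prefix-, suffix-, or infix-closed. The function names are hypothetical and not taken from the paper, and the restriction to finite languages is a simplifying assumption made only for the example.

```python
def prefixes(w):
    """All prefixes of w, including the empty word and w itself."""
    return {w[:i] for i in range(len(w) + 1)}


def suffixes(w):
    """All suffixes of w, including the empty word and w itself."""
    return {w[i:] for i in range(len(w) + 1)}


def infixes(w):
    """All contiguous factors of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def is_closed(language, parts):
    """True if every part (as produced by `parts`) of every word is in the language."""
    return all(parts(w) <= language for w in language)


# Illustrative finite language: contains every prefix of "ab" but not the suffix "b".
example = {"", "a", "ab"}
prefix_closed = is_closed(example, prefixes)
suffix_closed = is_closed(example, suffixes)
infix_closed = is_closed(example, infixes)
```

Because every prefix and every suffix of a word is also an infix, an infix-closed language is automatically prefix- and suffix-closed; the example language is prefix-closed only.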
Identity Testing for Stochastic Languages
Smayan Agarwal, Shobhit Singh, Aalok Thakkar
Determining whether an unknown distribution matches a known reference is a cornerstone problem in distributional analysis. While classical results establish a rigorous framework in the case of distributions over finite domains, real-world applications in computational linguistics, bioinformatics, and program analysis demand testing over infinite combinatorial structures, particularly strings. In this paper, we initiate the theoretical study of identity testing for stochastic languages, bridging formal language theory with modern distribution property testing. We first propose a polynomial-time algorithm to verify if a finite state machine represents a stochastic language, and then prove that rational stochastic languages can approximate an arbitrary probability distribution. Building on these representations, we develop a truncation-based identity testing algorithm that distinguishes between a known and an unknown distribution with sample complexity $\widetilde{\Theta}\left( \frac{\sqrt{n}}{\varepsilon^2} + \frac{n}{\log n} \right)$ where $n$ is the size of the truncated support. Our approach leverages the exponential decay inherent in rational stochastic languages to bound truncation error, then applies classical finite-domain testers to the restricted problem. This work establishes the first identity testing framework for infinite discrete distributions, opening new directions in probabilistic formal methods and statistical analysis of structured data.
Density of rational languages under shift invariant measures
Valérie Berthé, Herman Goulet-Ouellet, Dominique Perrin
We study density of rational languages under shift invariant probability measures on spaces of two-sided infinite words, which generalizes the classical notion of density studied in formal languages and automata theory. The density for a language is defined as the limit in average (if it exists) of the probability that a word of a given length belongs to the language. We establish the existence of densities for all rational languages under all shift invariant measures. We also give explicit formulas under certain conditions, in particular when the language is aperiodic. Our approach combines tools and ideas from semigroup theory and ergodic theory.
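The verbal definition of density in this abstract can be written out as a formula. The notation below is an assumption made for illustration (alphabet $A$, shift invariant measure $\mu$, and $\mu(L \cap A^i)$ for the probability that a word of length $i$ belongs to $L$) and may differ from the symbols used in the paper itself.

```latex
\[
  \delta_\mu(L) \;=\; \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu\bigl(L \cap A^{i}\bigr),
  \qquad \text{whenever the limit exists.}
\]
```

Taking the limit in average (a Cesàro limit) rather than the plain limit of $\mu(L \cap A^{i})$ matters because the probabilities may oscillate with the length $i$ while their running average still converges.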
It's Not Easy Being Green: On the Energy Efficiency of Programming Languages
Nicolas van Kempen, Hyuk-Je Kwon, Dung Tuan Nguyen et al.
Does the choice of programming language affect energy consumption? Previous highly visible studies have established associations between certain programming languages and energy consumption. A causal misinterpretation of this work has led academics and industry leaders to use or support certain languages based on their claimed impact on energy consumption. This paper tackles this causal question directly: it develops a detailed causal model capturing the complex relationship between programming language choice and energy consumption. This model identifies and incorporates several critical but previously overlooked factors that affect energy usage. These factors, such as distinguishing programming languages from their implementations, the impact of the application implementations themselves, the number of active cores, and memory activity, can significantly skew energy consumption measurements if not accounted for. We show -- via empirical experiments, improved methodology, and careful examination of anomalies -- that when these factors are controlled for, notable discrepancies in prior work vanish. Our analysis suggests that the choice of programming language implementation has no significant impact on energy consumption beyond execution time.
On Quantum Programming Languages
Benoît Valiron
This thesis (Habilitation à diriger des recherches) presents some of my research contributions since my Ph.D. defense in 2008. I have had the chance to participate in the development of quantum programming languages since their early days: the presentation aims to give my point of view on the evolution of the subject, my contributions, and the current research trends in the community. The target audience is a graduate student interested in pointers to the field of quantum programming languages.
“Come mosche, qui e là…”: i rumors come fonte di comunicazione nello spazio del post-folklore (traduzione di Manuel Paludi)
Igor’ Orlov
Italian translation of “Slovo mukhi, tut i tam...”: slukhi kak istochnik kommunikatsii v prostranstve postfol’klora by Igor’ Orlov.
Slavic languages. Baltic languages. Albanian languages
Note on dissecting power of regular languages
Josef Rukavicka
Let $c>1$ be a real constant. We say that a language $L$ is $c$-\emph{constantly growing} if for every word $u\in L$ there is a word $v\in L$ with $\vert u\vert<\vert v\vert\leq c+\vert u\vert$. We say that a language $L$ is $c$-\emph{geometrically growing} if for every word $u\in L$ there is a word $v\in L$ with $\vert u\vert<\vert v\vert\leq c\vert u\vert$. Given a language $L$, we say that $L$ is $REG$-\emph{dissectible} if there is a regular language $R$ such that $\vert L\setminus R\vert=\infty$ and $\vert L\cap R\vert=\infty$. In 2013, it was shown that every $c$-constantly growing language $L$ is $REG$-dissectible. In 2023, the following open question was posed: "Is the family of geometrically growing languages $REG$-dissectible?" We construct a $c$-geometrically growing language $L$ that is not $REG$-dissectible, thereby answering the open question in the negative.
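The two growth conditions defined in this abstract depend only on word lengths, so they can be probed on a finite sample of lengths. The sketch below is an illustration under that assumption: the helper `satisfies_growth` is hypothetical, and a finite check can only refute a condition or fail to refute it on the sample, never prove it for an infinite language.

```python
def satisfies_growth(lengths, bound):
    """Check the growth condition on a finite sample of word lengths.

    For each length n except the largest, some strictly larger length must be
    at most bound(n). The smallest larger length is the best candidate, so it
    suffices to compare consecutive distinct lengths.
    """
    distinct = sorted(set(lengths))
    return all(nxt <= bound(cur) for cur, nxt in zip(distinct, distinct[1:]))


c = 2  # illustrative constant with c > 1

# Sample: word lengths 2, 4, 8, ..., 1024.
powers_of_two = [2 ** k for k in range(1, 11)]

# Geometric condition |u| < |v| <= c * |u|.
geometric = satisfies_growth(powers_of_two, lambda n: c * n)

# Constant condition |u| < |v| <= c + |u|.
constant = satisfies_growth(powers_of_two, lambda n: c + n)
```

On this sample the geometric condition holds (each length is exactly twice the previous one) while the constant condition fails at the step from 4 to 8, which shows why the geometric family is strictly more permissive.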
Activities of Arctic Commission under Council of People’s Commissars of USSR in 1928—1930: from Tasks of Scientific Research to Plans for Economic Development
N. I. Burnasheva
The organizational formation of the system of state management of the Arctic development processes in the USSR during the first five-year plans is considered. The relevance of the topic stems from the growing interest in Arctic development in modern conditions. The activities of the Arctic Commission formed under the government, which was entrusted with the role of a single coordinator of state events and a center for regulating the activities of various institutions and departments in the Arctic region, are shown. Attention is drawn to the initial period of activity of the government commission, associated with the country's adoption of a course towards industrialization and the transition to a system of planned regulation of the economy. Under these conditions, the aims of the scientific expeditions that had previously explored the Arctic possessions began to change radically, and research work began to shift toward solving applied economic problems. On the basis of archival sources, it has been established that the Arctic Commission initiated and led a huge complex of works on organizing and coordinating research and practical economic work in the north of the country. It is concluded that during the years of the first five-year plans, the development of the Arctic territories of the USSR reached a turning point. The most important component of the new stage in the development of the Arctic was the transfer of managerial initiative in organizing work in the north of the country into the hands of the state.
Slavic languages. Baltic languages. Albanian languages
Lang-n-Send: Processes That Send Languages
Matteo Cimini
We present Lang-n-Send, a pi-calculus that is equipped with language definitions. Processes can define languages in operational semantics, and use them to execute programs. Furthermore, processes can send and receive pieces of operational semantics through channels. We present a reduction semantics for Lang-n-Send, and we offer examples that demonstrate some of the scenarios that Lang-n-Send captures.
Ordering Regular Languages and Automata: Complexity
Giovanna D'Agostino, Davide Martincigh, Alberto Policriti
Given an order of the underlying alphabet we can lift it to the states of a finite deterministic automaton: to compare states we use the order of the strings reaching them. When the order on strings is the co-lexicographic one \emph{and} this order turns out to be total, the DFA is called Wheeler. This recently introduced class of automata -- the \emph{Wheeler automata} -- constitutes an important data structure for languages, since it allows the design and implementation of a very efficient tool-set of storage mechanisms for the transition function, supporting a large variety of substring queries. In this context it is natural to consider the class of regular languages accepted by Wheeler automata, i.e. the Wheeler languages. An inspiring result in this area is the following: it has been shown that, as opposed to the general case, the classic determinization by powerset construction is \emph{polynomial} on Wheeler automata. As a consequence, most classical problems, when considered on this class of automata, turn out to be "easy" -- that is, solvable in polynomial time. In this paper we consider computational problems related to Wheelerness, but starting from non-deterministic automata. We also consider the case of \emph{reduced} non-deterministic ones -- a class of NFAs where recognizing Wheelerness is still polynomial, as for DFAs. Our collection of results shows that moving towards non-determinism is, in most cases, a dangerous path leading quickly to intractability. Moreover, we start a study of "state complexity" related to Wheeler DFAs and languages, proving that the classic construction for the intersection of languages turns out to be computationally simpler on Wheeler DFAs than in the general case. We also provide a construction for the minimum Wheeler DFA recognizing a given Wheeler language.
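The co-lexicographic order mentioned in this abstract compares strings from their last character backwards. A minimal sketch, assuming the alphabet order is Python's default character order and using a hypothetical helper name `colex_key`:

```python
def colex_key(word):
    """Sort key for co-lexicographic order: compare words by their reversal."""
    return word[::-1]


words = ["ab", "ba", "b", "aa", "a"]

# Words ending in "a" come before words ending in "b"; ties are broken by
# the preceding characters, read right to left.
ordered = sorted(words, key=colex_key)
# ordered == ["a", "aa", "ba", "b", "ab"]
```

This only illustrates the order on strings. Deciding whether an automaton is Wheeler additionally requires that the induced order on its states be total, which the sketch does not attempt.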
On the Generative Capacity of Contextual Grammars with Strictly Locally Testable Selection Languages
Jürgen Dassow, Bianca Truthe
We continue the research on the generative capacity of contextual grammars where contexts are adjoined around whole words (externally) or around subwords (internally) which belong to special regular selection languages. All languages generated by contextual grammars where all selection languages are elements of a certain subregular language family form again a language family. We investigate contextual grammars with strictly locally testable selection languages and compare those families to families which are based on finite, monoidal, nilpotent, combinational, definite, suffix-closed, ordered, commutative, circular, non-counting, power-separating, or union-free languages.
Lj. Banjanin, P. Lazarević Di Giacomo, K. Stantchev (a cura di), Per Aleksander Naumow. Studi in suo onore, Edizioni dell’Orso, Alessandria 2019
Stefano Aloe
Book Review
History of Eastern Europe, Slavic languages. Baltic languages. Albanian languages
La casa senza amareni
Gian Piero Piretto
The article focuses on the lack of intérieur in Chekhov’s Cherry Orchard. The almost total absence of indications concerning the stage design underscores the spiritual emptiness of the house, a mansion that is no longer a home, where new characters (new social classes) arise in addition to the old landlords and take their places in organizing and altering the situation. Svetlana Boym’s considerations on “diasporic intimacy” help to emphasise the attitude of Ranevskaya towards the garden: she already lives elsewhere and her declarations of love for the orchard are nothing but empty, nostalgic words. She, and most of the other characters, long to leave and abandon the cherry orchard forever. The nursery is the only “furnished” room, according to Chekhov’s indications. The rest of the house is devoid of furniture, pictures, curtains, the “marks” that, according to Walter Benjamin, define the bourgeois intérieur of the 19th century. Old servant Firs, kitschy scenes, minor characters are the equivalent of the souvenirs that Soviet emigrants bring along when they leave the country. A short reflection on Stanislav Zhukovsky’s paintings of Russian mansion intérieur helps to demonstrate that The Cherry Orchard is not a drama, but a vaudeville.
Slavic languages. Baltic languages. Albanian languages
Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian
Filip Klubicka, Antonio Toral, V. M. Sánchez-Cartagena
This paper presents a quantitative fine-grained manual evaluation approach to comparing the performance of different machine translation (MT) systems. We build upon the well-established multidimensional quality metrics (MQM) error taxonomy and implement a novel method that assesses whether the differences in performance for MQM error types between different MT systems are statistically significant. We conduct a case study for English-to-Croatian, a language direction that involves translating into a morphologically rich language, for which we compare three MT systems belonging to different paradigms: pure phrase-based, factored phrase-based and neural. First, we design an MQM-compliant error taxonomy tailored to the relevant linguistic phenomena of Slavic languages, which made the annotation process feasible and accurate. Errors in MT outputs were then annotated by two annotators following this taxonomy. Subsequently, we carried out a statistical analysis which showed that the best-performing system (neural) reduces the errors produced by the worst system (pure phrase-based) by more than half (54%). Moreover, we conducted an additional analysis of agreement errors in which we distinguished between short (phrase-level) and long distance (sentence-level) errors. We discovered that phrase-based MT approaches are of limited use for long distance agreement phenomena, for which neural MT was found to be especially effective.
Computer Science
The global space of international students in 2010
M. Börjesson
International students have become an increasingly important research object – not only on the basis of the overall expansion and importance of international students in higher education and in national economic policies, but also since they constitute a strategic research object for understanding the global landscape of higher education. By using correspondence analysis on a data set on countries of destination and regions of origin, the global space of international students is depicted. The analysis reveals a structure with three main poles, a Pacific pole, a Central European one and a French/Iberian one. The three poles correspond to three different logics of recruitment: a market logic, a proximity logic and a colonial logic. The three poles and logics are also related to linguistic structures. The Pacific/Market pole is dominated by English, while the Central European pole has German and Slavic languages as a common denominator, and the French and Iberian pole has French, Spanish and Portuguese in common with their former colonies. It is argued that the Pacific/Market pole is the dominating pole in the space due to the high concentration of resources of different sorts, including economic, political, educational, scientific and not least, linguistic assets.
2. Common Slavic in the light of language contact and areal linguistics: Issues of methodology and the history of research
Vít Boček