Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss
Galim Turumtaev
Neural language models often struggle with low-resource languages due to the limited availability of training data, making tokens from these languages rare in the training set. This paper addresses a specific challenge during training: rare tokens are disproportionately affected by marginalization, which prevents them from learning effectively. We propose a thresholding technique that reduces the impact of this marginalization, allowing rare tokens to benefit from more meaningful alignment. Through experiments with a character-level language model, we demonstrate that this method significantly improves performance on low-resource language validation data. This work is the first to show how negative sampling can be applied to improve the representation of rare tokens by limiting the harmful influence of excessive marginalization, offering a new approach to enhancing language model performance for underrepresented languages.
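The abstract does not spell out the thresholding rule, so the sketch below is only one possible reading, not the author's method. It assumes that non-target tokens whose logit lies more than a hypothetical `margin` below the target's logit are left out of the softmax denominator, so the loss no longer pushes them further down. The function names and the `margin` parameter are illustrative assumptions.

```python
import math


def cross_entropy(logits, target):
    """Standard cross-entropy for one position: logsumexp(logits) - logits[target]."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]


def thresholded_cross_entropy(logits, target, margin):
    """Cross-entropy over a reduced candidate set (assumed rule, see lead-in).

    Non-target tokens with logit < logits[target] - margin are dropped from
    the normalizer, so they contribute nothing to the loss or its gradient.
    """
    cutoff = logits[target] - margin
    kept = [x for i, x in enumerate(logits) if i == target or x >= cutoff]
    m = max(kept)
    lse = m + math.log(sum(math.exp(x - m) for x in kept))
    return lse - logits[target]


if __name__ == "__main__":
    logits = [2.0, 1.0, -5.0]  # the third token is already far below the target
    print(cross_entropy(logits, 0))
    print(thresholded_cross_entropy(logits, 0, margin=3.0))
```

With a very large `margin` the function reduces to ordinary cross-entropy; a smaller `margin` removes more low-scoring tokens from the normalizer.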
OpenLID-v3: Improving the Precision of Closely Related Language Identification - An Experience Report
Mariia Fedorova, Nikolay Arefyev, Maja Buljan et al.
Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available on https://huggingface.co/HPLT/OpenLID-v3.
The Difficulties in Emotion Regulation Scale Short Form: A Confirmatory Factor Analysis Among Parents of Children With Mental Health Disorders in Norway
Linda Severinsen, Jan Reidar Stiegler, H. Nissen-Lie et al.
Difficulties in emotion regulation lie at the core of many mental health challenges. The Difficulties in Emotion Regulation Scale Short Form (DERS‐SF) is a widely used tool for measuring such difficulties. Although it has been translated into several languages and validated across various countries worldwide, its psychometric properties have not been thoroughly examined in Scandinavian populations. This study aimed to replicate the original multidimensional six‐factor DERS‐SF model in a Norwegian sample, compare it with alternative models, and examine its correlation with psychological distress. As part of a randomized controlled trial (RCT), 138 parents of children with mental health disorders completed the DERS‐SF and the 10‐item Symptom Check List (SCL‐10). Confirmatory factor analyses (CFAs) were conducted to compare the six‐factor model with a five‐factor model and two hierarchical models. CFAs supported the six‐factor, five‐factor, and hierarchical models, all of which showed good model fit. The six‐factor model was preferred over the five‐factor model based on theoretical considerations regarding the inclusion of the Awareness factor. Reliability analysis yielded McDonald's omega values ranging from 0.61 to 0.94. A positive correlation between emotion regulation difficulties and psychological distress was observed (r = 0.445, p < 0.001). The findings supported the applicability of the DERS‐SF in a Norwegian population and its utility for clinical and research purposes. However, as the sample consisted of parents of children receiving mental health care, caution is required when interpreting these results due to potential limitations regarding external validity and generalizability. Trial Registration: ClinicalTrials.gov, identifier: NCT04885036.
Tamil Speakers in Switzerland: An Intergenerational and Typological Perspective
S. Rajamathangi, A. Auer, Gurujegan Murugesan
Since the mid-1980s, many Tamils left their homeland because of the civil war in Sri Lanka (1983–2009) and for other reasons and settled in different countries. More than 40,000 Tamil migrants have come to Switzerland since then, and Tamil is spoken as a heritage language by second- and third-generation speakers who were born and raised in Switzerland. Within this context, it is the aim of the current study to shed light on the difference between Tamil spoken in the first generation (migrant language) and the second generation (heritage language) in the Swiss German and Swiss French parts of Switzerland. We therefore study Tamil, which is part of the Dravidian language family, in different majority language contexts, i.e., a Germanic language and a Romance language, respectively. While some research on Tamil in a diaspora setting already exists on migrated Tamil communities in Lancaster, California (US), East London (UK) and Toronto (Canada), the focus on Switzerland and contact with German and French has not previously been investigated. The data under investigation, which stems from 20 speakers in total (i.e., 5 first-generation and 5 second-generation speakers from the Swiss German and the Swiss French parts respectively), was collected in 2024 by way of a semi-structured interview based on a sociolinguistic questionnaire and a linguistic test. The data serves as the basis for the intergenerational and typological comparison. The analysis reveals systematic intergenerational differences across several morphosyntactic domains, including agreement, negation pattern, case marking, and subject pro-drop. While first-generation speakers retain greater access to dialect-specific and register-sensitive patterns, second-generation speakers show increased reliance on discourse-pragmatic cues and reduced sensitivity to morphologically encoded distinctions. 
These findings highlight the role of register, input conditions, and discourse context in shaping heritage Tamil across generations in Switzerland.
Von Opitz bis Hunold – programmatische Ansätze zur Etablierung der deutschen Nationalliteratur in Lyrikanthologien
Tomasz Jabłecki
In the 17th and 18th centuries, numerous German-language poetry anthologies appeared whose prefaces, driven by a strongly accentuated cultural-patriotic interest, consistently argue for establishing German as a language of poetry. They all refer to Opitz, name ancient, contemporary foreign, and German models, and contain poetological instructions for a future generation of poets. The article seeks to highlight the programmatic character of the preface and to illustrate its contribution to the formation of German national literature using important examples, including Opitz, Neukirch, Hunold, and Weichmann.
Germanic languages. Scandinavian languages, German literature
Code Less to Code More: Streamlining Language Server Protocol and Type System Development for Language Families
Federico Bruzzone, Walter Cazzola, Luca Favalli
Developing editing support for $L$ languages in $E$ editors is complex and time-consuming. Some languages do not provide dedicated editors, while others offer a single native editor. The $\textit{language server protocol}$ (LSP) reduces the language-editor combinations $L \times E$ to $L + E$, where a single language server communicates with editors via LSP plugins. However, overlapping implementations of linguistic components remain an issue. Existing language workbenches struggle with modularity, reusability, and leveraging type systems for language server generation. In this work, we propose: (i) Typelang, a family of domain-specific languages for modular, composable, and reusable type system implementation, (ii) a modular language server generation process, producing servers for languages built in a modular workbench, (iii) the variant-oriented programming paradigm and a cross-artifact coordination layer to manage interdependent software variants, and (iv) an LSP plugin generator, reducing $E$ to $1$ by automating plugin creation for multiple editors. To simplify editing support for language families, each language artifact integrates its own Typelang variant, used to generate language servers. This reduces combinations to $T \times 1$, where $T = L$ represents the number of type systems. Further reuse of language artifacts across languages lowers this to $N \times 1$, where $N \ll T$, representing unique type systems. We implement Typelang in Neverlang, generating language servers for each artifact and LSP plugins for three editors. Empirical evaluation shows a 93.48% reduction in characters needed for type system implementation and 100% automation of LSP plugin generation, significantly lowering effort for editing support in language families, especially when artifacts are reused.
Bialgebraic Reasoning on Stateful Languages
Sergey Goncharov, Stefan Milius, Lutz Schröder et al.
Reasoning about program equivalence in imperative languages is notoriously challenging, as the presence of states (in the form of variable stores) fundamentally increases the observational power of program terms. The key desideratum for any notion of equivalence is compositionality, guaranteeing that subprograms can be safely replaced by equivalent subprograms regardless of the context. To facilitate compositionality proofs and avoid boilerplate work, one would hope to employ the abstract bialgebraic methods provided by Turi and Plotkin's powerful theory of mathematical operational semantics (a.k.a. abstract GSOS) or its recent extension by Goncharov et al. to higher-order languages. However, multiple attempts to apply abstract GSOS to stateful languages have thus far failed. We propose a novel approach to the operational semantics of stateful languages based on the formal distinction between readers (terms that expect an initial input store before being executed) and writers (running terms that have already been provided with a store). In contrast to earlier work, this style of semantics is fully compatible with abstract GSOS, and we can thus leverage the existing theory to obtain coinductive reasoning techniques. We demonstrate that our approach generates non-trivial compositionality results for stateful languages with first-order and higher-order store and that it flexibly applies to program equivalences at different levels of granularity, such as trace, cost, and natural equivalence.
Idefix-Closed Languages and Their Application in Contextual Grammars
Marvin Ködding, Bianca Truthe
In this paper, we continue the research on the power of contextual grammars with selection languages from subfamilies of the family of regular languages. We investigate infix-, prefix-, and suffix-closed languages (referred to as idefix-closed languages) and compare such language families to some other subregular families of languages (finite, monoidal, nilpotent, combinational, (symmetric) definite, ordered, non-counting, power-separating, commutative, circular, union-free, star, and comet languages). Further, we compare the families of the hierarchies obtained for external and internal contextual grammars with the language families defined by these new types for the selection. In this way, we extend the existing hierarchies by new language families. Moreover, we solve an open problem regarding internal contextual grammars with suffix-closed selection languages.
2-1-3 orders in Dutch verb clusters
P. Ackema, A. Neeleman
A generalization that has emerged in the literature on verb clusters in West Germanic languages is that so-called 2-1-3 orders (where verb n selects verb n+1) are absent with core clustering verbs. In this paper we show that Dutch permits 2-1-3 order as an optional variant of the more widely acknowledged 1-2-3 order. The 2-1-3 order is subject to a range of restrictions involving the lexical items in the cluster, the morphological form of these items, whether other elements intervene, and where focus is placed. We argue that these restrictions are best understood if the 2-1-3 order is derived from an underlying 1-2-3 cluster through a post-syntactic inversion rule. This rule shares various properties with other inversion rules but cannot be reduced entirely to a familiar rule type.
Comparative linguistics in the teaching of Spanish as a foreign language (ELE): applications and didactic strategies
Carlos Melgar García
The acquisition of Spanish as a foreign language (ELE) is shaped by the student’s native language, which can either aid or hinder learning. Comparative Linguistics helps understand language transfer and reduce interference. This study examines its impact on ELE, identifying frequent errors and suggesting contrastive teaching strategies. Recurrent mistakes in verb conjugation, article usage, and pronunciation are analyzed through comparisons with Romance, Germanic, Slavic, and non-Indo-European languages. Adapted teaching materials are proposed. Results emphasize the need to integrate Comparative Linguistics into ELE to optimize Spanish learning, reduce fossilization, and enhance communicative competence. A contrastive approach promotes more effective learning that is tailored to linguistic diversity.
En oavslutad dikt om ett oavslutat uppror
Ulrika Franke
The legendary Russian literary critic Belinsky famously described Pushkin’s novel in verse Eugene Onegin as an encyclopedia of Russian life. However, this encyclopedia seems seriously incomplete in that it largely leaves out elements of oppression, war, and insurrection. There are many valid explanations for this, but one, very blunt and prosaic, is that oppression and censorship actually worked – that oppression is absent in the fiction because it was present in reality. As a case in point, this article presents a novel translation into Swedish, with rhymes and meter preserved, of the fragments remaining of the unfinished tenth chapter of Eugene Onegin. This tenth chapter deals with the failed Decembrist uprising of 1825, and the misrule precipitating it, and it is not surprising that it could not be published at the time it was written. Though well known in the academic community, this fragment is rarely published in foreign translations, and as far as is known, this is the first translation into a Scandinavian language. The article offers some commentary on the translation and concludes with a few remarks on the value of reading the classics even in times of turmoil.
Resolution agreement in German and Dutch: implications for person feature decomposition
Imke Driemel
Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks
Dan S. Nielsen, Kenneth C. Enevoldsen, Peter Schneider-Kamp
This paper explores the performance of encoder and decoder language models on multilingual Natural Language Understanding (NLU) tasks, with a broad focus on Germanic languages. Building upon the ScandEval benchmark, initially restricted to evaluating encoder models, we extend the evaluation framework to include decoder models. We introduce a method for evaluating decoder models on NLU tasks and apply it to the languages Danish, Swedish, Norwegian, Icelandic, Faroese, German, Dutch, and English. Through a series of experiments and analyses, we also address research questions regarding the comparative performance of encoder and decoder models, the impact of NLU task types, and the variation across language resources. Our findings reveal that encoder models can achieve significantly better NLU performance than decoder models despite having orders of magnitude fewer parameters. Additionally, we investigate the correlation between decoders and task performance via a UMAP analysis, shedding light on the unique capabilities of decoder and encoder models. This study contributes to a deeper understanding of language model paradigms in NLU tasks and provides valuable insights for model selection and evaluation in multilingual settings.
Computer Science
A Natural Formalized Proof Language
Lihan Xie, Zhicheng Hui, Qinxiang Cao
AI-assisted mathematical proof has become an area of intense research interest. One key problem in this field is generating formal mathematical proofs from natural language proofs. For historical reasons, the formal proof languages adopted by traditional theorem provers were not intended to represent natural language proofs. They are therefore not well suited to this task or to proof checking for educational purposes. In this paper, we design a proof language and its corresponding abstract syntax tree and implement a proof checking tool for it. Natural language proofs can be easily converted into this language, which provides a rich corpus of formal proofs. Additionally, the language supports the handling of issues in informal proofs through static analysis, and its expressive power is enhanced by introducing the structure of partial proofs. This design combines the expressiveness of natural language and the accuracy of formal language, resulting in an improved mathematical proof language.
On the Representation and State Complexity of Block Languages
Guilherme Duarte, Nelma Moreira, Luca Prigioniero et al.
In this paper, we consider block languages, namely sets of words having the same length, and we propose a new representation for these languages. In particular, given an alphabet of size $k$ and a length $\ell$, a block language can be represented by a bitmap of length $k^\ell$, where each bit indicates whether the corresponding word, according to the lexicographical order, belongs to the language or not (bit equal to 1 or 0, respectively). First, we show how to convert bitmaps into deterministic and nondeterministic finite automata, and we prove that the machines are minimal. Then, we give an analysis of the maximum number of states sufficient to accept every block language in the deterministic and nondeterministic case. Finally, we study the deterministic and nondeterministic state complexity of several operations on these languages. Since block languages are a subclass of finite languages, the upper bounds of operational state complexity known for finite languages apply to block languages as well. However, in several cases, smaller values were found.
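The bitmap representation described in this abstract can be sketched as follows. This is a minimal illustration under the assumption that the alphabet is ordered by Python's default character ordering; the function names are hypothetical, and the conversion to automata studied in the paper is not covered.

```python
from itertools import product


def words_in_lex_order(alphabet, length):
    """All k^length words over the alphabet, in lexicographic order."""
    return ["".join(p) for p in product(sorted(alphabet), repeat=length)]


def to_bitmap(language, alphabet, length):
    """Encode a block language as a string of k^length bits."""
    lang = set(language)
    if any(len(w) != length or not set(w) <= set(alphabet) for w in lang):
        raise ValueError("every word must have the given length and alphabet")
    return "".join(
        "1" if w in lang else "0" for w in words_in_lex_order(alphabet, length)
    )


def from_bitmap(bitmap, alphabet, length):
    """Decode a bitmap back into the set of words it represents."""
    words = words_in_lex_order(alphabet, length)
    if len(bitmap) != len(words):
        raise ValueError("bitmap length must equal k^length")
    return {w for w, bit in zip(words, bitmap) if bit == "1"}


if __name__ == "__main__":
    # Words of length 2 over {a, b} in order: aa, ab, ba, bb.
    print(to_bitmap({"ab", "bb"}, "ab", 2))  # prints 0101
```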
Subsequence Matching and Analysis Problems for Formal Languages
Szilárd Zsolt Fazekas, Tore Koß, Florin Manea et al.
In this paper, we study a series of algorithmic problems related to the subsequences occurring in the strings of a given language, under the assumption that this language is succinctly represented by a grammar generating it, or an automaton accepting it. In particular, we focus on the following problems: Given a string $w$ and a language $L$, does there exist a word of $L$ which has $w$ as a subsequence? Do all words of $L$ have $w$ as a subsequence? Given an integer $k$ alongside $L$, does there exist a word of $L$ which has all strings of length $k$, over the alphabet of $L$, as subsequences? Do all words of $L$ have all strings of length $k$ as subsequences? For the last two problems, efficient algorithms were already presented in [Adamson et al., ISAAC 2023] for the case when $L$ is a regular language, and efficient solutions can be easily obtained for the first two problems. We extend that work as follows: we give sufficient conditions on the class of input languages, under which these problems are decidable; we provide efficient algorithms for all these problems in the case when the input language is context-free; we show that all problems are undecidable for context-sensitive languages. Finally, we provide a series of initial results related to a class of languages that strictly includes the regular languages and is strictly included in the class of context-sensitive languages, but is incomparable to the class of context-free languages; these results deviate significantly from those reported for language classes from the Chomsky hierarchy.
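For the regular case, the first two questions can be illustrated with a product construction: track pairs (automaton state, length of the prefix of $w$ matched greedily so far). This is a sketch under assumptions, not the algorithm from the paper; the automaton encoding (a dict from `(state, symbol)` to a set of successor states) and the function names are hypothetical.

```python
from collections import deque


def reachable_pairs(transitions, start, w):
    """Breadth-first search over (state, i), where i is the number of
    letters of w matched greedily along the path read so far."""
    seen = {(start, 0)}
    queue = deque(seen)
    while queue:
        state, i = queue.popleft()
        for (src, symbol), targets in transitions.items():
            if src != state:
                continue
            # Greedy matching: advance in w whenever the next letter fits.
            j = i + 1 if i < len(w) and symbol == w[i] else i
            for target in targets:
                if (target, j) not in seen:
                    seen.add((target, j))
                    queue.append((target, j))
    return seen


def some_word_has_subsequence(transitions, start, finals, w):
    """True if at least one accepted word contains w as a subsequence."""
    pairs = reachable_pairs(transitions, start, w)
    return any(q in finals and i == len(w) for q, i in pairs)


def all_words_have_subsequence(transitions, start, finals, w):
    """True if every accepted word contains w (vacuously true if L is empty)."""
    pairs = reachable_pairs(transitions, start, w)
    return all(i == len(w) for q, i in pairs if q in finals)


if __name__ == "__main__":
    # Automaton for ab(ab)*: state 0 is initial, state 2 is accepting.
    t = {(0, "a"): {1}, (1, "b"): {2}, (2, "a"): {1}}
    print(all_words_have_subsequence(t, 0, {2}, "ab"))
    print(some_word_has_subsequence(t, 0, {2}, "ba"))
    print(all_words_have_subsequence(t, 0, {2}, "ba"))
```

Greedy matching suffices because a word contains $w$ as a subsequence exactly when matching each letter of $w$ at its earliest possible position succeeds.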
Tegn på språklig kjønnsdiskriminering i norsk og polsk
Anna Olszewska
Germanic languages. Scandinavian languages, History of Northern Europe. Scandinavia
Norsk som en del av et flerspråklig familierepertoar i en trespråklig familie
Ragni Vik Johnsen
This article uses analytical methods from interactional sociolinguistics to study how a trilingual family incorporates Norwegian into the family's multilingual repertoire, and how Norwegian is used as a resource for constructing the family as a social community. The data were collected as part of a larger project on multilingual adolescents and families, and in this article I analyze self-recordings (approximately 10 hours) from one trilingual family. The focus family consists of two parents and five children between 3 and 18 years of age, who have moved to a town in Northern Norway from a country in Central America and use English, Spanish, and Norwegian in daily life. Through close turn-by-turn analyses of family interaction, I examine how the family members use multilingual resources, in particular features of Norwegian, to negotiate social positionings and to construct the family as a social arena. The analyses show that the children in the family are often positioned as Norwegian experts by the parents, and that the parents position themselves as learners. The analyses also show how Norwegian is taken up in linguistic practices in the family, and that certain family members function as "catalysts" for the use of Norwegian. The article thus shows how this multilingual family draws on all of its linguistic resources in everyday activities such as shared meals.
North Germanic. Scandinavian
Towards Porting Operating Systems with Program Synthesis
Jingmei Hu, Eric Lu, David A. Holland et al.
The end of Moore's Law has ushered in a diversity of hardware not seen in decades. Operating system (and system software) portability is accordingly becoming increasingly critical. Simultaneously, there has been tremendous progress in program synthesis. We set out to explore the feasibility of using modern program synthesis to generate the machine-dependent parts of an operating system. Our ultimate goal is to generate new ports automatically from descriptions of new machines. One of the issues involved is writing specifications, both for machine-dependent operating system functionality and for instruction set architectures. We designed two domain-specific languages: Alewife for machine-independent specifications of machine-dependent operating system functionality and Cassiopea for describing instruction set architecture semantics. Automated porting also requires an implementation. We developed a toolchain that, given an Alewife specification and a Cassiopea machine description, specializes the machine-independent specification to the target instruction set architecture and synthesizes an implementation in assembly language with a customized symbolic execution engine. Using this approach, we demonstrate successful synthesis of a total of 140 OS components from two pre-existing OSes for four real hardware platforms. We also developed several optimization methods for OS-related assembly synthesis to improve scalability. The effectiveness of our languages and ability to synthesize code for all 140 specifications is evidence of the feasibility of program synthesis for machine-dependent OS code. However, many research challenges remain; we also discuss the benefits and limitations of our synthesis-based approach to automated OS porting.
How ISO C became unusable for operating systems development
Victor Yodaiken
The C programming language was developed in the 1970s as a fairly unconventional systems and operating systems development tool, but has, through the course of the ISO Standards process, added many attributes of more conventional programming languages and become less suitable for operating systems development. Operating system programming continues to be done in non-ISO dialects of C. The differences provide a glimpse of operating system requirements for programming languages.