Results for "Romanic languages"

Showing 20 of ~3,335,528 results · from arXiv, DOAJ, CrossRef, Semantic Scholar

arXiv Open Access 2026
Targeted Syntactic Evaluation of Language Models on Georgian Case Alignment

Daniel Gallagher, Gerhard Heyer

This paper evaluates the performance of transformer-based language models on split-ergative case alignment in Georgian, a particularly rare system for assigning grammatical cases to mark argument roles. We focus on subject and object marking determined through various permutations of nominative, ergative, and dative noun forms. A treebank-based approach for the generation of minimal pairs using the Grew query language is implemented. We create a dataset of 370 syntactic tests made up of seven tasks containing 50-70 samples each, where three noun forms are tested in any given sample. Five encoder- and two decoder-only models are evaluated with word- and/or sentence-level accuracy metrics. Regardless of the specific syntactic makeup, models performed worst in assigning the ergative case correctly and strongest in assigning the nominative case correctly. Performance correlated with the overall frequency distribution of the three forms (NOM > DAT > ERG). Though data scarcity is a known issue for low-resource languages, we show that the highly specific role of the ergative along with a lack of available training data likely contributes to poor performance on this case. The dataset is made publicly available and the methodology provides an interesting avenue for future syntactic evaluations of languages where benchmarks are limited.

en cs.CL
arXiv Open Access 2025
Using Code Snippets to Teach Programming Languages

Joshua Akingbade, Jianhua Yang, Mir Seyedebrahimi

Coding is a fundamental skill required in the engineering discipline, and much work exists exploring better ways of teaching coding in the higher education context. In particular, Code Snippets (CSs) have proven to be an effective way of introducing programming language units to students. CSs are portions of source code of varying size and content. They can be used in a myriad of ways, one of which is to teach the code they contain as well as its function. To further explore the use of CSs, a pedagogical summer internship project was set up at the Warwick Manufacturing Group (WMG). The scope of the considerations for the study derives from an educational standpoint. Within the evaluations made, the focus was primarily on information that provided evidence about the methodology involved in either teaching or developing teaching materials. Taking the results into account from a pedagogical perspective, several qualities of popular code snippet tutorials were found to benefit or hinder the learning process, including code length, interactivity, further support, and quality of explanation. These qualities are then combined and used to present a plan for the design of an effective learning resource that makes use of code snippets.

arXiv Open Access 2025
RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs

Alan Saji, Jaavid Aktar Husain, Thanmay Jayakumar et al.

Large Language Models (LLMs) exhibit strong multilingual performance despite being predominantly trained on English-centric corpora. This raises a fundamental question: How do LLMs achieve such multilingual capabilities? Focusing on languages written in non-Roman scripts, we investigate the role of Romanization - the representation of non-Roman scripts using Roman characters - as a potential bridge in multilingual processing. Using mechanistic interpretability techniques, we analyze next-token generation and find that intermediate layers frequently represent target words in Romanized form before transitioning to native script, a phenomenon we term Latent Romanization. Further, through activation patching experiments, we demonstrate that LLMs encode semantic concepts similarly across native and Romanized scripts, suggesting a shared underlying representation. Additionally, for translation into non-Roman script languages, our findings reveal that when the target language is in Romanized form, its representations emerge earlier in the model's layers compared to native script. These insights contribute to a deeper understanding of multilingual representation in LLMs and highlight the implicit role of Romanization in facilitating language transfer.

en cs.CL, cs.AI
arXiv Open Access 2025
Syntax Repair as Language Intersection

Breandan Considine

We introduce a new technique for repairing syntax errors in arbitrary context-free languages. This technique models syntax repair as a language intersection problem by defining a finite language that provably generates every syntactically valid repair within a given edit distance. Leveraging a theoretical connection between the Bar-Hillel construction from formal language theory and CFL reachability from program analysis, we show that repairability in a finite number of typographic edits is polylogarithmic parallel time decidable and provide an enumeration algorithm based on the Brzozowski derivative. Finally, we evaluate this algorithm and its implementation, demonstrating state-of-the-art results on a Python syntax repair benchmark.
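The Brzozowski derivative mentioned in the abstract is easiest to see for plain regular expressions. The sketch below (Python, and only a toy: the paper's algorithm operates on context-free languages and their intersections, not on this mini regex AST) illustrates derivative-based matching, where the derivative of a language by a character `a` is the set of suffixes `w` such that `a·w` is in the language:

```python
from dataclasses import dataclass

# Tiny regular-expression AST: 0 (empty), epsilon, a, r|s, r.s, r*
class Re: pass

@dataclass(frozen=True)
class Empty(Re): pass          # matches nothing

@dataclass(frozen=True)
class Eps(Re): pass            # matches only the empty string

@dataclass(frozen=True)
class Chr(Re):
    c: str

@dataclass(frozen=True)
class Alt(Re):
    l: Re; r: Re

@dataclass(frozen=True)
class Seq(Re):
    l: Re; r: Re

@dataclass(frozen=True)
class Star(Re):
    r: Re

def nullable(r: Re) -> bool:
    """Does r accept the empty string?"""
    if isinstance(r, (Eps, Star)): return True
    if isinstance(r, (Empty, Chr)): return False
    if isinstance(r, Alt): return nullable(r.l) or nullable(r.r)
    return nullable(r.l) and nullable(r.r)  # Seq

def deriv(r: Re, a: str) -> Re:
    """Brzozowski derivative: the language of words w with a.w in L(r)."""
    if isinstance(r, (Empty, Eps)): return Empty()
    if isinstance(r, Chr): return Eps() if r.c == a else Empty()
    if isinstance(r, Alt): return Alt(deriv(r.l, a), deriv(r.r, a))
    if isinstance(r, Star): return Seq(deriv(r.r, a), r)
    # Seq: if the left part can be empty, the derivative may also skip it
    d = Seq(deriv(r.l, a), r.r)
    return Alt(d, deriv(r.r, a)) if nullable(r.l) else d

def matches(r: Re, w: str) -> bool:
    """w is in L(r) iff the derivative by all of w is nullable."""
    for a in w:
        r = deriv(r, a)
    return nullable(r)

# (ab)* recognizes "", "ab", "abab", ...
ab_star = Star(Seq(Chr("a"), Chr("b")))
```

In the paper's setting the same derivative idea drives enumeration of repairs rather than a yes/no match, but the recursion over the language structure is the common core.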

en cs.FL, cs.PL
arXiv Open Access 2025
Example-Free Learning of Regular Languages with Prefix Queries

Eve Fernando, Sasha Rubin, Rahul Gopinath

Language learning refers to the problem of inferring a mathematical model which accurately represents a formal language. Many language learning algorithms learn by asking certain types of queries about the language being modeled. Language learning is of practical interest in the field of cybersecurity, where it is used to model the language accepted by a program's input parser (also known as its input processor). In this setting, a learner can only query a string of its choice by executing the parser on it, which limits the language learning algorithms that can be used. Most practical parsers can indicate not only whether a string is valid, but also where the parsing failed. This extra information can be leveraged to produce a type of query we call the prefix query. Notably, no existing language learning algorithms make use of prefix queries, though some ask membership queries, i.e., whether or not a given string is valid. When these approaches are used to learn the language of a parser, the prefix information provided by the parser remains unused. In this work, we present PL*, the first known language learning algorithm to make use of the prefix query, and a novel modification of the classical L* algorithm. We show both theoretically and empirically that PL* is able to learn more efficiently than L* due to its ability to exploit the additional information given by prefix queries over membership queries. Furthermore, we show how PL* can be used to learn the language of a parser by adapting it to a more practical setting in which prefix queries are the only source of information available to it; that is, it does not have access to any labelled examples or any other types of queries. We demonstrate empirically that, even in this more constrained setting, PL* is still capable of accurately learning a range of languages of practical interest.
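The distinction the abstract draws between membership and prefix queries can be illustrated with a toy oracle. The sketch below uses the balanced-parentheses language as a stand-in for a parser that reports where parsing fails; the language choice and both function names are illustrative assumptions, not part of PL* itself:

```python
def prefix_query(s: str) -> bool:
    """Prefix oracle: can s be extended to a balanced-parenthesis string?
    Mimics a parser that reports the failure position: a prefix is viable
    as long as it never closes more parentheses than it has opened."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False   # a real parser would fail exactly here
        else:
            return False       # character outside the alphabet
    return True

def membership_query(s: str) -> bool:
    """Classic membership oracle, for comparison: is s itself in the language?"""
    return prefix_query(s) and s.count("(") == s.count(")")
```

Note the information gap the abstract exploits: `membership_query("(((")` is False, which tells a learner nothing about why, while `prefix_query("(((")` is True, revealing that the string is still on a viable path through the parser.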

en cs.FL, cs.LG
DOAJ Open Access 2025
Parole de présidents (1958–2022)

Dominique Labbé, Jacques Savoy

Over the past sixty-six years, eight presidents have successively headed the Fifth French Republic (de Gaulle, Pompidou, Giscard d’Estaing, Mitterrand, Chirac, Sarkozy, Hollande, Macron). After presenting the corpus of their speeches – 9,202 texts and more than 20 million labelled words – we characterize the style of each through vocabulary (lemmas and part-of-speech). A deeper analysis reveals the typical sequences of each occupant of the Élysée. Based on an intertextual distance between all presidential speeches, a synthesis can be drawn reflecting the similarities and differences between presidents.

Romanic languages
DOAJ Open Access 2025
Humor Político: el género desde la óptica de la Lingüística de Corpus

Ariel Novodvorski

In this paper, we present a genre study through the lens of Corpus Linguistics, specifically via the analysis of a corpus of journalistic opinion articles by the writer Alejandro Borensztein, published in the Sunday column Humor Político of the Argentine newspaper Clarín. Our goals are to identify, analyze, and describe lexicogrammatical elements that point to the establishment of political humor as a genre, based on an empirical-exploratory analysis of the study corpus using tools, techniques, and resources characteristic of research on written text corpora. We draw on the theoretical framework of Systemic Functional Linguistics, regarding language as social semiotic (Halliday 1978) and the specific semiotic functions of text, with social value within a culture; that is, genres in relation to life in society. Starting from keyword extraction and the analysis of concordance lines, we examine the occurrences in context. The results point to different lexicogrammatical choices, which confirm that the discourses emerging from the corpus fulfill a persuasive social role, bringing a critical view of the political-economic situation to the fore.

Romanic languages, Philology. Linguistics
S2 Open Access 2025
Problems of fragmentation of the Latin language continuum in the intellectual history of Romanic linguistics

M. Kyrchanoff

The purpose of this study is to analyze the perception and revision of problems in the history of the common Romance condition and the dialectal fragmentation of Latin in the intellectual history of Romance philology. The author analyzes various viewpoints on the common Romance condition and on the history of literary and Vulgar Latin, including the various arguments proposed in the history of Romance linguistics. The novelty of the study lies in analyzing interpretations of the common Romance condition and the history of Latin in the contexts of intellectual history and the history of ideas, including the resource of historical revisionism, since the history of science, Romance linguistics included, is a succession of interpretative models. Materials and methods. Methodologically, the article rests on the principles of interdisciplinary historiography developed in intellectual and cultural history, which make it possible, by revising the viewpoints of the earlier historiographic tradition, to analyze the problems of the common Romance condition as part of the history of ideas and as an invented historiographic tradition or intellectual construct. Results. The results of the study suggest that 1) the idea of a common Romance condition in the history of Romance studies became a historiographical concept and an invented tradition; 2) discussions and debates about the existence or absence of a common Romance condition, as well as about the dialectal division of Latin and the role of social factors in the development of Vulgar Latin, stimulated the search for new ideas and contributed to the development of Romance philology and to the genesis of new interpretative models and explanations; 3) the controversial, debatable nature of the concepts proposed in the history of Romance philology is revealed through the prism of interpretative models of diglossia and social linguistics, which actualizes the potential of revisionism as a universal method for generating new meanings and explanations in the modern humanities, including Romance linguistics.

S2 Open Access 2019
Life with Two Languages

F. Grosjean

In 1979, the author taught a course on bilingualism for the first time. Since there was no adequate textbook on the topic, he contacted Harvard University Press and offered to write a book on the subject. They accepted and the book came out in 1982. It was original in many ways, among them the personal testimonies of bilinguals and the many examples of bilingual speech. During this time, the author met with the bilingualism specialist, Einar Haugen. He and his wife, Eva, encouraged the author and helped him accept his own bilingualism. He remained in touch with the couple until they passed away in the 1990s. The author honored them recently at the Einar Haugen lecture in Oslo.

183 citations en Computer Science
arXiv Open Access 2024
Open foundation models for Azerbaijani language

Jafar Isbarov, Kavsar Huseynova, Elvin Mammadov et al.

The emergence of multilingual large language models has enabled the development of language understanding and generation systems in Azerbaijani. However, most of the production-grade systems rely on cloud solutions, such as GPT-4. While there have been several attempts to develop open foundation models for Azerbaijani, these works have not found their way into common use due to a lack of systemic benchmarking. This paper encompasses several lines of work that promote open-source foundation models for Azerbaijani. We introduce (1) a large text corpus for Azerbaijani, (2) a family of encoder-only language models trained on this dataset, (3) labeled datasets for evaluating these models, and (4) extensive evaluation that covers all major open-source models with Azerbaijani support.

en cs.CL
arXiv Open Access 2024
Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models

Sina Bagheri Nezhad, Ameeta Agrawal, Rhitabrat Pokharel

Multilingual language models (MLLMs) are crucial for handling text across various languages, yet they often show performance disparities due to differences in resource availability and linguistic characteristics. While the impact of pre-train data percentage and model size on performance is well-known, our study reveals additional critical factors that significantly influence MLLM effectiveness. Analyzing a wide range of features, including geographical, linguistic, and resource-related aspects, we focus on the SIB-200 dataset for classification and the Flores-200 dataset for machine translation, using regression models and SHAP values across 204 languages. Our findings identify token similarity and country similarity as pivotal factors, alongside pre-train data and model size, in enhancing model performance. Token similarity facilitates cross-lingual transfer, while country similarity highlights the importance of shared cultural and linguistic contexts. These insights offer valuable guidance for developing more equitable and effective multilingual language models, particularly for underrepresented languages.

en cs.CL, cs.AI
arXiv Open Access 2024
Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study

Wan-Hua Her, Udo Kruschwitz

Machine Translation has made impressive progress in recent years, offering close to human-level performance on many languages, but studies have primarily focused on high-resource languages with a broad online presence and resources. With the help of growing Large Language Models, more and more low-resource languages achieve better results through the presence of other languages. However, studies have shown that not all low-resource languages can benefit from multilingual systems, especially those with insufficient training and evaluation data. In this paper, we revisit state-of-the-art Neural Machine Translation techniques to develop automatic translation systems between German and Bavarian. We investigate conditions of low-resource languages such as data scarcity and parameter sensitivity, and focus both on refined solutions that combat low-resource difficulties and on creative solutions such as harnessing language similarity. Our experiment entails applying Back-translation and Transfer Learning to automatically generate more training data and achieve higher translation performance. We demonstrate noisiness in the data and present our approach to extensive text preprocessing. Evaluation was conducted using combined metrics: BLEU, chrF and TER. Statistical significance tests with Bonferroni correction show surprisingly strong baseline systems, and that Back-translation leads to significant improvement. Furthermore, we present a qualitative analysis of translation errors and system limitations.

en cs.CL
arXiv Open Access 2024
Vision Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models

Namkyeong Lee, Siddhartha Laghuvarapu, Chanyoung Park et al.

Recently, there has been a growing interest among researchers in understanding molecules and their textual descriptions through molecule language models (MoLM). However, despite some early promising developments, the advancement of MoLM still trails significantly behind that of vision language models (VLM). This is because unique challenges exist apart from VLM in the field of MoLM due to 1) a limited amount of molecule-text paired data and 2) missing expertise that occurred due to the specialized areas of focus among the experts. To this end, we propose AMOLE, which 1) augments molecule-text pairs with structural similarity preserving loss, and 2) transfers the expertise between the molecules. Specifically, AMOLE enriches molecule-text pairs by sharing descriptions among structurally similar molecules with a novel structural similarity preserving loss. Moreover, we propose an expertise reconstruction loss to transfer knowledge from molecules that have extensive expertise to those with less expertise. Extensive experiments on various downstream tasks demonstrate the superiority of AMOLE in comprehending molecules and their descriptions, highlighting its potential for application in real-world drug discovery. The source code for AMOLE is available at https://github.com/Namkyeong/AMOLE.

en cs.AI
arXiv Open Access 2024
Mapping 'when'-clauses in Latin American and Caribbean languages: an experiment in subtoken-based typology

Nilo Pedrazzini

Languages can encode temporal subordination lexically, via subordinating conjunctions, and morphologically, by marking the relation on the predicate. Systematic cross-linguistic variation among the former can be studied using well-established token-based typological approaches to token-aligned parallel corpora. Variation among different morphological means is instead much harder to tackle and therefore more poorly understood, despite being predominant in several language groups. This paper explores variation in the expression of generic temporal subordination ('when'-clauses) among the languages of Latin America and the Caribbean, where morphological marking is particularly common. It presents probabilistic semantic maps computed on the basis of the languages of the region, thus avoiding bias towards the world's many languages that exclusively use lexified connectors, incorporating associations between character n-grams and English 'when'. The approach allows capturing morphological clause-linkage devices in addition to lexified connectors, paving the way for larger-scale, strategy-agnostic analyses of typological variation in temporal subordination.
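As a rough illustration of associating character n-grams with English 'when' across aligned sentence pairs, the sketch below scores each target-language n-gram by how much more often it occurs opposite a 'when'-sentence than overall. This is a crude stand-in, not the paper's method (its probabilistic semantic maps are computed differently), and any example corpus fed to it here would be invented:

```python
from collections import Counter

def char_ngrams(sentence: str, n: int = 3):
    """Yield all character n-grams of each token, with '#' marking word edges."""
    for tok in sentence.split():
        padded = f"#{tok}#"
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]

def when_associations(pairs, n=3):
    """Given (english, target) sentence pairs, score each target n-gram by
    P(ngram | 'when' on the English side) - P(ngram overall).
    Positive scores suggest the n-gram co-varies with 'when'-clauses."""
    all_counts, when_counts = Counter(), Counter()
    n_all = n_when = 0
    for en, tgt in pairs:
        grams = set(char_ngrams(tgt, n))   # presence per sentence, not frequency
        all_counts.update(grams)
        n_all += 1
        if "when" in en.lower().split():
            when_counts.update(grams)
            n_when += 1
    if n_when == 0:
        return {}
    return {g: when_counts[g] / n_when - all_counts[g] / n_all
            for g in when_counts}
```

On a real token-aligned parallel corpus, high-scoring n-grams would point at candidate morphological 'when'-markers even in languages with no lexified connector.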

en cs.CL, cs.IR
DOAJ Open Access 2024
What lexical acquisition has to say about a non-lexicalist architecture of grammar – and vice-versa

Rafael Luis Beraldo, Paulo Ângelo Araújo-Adriano

Distributed Morphology (DM) predicts that the units of syntactic derivation are smaller than words. This paper explores the implications of this prediction for language acquisition research and questions if DM is descriptively and theoretically sound when faced with acquisition phenomena. We first introduce independent evidence supporting sub-word units in acquisition: results from a computational model of lexical acquisition show that slightly more morphologically complex input data, such as Brazilian Portuguese when compared to English, cause a substantial decrease in the model’s performance; children’s early productions when acquiring polysynthetic languages reveal they are attempting to find these languages’ morphological units, instead of relying on chunks of non-analysed material; and words are shown to lack explanatory power in describing language acquisition in terms of storage, bootstrapping, or production. We then bridge the gap between DM and earlier proposals for the identification of words and formal features, briefly outlining a strategy for acquiring morphemes. Finally, we present accounts for two prevalent phenomena linked to language acquisition through the lens of DM: the overregularisation in acquiring irregular verbs, a step observed in children acquiring different languages; and the Brazilian Portuguese verbal paradigm shift, an example of morphological diachronic change. Our findings support the non-lexicalist derivation of words, highlighting that words are incompatible with acquisition from multiple perspectives. This suggests that understanding language acquisition benefits from considering the smaller, morpheme-based units predicted by Distributed Morphology.

Romanic languages, Philology. Linguistics
arXiv Open Access 2023
Regular Expressions in a CS Formal Languages Course

Marco T. Morazán

Regular expressions in an Automata Theory and Formal Languages course are mostly treated as a theoretical topic. That is, to some degree their mathematical properties and their role in describing languages are discussed. This approach fails to capture the interest of most Computer Science students. It is a missed opportunity to engage Computer Science students, who are far more motivated by practical applications of theory. To this end, regular expressions may be discussed as the description of an algorithm to generate words in a language, one that is easily programmed. This article describes a programming-based methodology to introduce students to regular expressions in an Automata Theory and Formal Languages course. The language of instruction is FSM, in which there is a regular expression type. This facilitates the study of regular expressions and of algorithms based on regular expressions.
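The abstract's view of a regular expression as "the description of an algorithm to generate words in a language" can be sketched directly. The toy below is in Python rather than FSM (the course's language), and its nested-tuple encoding of regexes is an assumption made purely for illustration:

```python
import random

def gen_word(rx, star_p=0.5):
    """Randomly generate one word of the language described by rx.
    rx is a nested tuple: ("char", c), ("seq", r, s), ("alt", r, s),
    or ("star", r). Each node is read as an instruction for building a word."""
    op = rx[0]
    if op == "char":
        return rx[1]                                   # emit the literal
    if op == "seq":
        return gen_word(rx[1], star_p) + gen_word(rx[2], star_p)
    if op == "alt":
        return gen_word(random.choice(rx[1:]), star_p) # pick one branch
    if op == "star":
        out = ""
        while random.random() < star_p:                # repeat zero or more times
            out += gen_word(rx[1], star_p)
        return out
    raise ValueError(f"unknown operator: {op!r}")

# (a|b)a* : one 'a' or 'b' followed by zero or more 'a's
rx = ("seq", ("alt", ("char", "a"), ("char", "b")), ("star", ("char", "a")))
```

Running `gen_word(rx)` repeatedly samples words like `"a"`, `"baa"`, or `"aaaa"`, which is the generative reading of regular expressions the article advocates.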

en cs.PL, cs.FL
arXiv Open Access 2023
Hitting the Romans

Henning Fernau, Kevin Mann

Roman domination is one of the few examples where the related extension problem is polynomial-time solvable even though the original decision problem is NP-complete. This is interesting, as it makes it possible to establish polynomial-delay enumeration algorithms for finding minimal Roman dominating functions, while it has been open for more than four decades whether all minimal dominating sets of a graph, or all hitting sets of a hypergraph, can be enumerated with polynomial delay. To find the reason why this is the case, we combine the idea of hitting sets with the idea of Roman domination. We hence obtain and study two new problems, called Roman Hitting Function and Roman Hitting Set, both generalizing Roman Domination. This allows us to delineate the borderline of polynomial-delay enumerability. Here, we assume what we call the Hitting Set Transversal Thesis, claiming that it is impossible to enumerate all minimal hitting sets of a hypergraph with polynomial delay. Our first focus is on the extension versions of these problems. In doing so, we find conditions under which the Extension Roman Hitting Function problem is NP-complete. We then use parameterized complexity to get a better understanding of why Extension Roman Hitting Function behaves in this way. Furthermore, we analyze the parameterized and approximation complexity of the underlying optimization problems. We also discuss consequences for Roman variants of other problems like Vertex Cover.

en cs.CC
arXiv Open Access 2022
Toward More Meaningful Resources for Lower-resourced Languages

Constantine Lignos, Nolan Holley, Chester Palen-Michel et al.

In this position paper, we describe our perspective on how meaningful resources for lower-resourced languages should be developed in connection with the speakers of those languages. We first examine two massively multilingual resources in detail. We explore the contents of the names stored in Wikidata for a few lower-resourced languages and find that many of them are not in fact in the languages they claim to be and require non-trivial effort to correct. We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand annotated data. We then discuss the importance of creating annotation for lower-resourced languages in a thoughtful and ethical way that includes the languages' speakers as part of the development process. We conclude with recommended guidelines for resource development.

en cs.CL, cs.AI
S2 Open Access 2019
On the Impact of Programming Languages on Code Quality

E. Berger, Celeste Hollenbeck, Petr Maj et al.

In a 2014 article, Ray, Posnett, Devanbu, and Filkov claimed to have uncovered a statistically significant association between 11 programming languages and software defects in 729 projects hosted on GitHub. Specifically, their work answered four research questions relating to software defects and programming languages. With data and code provided by the authors, the present article first attempts to conduct an experimental repetition of the original study. The repetition is only partially successful, due to missing code and issues with the classification of languages. The second part of this work focuses on their main claim, the association between bugs and languages, and performs a complete, independent reanalysis of the data and of the statistical modeling steps undertaken by Ray et al. in 2014. This reanalysis uncovers a number of serious flaws that reduce the number of languages with an association with defects down from 11 to only 4. Moreover, the practical effect size is exceedingly small. These results thus undermine the conclusions of the original study. Correcting the record is important, as many subsequent works have cited the 2014 article and have asserted, without evidence, a causal link between the choice of programming language for a given task and the number of software defects. Causation is not supported by the data at hand; and, in our opinion, even after fixing the methodological flaws we uncovered, too many unaccounted sources of bias remain to hope for a meaningful comparison of bug rates across languages.

67 citations en Computer Science

Page 6 of 166,777