Results for "Romanic languages"

Showing 20 of ~3329845 results · from CrossRef, DOAJ, arXiv, Semantic Scholar

arXiv Open Access 2026
One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models

Benedikt Ebing, Lennart Keller, Goran Glavaš

Exposing latent lexical overlap, script romanization has emerged as an effective strategy for improving cross-lingual transfer (XLT) in multilingual language models (mLMs). Most prior work, however, focused on setups that favor romanization the most: (1) transfer from high-resource Latin-script to low-resource non-Latin-script languages and/or (2) between genealogically closely related languages with different scripts. It thus remains unclear whether romanization is a good representation choice for pretraining general-purpose mLMs, or, more precisely, if information loss associated with romanization harms performance for high-resource languages. We address this gap by pretraining encoder LMs from scratch on both romanized and original texts for six typologically diverse high-resource languages, investigating two potential sources of degradation: (i) loss of script-specific information and (ii) negative cross-lingual interference from increased vocabulary overlap. Using two romanizers with different fidelity profiles, we observe negligible performance loss for languages with segmental scripts, whereas languages with morphosyllabic scripts (Chinese and Japanese) suffer degradation that higher-fidelity romanization mitigates but cannot fully recover. Importantly, comparing monolingual LMs with their mLM counterpart, we find no evidence that increased subword overlap induces negative interference. We further show that romanization improves encoding efficiency (i.e., fertility) for segmental scripts at a negligible performance cost.

en cs.CL
DOAJ Open Access 2025
El episodio del alguacil en el tratado séptimo del Lazarillo de Tormes: sátira contra el arzobispo de Toledo Juan Martínez Silíceo y los «retraídos» de la catedral primada

Jesús Fernando Cáseda Teresa

This study seeks to explain why the work features the «retraídos» who come close to beating Lázaro, the assistant of a constable (alguacil) in Toledo. The presence of these individuals in the novel, who were granted sanctuary («en sagrado») in the churches of the period, is due to the permissiveness of Archbishop Silíceo, who filled the cathedral with these criminals, angering the clergy, who saw him completely disregard their complaints. This fact, along with the others noted in the study, highlights the prominence of Lázaro/Silíceo in the work. This circumstance allows us to establish a probable date of composition, necessarily after the arrival of Juan Martínez Silíceo in Toledo (1546), and puts us on the trail of the text's creator, probably a canon of Toledo Cathedral who knew these incidents well because he too complained in the chapter meetings about the excessive number of «retraídos» sheltered in the building.

Romanic languages, French literature - Italian literature - Spanish literature - Portuguese literature
DOAJ Open Access 2025
Painting a portrait of Swiss federal in-house translators: a job announcement and survey-based inquiry into their diverse tasks and competences

Paolo Canavese

Research on legal and institutional translators’ competences and evolving profiles abounds, but no studies on the Swiss Confederation exist in this field. This paper aims to characterise the typical profiles, in terms of tasks and competences, of federal translators by combining a corpus analysis of 250 job announcements with a survey of 229 participants from various federal language services. The findings show that federal translators translate and revise a plethora of text genres, including legal but also informative texts, and perform a variety of other tasks such as language consultancy and editing of original texts. In doing so, federal translators bring significant strategic and methodological competences into play, as well as linguistic and cultural competences. Thematic competence is essential but can be acquired on the job. Interpersonal and instrumental competences are important auxiliary competences. Regarding the latter, CAT tool skills are the most frequent requirement. These findings contribute to the ongoing discussion on the evolving profiles of language professionals working in the legal and institutional field, and how training should be adapted to reflect real institutional practices and needs.

Language and Literature, Romanic languages
arXiv Open Access 2025
Finding Compiler Bugs through Cross-Language Code Generator and Differential Testing

Qiong Feng, Xiaotian Ma, Ziyuan Feng et al.

Compilers play a central role in translating high-level code into executable programs, making their correctness essential for ensuring code safety and reliability. While extensive research has focused on verifying the correctness of compilers for single-language compilation, the correctness of cross-language compilation - which involves the interaction between two languages and their respective compilers - remains largely unexplored. To fill this research gap, we propose CrossLangFuzzer, a novel framework that introduces a universal intermediate representation (IR) for JVM-based languages and automatically generates cross-language test programs with diverse type parameters and complex inheritance structures. After generating the initial IR, CrossLangFuzzer applies three mutation techniques - LangShuffler, FunctionRemoval, and TypeChanger - to enhance program diversity. By evaluating both the original and mutated programs across multiple compiler versions, CrossLangFuzzer successfully uncovered 10 confirmed bugs in the Kotlin compiler, 4 confirmed bugs in the Groovy compiler, 7 confirmed bugs in the Scala 3 compiler, 2 confirmed bugs in the Scala 2 compiler, and 1 confirmed bug in the Java compiler. Among all mutators, TypeChanger is the most effective, detecting 11 of the 24 compiler bugs. Furthermore, we analyze the symptoms and root causes of cross-compilation bugs, examining the respective responsibilities of language compilers when incorrect behavior occurs during cross-language compilation. To the best of our knowledge, this is the first work specifically focused on identifying and diagnosing compiler bugs in cross-language compilation scenarios. Our research helps to understand these challenges and contributes to improving compiler correctness in multi-language environments.

en cs.PL, cs.SE
arXiv Open Access 2025
Large Language Model Powered Symbolic Execution

Yihe Li, Ruijie Meng, Gregory J. Duck

Large Language Models (LLMs) have emerged as a promising alternative to traditional static program analysis methods, such as symbolic execution, offering the ability to reason over code directly without relying on theorem provers or SMT solvers. However, LLMs are also inherently approximate by nature, and therefore face significant challenges in relation to the accuracy and scale of analysis in real-world applications. Such issues often necessitate the use of larger LLMs with higher token limits, but this requires enterprise-grade hardware (GPUs) and thus limits accessibility for many users. In this paper, we propose LLM-based symbolic execution -- a novel approach that enhances LLM inference via a path-based decomposition of the program analysis tasks into smaller (more tractable) subtasks. The core idea is to generalize path constraints using a generic code-based representation that the LLM can directly reason over, and without translation into another (less-expressive) formal language. We implement our approach in the form of AutoBug, an LLM-based symbolic execution engine that is lightweight and language-agnostic, making it a practical tool for analyzing code that is challenging for traditional approaches. We show that AutoBug can improve both the accuracy and scale of LLM-based program analysis, especially for smaller LLMs that can run on consumer-grade hardware.

arXiv Open Access 2025
Improving Informally Romanized Language Identification

Adrian Benton, Alexander Gutkin, Christo Kirov et al.

The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), the lack of conventional spelling in the Latin script results in high spelling variability. Such romanization renders languages that are normally easily distinguished due to being written in different scripts - Hindi and Urdu, for example - highly confusable. In this work, we increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher LID system accuracy than including available naturally occurring examples in the training set, or even training higher capacity models. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7% (using a pretrained neural model) to 85.4% using a linear classifier trained solely on synthetic data and 88.2% when also training on available harvested text.

S2 Open Access 2023
Adaptation and Validation of the German Reading Comprehension Test ELFE II for Spanish

Ariel Cuadro, César Daniel Costa-Ball, Alexa von Hagen et al.

The objective of this study is to present the process of adaptation and validation of the German reading comprehension test ELFE II (Lenhard et al., 2017) for Spanish. Both languages exhibit comparable orthographic depth but differ in their word material due to their different origins (Romanic vs. Germanic language roots). We reconstructed the test based on the original structure and construction principles and aligned syllable structure, frequency, and word length. Convergent validity, item analysis, dimensionality, and reliability studies provide evidence to support a unidimensional structure for each scale subtest, which in turn is in accordance with theoretical frameworks.

5 citations en
DOAJ Open Access 2023
Perception des étudiants universitaires hispanophones de l’utilisation d’Instagram comme appui à l’enseignement présentiel de la grammaire

Mónica Nieto Escobar

This research examines the use of the social network Instagram to improve the learning of French grammar in an increasingly digital world. As is well known, grammar is one of the most difficult competences for Spanish-speaking students to master. Instagram appears to offer an extended learning space that complements in-person instruction on the grammar points covered in class. A survey was therefore conducted among students to determine whether, from their point of view, using Instagram helps them acquire grammatical content and whether they find this tool more motivating than traditional teaching with a textbook. Grammar was taught through a hybrid course based on the flipped-classroom methodology, using pedagogical capsules and a reflective, contrastive method.

Language and Literature, Romanic languages
arXiv Open Access 2023
Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages

Yash Madhani, Mitesh M. Khapra, Anoop Kunchukuttan

We create publicly available language identification (LID) datasets and models in all 22 Indian languages listed in the Indian constitution in both native-script and romanized text. First, we create Bhasha-Abhijnaanam, a language identification test set for native-script as well as romanized text which spans all 22 Indic languages. We also train IndicLID, a language identifier for all the above-mentioned languages in both native and romanized script. For native-script text, it has better language coverage than existing LIDs and is competitive or better than other LIDs. IndicLID is the first LID for romanized text in Indian languages. Two major challenges for romanized text LID are the lack of training data and low-LID performance when languages are similar. We provide simple and effective solutions to these problems. In general, there has been limited work on romanized text in any language, and our findings are relevant to other languages that need romanized language identification. Our models are publicly available at https://ai4bharat.iitm.ac.in/indiclid under open-source licenses. Our training and test sets are also publicly available at https://ai4bharat.iitm.ac.in/bhasha-abhijnaanam under open-source licenses.

en cs.CL
arXiv Open Access 2023
Romanization-based Large-scale Adaptation of Multilingual Language Models

Sukannya Purkayastha, Sebastian Ruder, Jonas Pfeiffer et al.

Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP. However, their large-scale deployment to many languages, besides pretraining data scarcity, is also hindered by the increase in vocabulary size and limitations in their parameter budget. In order to boost the capacity of mPLMs to deal with low-resource and unseen languages, we explore the potential of leveraging transliteration on a massive scale. In particular, we explore the UROMAN transliteration tool, which provides mappings from UTF-8 to Latin characters for all the writing systems, enabling inexpensive romanization for virtually any language. We first focus on establishing how UROMAN compares against other language-specific and manually curated transliterators for adapting multilingual PLMs. We then study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages. Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups: on languages with unseen scripts and with limited training data without any vocabulary augmentation. Further analyses reveal that an improved tokenizer based on romanized data can even outperform non-transliteration-based methods in the majority of languages.

en cs.CL, cs.LG
arXiv Open Access 2023
Transport via Partial Galois Connections and Equivalences

Kevin Kappelmann

Multiple types can represent the same concept. For example, lists and trees can both represent sets. Unfortunately, this easily leads to incomplete libraries: some set-operations may only be available on lists, others only on trees. Similarly, subtypes and quotients are commonly used to construct new type abstractions in formal verification. In such cases, one often wishes to reuse operations on the representation type for the new type abstraction, but to no avail: the types are not the same. To address these problems, we present a new framework that transports programs via equivalences. Existing transport frameworks are either designed for dependently typed, constructive proof assistants, use univalence, or are restricted to partial quotient types. Our framework (1) is designed for simple type theory, (2) generalises previous approaches working on partial quotient types, and (3) is based on standard mathematical concepts, particularly Galois connections and equivalences. We introduce the notion of partial Galois connections and equivalences and prove their closure properties under (dependent) function relators, (co)datatypes, and compositions. We formalised the framework in Isabelle/HOL and provide a prototype. This is the extended version of "Transport via Partial Galois Connections and Equivalences", 21st Asian Symposium on Programming Languages and Systems, 2023.

en cs.PL, cs.LO
S2 Open Access 2023
FORMATION AND DIFFUSION OF THE SUFFIX -ETT- IN ITALIAN PERFECT (PASSATO REMOTO): THEORIES AND REALITY

Irina N. Milyaeva

The article deals with the origins and spread of the suffix -ett-. The suffix is used to form the first and the third persons singular and the third person plural of the Italian perfect form (passato remoto) of some second conjugation verbs (-ere conjugation verbs). Since the mid-nineteenth century a number of European experts in Romanic languages have offered different theories of the origins of the suffix -ett- (ideas by Demattio, Amedeo and others). On considering them we have found the majority of the theories groundless and not credible. We believe the Bybee-Slobin supposition to be worth noticing among contemporary ones. Their idea suggests the spread of the suffix -ett- as a consequence of rethinking the pattern that reads as follows: “Passato remoto can have a segment with a double sibilant”. The double sibilant is really present in such Italian perfect forms as disse, scrisse, visse, resse and in the corresponding forms of cognate verbs. This supposition is quite logical but still has a number of drawbacks, which prevents us from admitting it as the principal one. We suppose the theory of the suffix -ett- originating from perfect forms of the verb stare, offered by Friedrich Diez, to be the most logical and verifiable. Documented evidence of it is contained in the works of Italian scholars F. D’Ovidio and R. Spina, yet the latter has some discrepancies between the periodization of the suffix spread offered in the work and the actual data. Some written sources may have failed to survive to the present, which makes it difficult to reconstruct an objective picture of how the process under consideration developed. Yet that doesn’t mean that we shouldn’t tackle the issues of the historical morphology of the Italian verb. Summarizing and analyzing the views of the experts in Romanic languages mentioned above, we have put forward our own periodization of the phenomenon under consideration based on the documented data.

DOAJ Open Access 2022
A la derecha, abajo hay los lavaderos. Transgresión del efecto de definitud en construcciones impersonales con haber

Jordina Frago Cañellas

This article aims to provide a linguistic analysis of violations of the definiteness effect in impersonal constructions with haber in samples of Spanish spoken in Catalonia, specifically in the occurrences obtained from the Corpus Oral y Sonoro del Español Rural for the Catalan provinces. It first sets out how Spanish and Catalan handle locative-existential constructions and the verbs involved in these structures, and then analyzes the violation of the definiteness effect as a possible interference from Catalan into Spanish. The discussion of the data is based on contrasting the locative-existential constructions of the two languages.

Romanic languages, French literature - Italian literature - Spanish literature - Portuguese literature
DOAJ Open Access 2022
Generalized Phasal Defectivity in Brazilian Portuguese

Jairo Nunes

Brazilian Portuguese displays a cluster of apparently unrelated properties that set it aside within Romance. On the one hand, it has lost its third person possessive pronouns (cf. Oliveira e Silva 1985, Perini 1985, Cerqueira 1996, and Müller 1996), its third person accusative and dative clitics (cf. Omena 1978, Duarte 1986, Galves 1987, Kato 1993, Nunes 1993, Cyrino 1997, and Berlinck 2006), and its null subjects and null possessors have become severely restricted (cf. Duarte 1995, Figueiredo Silva 1996, Kato 1999, Ferreira 2000, Modesto 2000, Galves 2001, Floripi 2003, and Rodrigues 2004). On the other hand, it came to allow hyper-raising constructions (cf. Ferreira 2000 and Nunes 2020b), as well as the so-called ‘topic subject’ constructions, where a putative topic controls verbal agreement (cf. Pontes 1987, Galves 1987, Nunes 2017, and Kato and Ordóñez 2019). Moreover, it makes a pervasive use of preposition deletion in relative clauses (cf. Tarallo 1983) and its directional verbs came to select the preposition em ‘in’ instead of a ‘to’ (cf. Wiedemer 2013). In this paper, I argue that these and other seemingly independent changes can be accounted for if there is a general process of underspecification affecting phases in Brazilian Portuguese.

Romanic languages, Philology. Linguistics
arXiv Open Access 2022
Language Resources and Technologies for Non-Scheduled and Endangered Indian Languages

Ritesh Kumar, Bornini Lahiri

In the present paper, we present a survey of the language resources and technologies available for the non-scheduled and endangered languages of India. While there have been different estimates from different sources about the number of languages in India, it could be assumed that there are more than 1,000 languages currently being spoken in India. However, barring some of the 22 languages included in the 8th Schedule of the Indian Constitution (called the scheduled languages), there is hardly any substantial resource or technology available for the rest of the languages. Nonetheless, there have been some individual attempts at developing resources and technologies for the different languages across the country. Of late, some financial support has also become available for the endangered languages. In this paper, we give a summary of the resources and technologies for those Indian languages which are not included in the 8th Schedule of the Indian Constitution and/or which are endangered.

en cs.CL
S2 Open Access 2021
PunKtuator: A Multilingual Punctuation Restoration System for Spoken and Written Text

Varnith Chordia

Text transcripts without punctuation or sentence boundaries are hard to comprehend for both humans and machines. Punctuation marks play a vital role by providing meaning to the sentence, and incorrect use or placement of punctuation marks can often alter it. This can impact downstream tasks such as language translation and understanding, pronoun resolution, text summarization, etc. for humans and machines. An automated punctuation restoration (APR) system with minimal human intervention can improve comprehension of text and help users write better. In this paper we describe a multitask modeling approach as a system to restore punctuation in multiple high-resource languages, Germanic (English and German) and Romanic (French), and low-resource languages, Indo-Aryan (Hindi) and Dravidian (Tamil), that does not require extensive knowledge of grammar or syntax of a given language for both spoken and written form of text. For German language and the given Indic based languages this is the first towards restoring punctuation and can serve as a baseline for future work.

15 citations en Computer Science
