Results for "Romanic languages"

Showing 20 of ~3,335,557 results · from arXiv, DOAJ, CrossRef, Semantic Scholar

S2 Open Access 2023
An Exploratory Study of EFL Learners’ Use of ChatGPT for Language Learning Tasks: Experience and Perceptions

Yangyu Xiao, Yuying Zhi

ChatGPT, a general-purpose intelligent chatbot developed by OpenAI, has introduced numerous opportunities and challenges in the field of language education. With its remarkable ability to generate diverse forms of text, answer questions, and provide translations within minutes, ChatGPT has become an influential tool in the era of advanced AI technology. However, to what extent ChatGPT can be used to assist students in completing language learning tasks remains largely unexplored. Against this background, this study aimed to investigate students’ experiences with ChatGPT and their perceptions of its role in language learning through a small-scale qualitative study. The data were collected through semi-structured interviews with five students at a top-tier international university in China. Students’ responses revealed that ChatGPT has the potential to serve as a valuable learning partner and aid students in completing language-related tasks. Furthermore, participants exhibited critical judgment in evaluating the quality of ideas and outputs generated by ChatGPT, as well as the ability to modify prompts to maximize learning benefits. Such critical judgment offsets the potential threats to academic integrity posed by ChatGPT. Our findings contribute to the understanding of the potential of ChatGPT in language education by adding empirical evidence from students’ perspectives. This study supports the idea that ChatGPT can work as an effective tool for providing students with immediate feedback and personalized learning experiences. Such findings generate implications for future pedagogical practices in the new era by providing students with personalized guidance, designing technology-embedded language support, and developing students’ lifelong learning skills (e.g., autonomy and evaluative judgment) with the support of ChatGPT.

258 citations en
arXiv Open Access 2026
Targeted Syntactic Evaluation of Language Models on Georgian Case Alignment

Daniel Gallagher, Gerhard Heyer

This paper evaluates the performance of transformer-based language models on split-ergative case alignment in Georgian, a particularly rare system for assigning grammatical cases to mark argument roles. We focus on subject and object marking determined through various permutations of nominative, ergative, and dative noun forms. A treebank-based approach for the generation of minimal pairs using the Grew query language is implemented. We create a dataset of 370 syntactic tests made up of seven tasks containing 50-70 samples each, where three noun forms are tested in any given sample. Five encoder- and two decoder-only models are evaluated with word- and/or sentence-level accuracy metrics. Regardless of the specific syntactic makeup, models performed worst in assigning the ergative case correctly and strongest in assigning the nominative case correctly. Performance correlated with the overall frequency distribution of the three forms (NOM > DAT > ERG). Though data scarcity is a known issue for low-resource languages, we show that the highly specific role of the ergative along with a lack of available training data likely contributes to poor performance on this case. The dataset is made publicly available and the methodology provides an interesting avenue for future syntactic evaluations of languages where benchmarks are limited.

en cs.CL
arXiv Open Access 2026
Indic-TunedLens: Interpreting Multilingual Models in Indian Languages

Mihir Panchal, Deeksha Varshney, Mamta et al.

Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English-centric representation spaces, making cross-lingual interpretability a pressing concern. We introduce Indic-TunedLens, a novel interpretability framework specifically for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which directly decodes intermediate activations, Indic-TunedLens adjusts hidden states for each target language, aligning them with the target output distributions to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low-resource languages. Our results provide crucial insights into the layer-wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/MihirRajeshPanchal/IndicTunedLens. Our code is available at https://github.com/MihirRajeshPanchal/IndicTunedLens.

en cs.CL, cs.AI
S2 Open Access 2025
Translanguaging as a Dynamic Strategy for Heritage Language Transmission

S. Karpava, Natalia Ringblom, Anastassia Zabrodskaja

This study explores translanguaging as a flexible and adaptive strategy for heritage language transmission within multilingual families residing in Cyprus, Estonia, and Sweden. Using a qualitative approach, the research examines family language policies, parental beliefs, and the linguistic practices of bilingual and multilingual families, where one parent speaks Russian. The findings reveal how translanguaging supports bilingual development by fostering linguistic adaptability, bridging heritage and societal languages, and accommodating diverse sociolinguistic contexts. Parents in each country implement unique strategies, influenced by local linguistic landscapes, educational systems, and resource availability. In Cyprus, some families strictly adhered to structured methods like the One Parent–One Language strategy, while others adopted a more integrative multilingual approach, seamlessly translanguaging between Russian, Greek, and English in their daily interactions. Estonian and Swedish families display pragmatic adaptations, emphasizing translanguaging’s role in promoting the emotional well-being and linguistic identity of family members. However, certain challenges persist, including societal language dominance, literacy and educational resource scarcity, and the potential overuse of translanguaging in formal communication. By comparing these contexts, the study underscores the need for flexible yet deliberate family language policies, institutional support, and community resources to sustain bilingualism in bilingual and multilingual families. This research contributes to understanding translanguaging’s implications for intergenerational language transmission in minority and immigrant settings, offering insights for educators, linguists, and policymakers on fostering linguistic diversity and equity in globalized societies.

DOAJ Open Access 2025
El papel de la correspondencia privada en el estudio histórico del español nicaragüense: el discurso epistolar de Mariano Barreto

Paula Albitre Lamata, Carmen Martín Cuadrado

The main objective of this study is to analyze the forms of address used in the five letters that Mariano Barreto wrote to Casanova, Paniagua Prado, and Guzmán, and to compare the quantitative and qualitative results with those from the previously analyzed correspondence between the Nicaraguan writer and Cuervo. The aim is to offer an approach to Barreto's epistolary discourse and to the historical and pragmalinguistic study of Nicaraguan Spanish in the nineteenth century.

Romanic languages, Philology. Linguistics
S2 Open Access 2025
Ḍom(b)a – Dobošari – Romanies: Migrations of Indian Musician Castes to the West

Nikola Rašić

The paper deals with the Indian etymology of the name for the Roma (Rroma, Romanies, Gypsies). It is generally known that the Roma are originally from India and migrated to Europe about 1,000 years ago. Although the Roma are constantly undergoing linguistic assimilation and losing their Indian language, it is still spoken by large populations, especially in Eastern and Southeastern Europe. This language is the only old Indo-Aryan language present in Europe and differs from all known languages of the Indian subcontinent. The name Rroma originates from the Indian word ḍom(b)a, which denotes a low-status caste (jat?) composed of nomadic or semi-nomadic musicians and dancers. Members of this group also engage in various other low-value occupations, such as scavenging, undertaking, cremation and similar tasks. Women, in particular, are often involved in dancing and singing, divination, and fortune-telling. Similar Gypsy-like groups with closely related names can be found elsewhere in Asia. These include the Ḍumaki in Pakistan (blacksmiths and musicians), the Lomari in Armenia, Turkey and neighbouring countries in the Middle East (Caucasian Roma), and the Domari in Palestine and surrounding countries in the Middle East and Central Asia. Comparable groups also exist in Tajikistan, Afghanistan, Iran, Kazakhstan and elsewhere in the region. While these communities are Indian in terms of language and origin, they do not use the Romani name or derivatives of ḍom(b)-. The word ḍom(b)a itself is not Indo-Aryan but originates from a language belonging to the Mundic group. In Munda, the term simply means “drum, tympanum” and refers to the main instrument used by Roma musicians. The paper further discusses the relationship between language, religion and caste in India. A related term, Pārya, originates from the Dravidian word for drum (cf. Tamil paṛai “drum”), and is associated with a similarly low caste status.
Pārya also denotes a Gypsy-like group in Tajikistan and Uzbekistan that speaks a Rajasthani idiom. The Roma are, therefore, ḍom(b)a – “drummers”. From this Munda loanword, a whole family of words for various types of drums, tympanums, and tambourines developed in Indo-Aryan languages and beyond, including Hungarian and Romanian, as well as Croatian and Serbian dob(oš). The root dob- in Croatian is quite obscure and appears primarily in the word dobovati (which describes the sound of pouring rain, resembling the rhythmic beating on the skin of a taut drum). Its near-synonym rominjati (referring to drizzling – rain that falls slowly yet steadily, with accompanying sound) remains a puzzle, unexplained by etymological dictionaries. It may be a remnant from ancient linguistic times that preserved both the forms dom and rom in their original, drumming-related meaning. The paper thus explains the origin of the ethnic name Roma (drummers of non-Aryan origin) while also shedding light on two obscure Croatian words related to drumming.

S2 Open Access 2024
Mastery of Listening and Reading Vocabulary Levels in Relation to CEFR: Insights into Student Admissions and English as a Medium of Instruction

Zhiqing Li, Janis Zhiyou Li, Xiaofang Zhang et al.

Prior to enrolling in an English as a medium of instruction (EMI) institution, students must show an English proficiency level through meeting a benchmark on a standard English proficiency test, which is typically aligned with the Common European Framework of Reference for Languages (CEFR). Along with overall English proficiency, aural/written vocabulary level mastery could also predict students’ success at EMI institutions, as students need adequate English vocabulary knowledge to comprehend lectures and course readings. However, aural/written vocabulary level mastery has yet to be clearly benchmarked to CEFR levels. Therefore, this study aimed to investigate the correlations between students’ aural/written vocabulary level mastery and their CEFR levels. Forty undergraduate students in a Macau EMI university were recruited to take one English proficiency test and two vocabulary level tests (i.e., Listening Vocabulary Levels Test (LVLT) and the Updated Vocabulary Levels Test (UVLT)). Correlation analyses were conducted to explore the relationship between students’ CEFR levels and their mastery of listening and reading vocabulary levels. A positive correlation was found between students’ CEFR levels and their mastery of receptive aural vocabulary levels (ρ = 0.409, p = 0.009). Furthermore, a statistically significant positive correlation was found between students’ CEFR levels and their mastery of receptive written vocabulary levels (ρ = 0.559, p < 0.001). Although positive correlations were observed, no clear pattern was identified regarding the relationship between students’ CEFR levels and their mastery of aural/written vocabulary levels. Regression analyses were further conducted to determine the extent to which the combination of receptive aural and written vocabulary knowledge predicts the CEFR levels. The results indicated that the regression model that included only UVLT scores better predicted the CEFR levels. 
Given the positive correlations observed between students’ CEFR levels and their mastery of vocabulary levels, this study’s findings suggest the inclusion of aural/written vocabulary levels as additional indicators for ensuring student academic success in EMI institutions. Implications for EMI universities on student admissions, classroom teaching, and provision of additional English courses were provided.

18 citations en
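
The rank correlations reported in the abstract above (ρ = 0.409 and ρ = 0.559) are Spearman coefficients. As a rough illustration of how such a coefficient is obtained, the sketch below ranks both variables (averaging tied ranks) and then takes the Pearson correlation of the ranks. The function names and the sample scores are hypothetical and are not taken from the study; the sketch also does not compute p-values.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: CEFR levels coded 1..4 and vocabulary test scores.
cefr_levels = [1, 2, 2, 3, 3, 4]
vocab_scores = [10, 20, 15, 30, 25, 40]
print(spearman_rho(cefr_levels, vocab_scores))
```

Because CEFR levels are ordinal and heavily tied, averaging tied ranks matters here; a version that assumes distinct values would give a different coefficient.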
arXiv Open Access 2024
Forget NLI, Use a Dictionary: Zero-Shot Topic Classification for Low-Resource Languages with Application to Luxembourgish

Fred Philippy, Shohreh Haddadan, Siwen Guo

In NLP, zero-shot classification (ZSC) is the task of assigning labels to textual data without any labeled examples for the target classes. A common method for ZSC is to fine-tune a language model on a Natural Language Inference (NLI) dataset and then use it to infer the entailment between the input document and the target labels. However, this approach faces certain challenges, particularly for languages with limited resources. In this paper, we propose an alternative solution that leverages dictionaries as a source of data for ZSC. We focus on Luxembourgish, a low-resource language spoken in Luxembourg, and construct two new topic relevance classification datasets based on a dictionary that provides various synonyms, word translations and example sentences. We evaluate the usability of our dataset and compare it with the NLI-based approach on two topic classification tasks in a zero-shot manner. Our results show that by using the dictionary-based dataset, the trained models outperform the ones following the NLI-based approach for ZSC. While we focus on a single low-resource language in this study, we believe that the efficacy of our approach can also transfer to other languages where such a dictionary is available.

en cs.CL, cs.AI
arXiv Open Access 2024
Program Analysis via Multiple Context Free Language Reachability

Giovanna Kobus Conrado, Adam Husted Kjelstrøm, Andreas Pavlogiannis et al.

Context-free language (CFL) reachability is a standard approach in static analyses, where the analysis question is phrased as a language reachability problem on a graph $G$ with respect to a CFL $L$. While CFLs lack the expressiveness needed for high precision, common formalisms for context-sensitive languages are such that the corresponding reachability problem is undecidable. Are there useful context-sensitive language-reachability models for static analysis? In this paper, we introduce Multiple Context-Free Language (MCFL) reachability as an expressive yet tractable model for static program analysis. MCFLs form an infinite hierarchy of mildly context-sensitive languages parameterized by a dimension $d$ and a rank $r$. We show the utility of MCFL reachability by developing a family of MCFLs that approximate interleaved Dyck reachability, a common but undecidable static analysis problem. We show that MCFL reachability can be computed in $O(n^{2d+1})$ time on a graph of $n$ nodes when $r=1$, and $O(n^{d(r+1)})$ time when $r>1$. Moreover, we show that when $r=1$, the membership problem has a lower bound of $n^{2d}$ based on the Strong Exponential Time Hypothesis, while reachability for $d=1$ has a lower bound of $n^{3}$ based on the combinatorial Boolean Matrix Multiplication Hypothesis. Thus, for $r=1$, our algorithm is optimal within a factor $n$ for all levels of the hierarchy based on $d$. We implement our MCFL reachability algorithm and evaluate it by underapproximating interleaved Dyck reachability for a standard taint analysis for Android. Used alongside existing overapproximate methods, MCFL reachability discovers all tainted information on 8 out of 11 benchmarks, and confirms $94.3\%$ of the reachable pairs reported by the overapproximation on the remaining 3. To our knowledge, this is the first report of high and provable coverage for this challenging benchmark set.

en cs.PL, cs.CC
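
For readers unfamiliar with the baseline this entry builds on, the following is a minimal sketch of plain CFL reachability (not the MCFL algorithm from the paper) using a textbook worklist over a normalized grammar. The function name, the grammar encoding, and the example graph are illustrative assumptions; the inner scans over all facts keep the code short but are slower than an indexed implementation would be.

```python
from collections import deque


def cfl_reachability(nodes, edges, epsilon, unary, binary):
    """Return every derivable fact (label, u, v) for a normalized grammar.

    edges:   triples (u, terminal, v)
    epsilon: nonterminals A with a production A -> ε
    unary:   pairs (A, B) meaning A -> B
    binary:  triples (A, B, C) meaning A -> B C
    """
    facts = set()
    work = deque()

    def add(label, u, v):
        if (label, u, v) not in facts:
            facts.add((label, u, v))
            work.append((label, u, v))

    for u, label, v in edges:
        add(label, u, v)
    for a in epsilon:
        for n in nodes:
            add(a, n, n)

    while work:
        b, u, v = work.popleft()
        for a, rhs in unary:
            if rhs == b:
                add(a, u, v)
        for a, left, right in binary:
            if left == b:  # popped fact is the left part; look for a right part leaving v
                for lab, s, t in list(facts):
                    if lab == right and s == v:
                        add(a, u, t)
            if right == b:  # popped fact is the right part; look for a left part entering u
                for lab, s, t in list(facts):
                    if lab == left and t == u:
                        add(a, s, v)
    return facts


# Dyck language with one bracket kind: S -> ε | S S | open T,  T -> S close
nodes = [0, 1, 2, 3, 4]
edges = [(0, "open", 1), (1, "open", 2), (2, "close", 3), (3, "close", 4)]
facts = cfl_reachability(
    nodes,
    edges,
    epsilon={"S"},
    unary=[],
    binary=[("S", "S", "S"), ("S", "open", "T"), ("T", "S", "close")],
)
print(("S", 0, 4) in facts)  # True: the path spells "(())"
print(("S", 0, 3) in facts)  # False: the path spells "(()"
```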
arXiv Open Access 2024
Prompting Towards Alleviating Code-Switched Data Scarcity in Under-Resourced Languages with GPT as a Pivot

Michelle Terblanche, Kayode Olaleye, Vukosi Marivate

Many multilingual communities, including numerous in Africa, frequently engage in code-switching during conversations. This behaviour stresses the need for natural language processing technologies adept at processing code-switched text. However, data scarcity, particularly in African languages, poses a significant challenge, as many are low-resourced and under-represented. In this study, we prompted GPT 3.5 to generate Afrikaans--English and Yoruba--English code-switched sentences, enhancing diversity using topic-keyword pairs, linguistic guidelines, and few-shot examples. Our findings indicate that the quality of generated sentences for languages using non-Latin scripts, like Yoruba, is considerably lower when compared with the high Afrikaans-English success rate. There is therefore a notable opportunity to refine prompting guidelines to yield sentences suitable for the fine-tuning of language models. We propose a framework for augmenting the diversity of synthetically generated code-switched data using GPT and propose leveraging this technology to mitigate data scarcity in low-resourced languages, underscoring the essential role of native speakers in this process.

en cs.CL
arXiv Open Access 2024
Romanization Encoding For Multilingual ASR

Wen Ding, Fei Jia, Hainan Xu et al.

We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. By adopting romanization encoding alongside a balanced concatenated tokenizer within a FastConformer-RNNT framework equipped with a Roman2Char module, we significantly reduce vocabulary and output dimensions, enabling larger training batches and reduced memory consumption. Our method decouples acoustic modeling and language modeling, enhancing the flexibility and adaptability of the system. In our study, applying this method to Mandarin-English ASR resulted in a remarkable 63.51% vocabulary reduction and notable performance gains of 13.72% and 15.03% on SEAME code-switching benchmarks. Ablation studies on Mandarin-Korean and Mandarin-Japanese highlight our method's strong capability to address the complexities of other script-heavy languages, paving the way for more versatile and effective multilingual ASR systems.

en cs.CL, cs.SD
arXiv Open Access 2024
Building pre-train LLM Dataset for the INDIC Languages: a case study on Hindi

Shantipriya Parida, Shakshi Panwar, Kusum Lata et al.

Large language models (LLMs) have demonstrated transformative capabilities in many applications that require automatically generating responses based on human instruction. However, the major challenge in building LLMs, particularly for Indic languages, is the availability of high-quality data for building foundation LLMs. In this paper, we propose a large pre-training dataset for the Indic language Hindi. The collected data span several domains and include major dialects of Hindi. The dataset contains 1.28 billion Hindi tokens. We explain our pipeline, including data collection, pre-processing, and availability for LLM pre-training. The proposed approach can be easily extended to other Indic and low-resource languages, and the dataset will be freely available for LLM pre-training and LLM research purposes.

en cs.CL, cs.AI
arXiv Open Access 2024
Triple Roman Domination in Graphs

Hossein Abdollahzadeh Ahangar, M. Pilar Alvarez, Mustapha Chellali et al.

Roman domination in graphs is well studied in graph theory. The topic is related to a defensive strategy problem in which the Roman legions are settled in some secure cities of the Roman Empire. The deployment of the legions around the Empire is designed in such a way that a sudden attack on any undefended city could be quelled by a legion from a strong neighbour. There is an additional condition: no legion can move if doing so leaves its base city defenceless. In this manuscript we start the study of a variant of Roman domination in graphs: the triple Roman domination. We consider that any city of the Roman Empire must be able to be defended by at least three legions. These legions should be either in the attacked city or in one of its neighbours. We determine various bounds on the triple Roman domination number for general graphs, and we give exact values for some graph families. Moreover, complexity results are also obtained.
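
To make the underlying notion concrete, here is a small sketch of classical Roman domination, not the triple variant studied in the paper, whose full formal definition the abstract does not give. In the classical setting a labeling f: V → {0, 1, 2} is Roman dominating when every vertex labelled 0 has a neighbour labelled 2, and the Roman domination number is the minimum total weight. The function names and example graphs are illustrative assumptions, and the brute-force search is only practical for very small graphs.

```python
from itertools import product


def is_roman_dominating(adj, f):
    """Classical condition: each vertex labelled 0 has a neighbour labelled 2."""
    return all(f[v] != 0 or any(f[u] == 2 for u in adj[v]) for v in adj)


def roman_domination_number(adj):
    """Brute-force the minimum weight over all labelings V -> {0, 1, 2}."""
    vertices = sorted(adj)
    best = None
    for labels in product((0, 1, 2), repeat=len(vertices)):
        f = dict(zip(vertices, labels))
        if is_roman_dominating(adj, f):
            weight = sum(labels)
            if best is None or weight < best:
                best = weight
    return best


# Path on four vertices: 0 - 1 - 2 - 3
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
# Star with centre 0 and three leaves
star3 = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(roman_domination_number(path4))  # 3, e.g. f(1) = 2 and f(3) = 1
print(roman_domination_number(star3))  # 2, two legions at the centre
```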

DOAJ Open Access 2024
Afectos cotidianos en Los días azules y La virgen de los sicarios de Fernando Vallejo

Natalia López Rico, Horst Nitschack

This article analyzes the relationship between affects and everyday life in two novels by the Colombian writer Fernando Vallejo (b. 1942), Los días azules (1985) and La virgen de los sicarios (1994). While the first novel portrays forces of affective dynamics such as joy, compassion, and hope that still guide social practices in everyday life, in La virgen de los sicarios these forces disappear and are replaced by hatred, rage, and violence. Consequently, both affects and everyday life undergo a transformation, proving to be, above all, a permanent field of conflict and tensions.

Romanic languages, Philology. Linguistics
DOAJ Open Access 2024
De la incorporació del llenguatge inclusiu a l’afebliment de la llegítima catalana: els propòsits del Projecte de llei d’actualització del Codi civil de Catalunya

Rosa Barceló Compte, Gemma Rubio Gimeno

The Bill for the updating, incorporation, and amendment of certain articles of the Civil Code of Catalonia, which amends and adds several provisions to the Civil Code of Catalonia, proposes a legislative update with two strands: a formal one, aimed at progressively achieving gender-inclusive language, and a substantive one, which adapts certain institutions in order to respond to social change or to resolve questions that have proved controversial in the prior application of the law. The study presented here is framed from this dual perspective: on the one hand, it assesses the timeliness and soundness of the linguistic revision tentatively begun through the adaptation of civil-law provisions; on the other, it considers whether the purported updating of an institution as firmly anchored in succession law as the forced share (llegítima) brings any appreciable improvement in strengthening the freedom of the decedent.

Language and Literature, Romanic languages
arXiv Open Access 2023
Train Global, Tailor Local: Minimalist Multilingual Translation into Endangered Languages

Zhong Zhou, Jan Niehues, Alex Waibel

In many humanitarian scenarios, translation into severely low resource languages often does not require a universal translation engine, but a dedicated text-specific translation engine. For example, healthcare records, hygienic procedures, government communication, emergency procedures and religious texts are all limited texts. While generic translation engines for all languages do not exist, translation of multilingually known limited texts into new, endangered languages may be possible and reduce human translation effort. We attempt to leverage translation resources from many rich resource languages to efficiently produce best possible translation quality for a well known text, which is available in multiple languages, in a new, severely low resource language. We examine two approaches: 1. best selection of seed sentences to jump start translations in a new language in view of best generalization to the remainder of a larger targeted text(s), and 2. we adapt large general multilingual translation engines from many other languages to focus on a specific text in a new, unknown language. We find that adapting large pretrained multilingual models to the domain/text first and then to the severely low resource language works best. If we also select a best set of seed sentences, we can improve average chrF performance on new test languages from a baseline of 21.9 to 50.7, while reducing the number of seed sentences to only around 1,000 in the new, unknown language.

en cs.CL

Page 7 of 166,778