Results for "Romanic languages"

Showing 20 of ~3,336,031 results · from CrossRef, arXiv, DOAJ, Semantic Scholar

S2 Open Access 2020
A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

Bharathi Raja Chakravarthi, Navya Jose, Shardul Suryawanshi et al.

There is an increasing demand for sentiment analysis of text from social media, which is mostly code-mixed. Systems trained on monolingual data fail on code-mixed data due to the complexity of mixing at different levels of the text. However, very few resources are available to create models specific to code-mixed data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets for popular language pairs such as English-Spanish, English-Hindi, and English-Chinese are available. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed Malayalam-English text, annotated by voluntary annotators. The corpus obtained a Krippendorff's alpha above 0.8. We use this new corpus to provide a benchmark for sentiment analysis of Malayalam-English code-mixed text.

237 citations en Computer Science, Political Science
arXiv Open Access 2026
Bounds for (strong) Roman $k$-dominations

Fahimeh Khosh-Ahang Ghasr

Motivated by resource defense models in networks, such as protecting territories with varying legion strengths, let $k \geq 2$ be an integer. Roman $k$-domination and strong Roman $k$-domination generalize Roman, double Roman, Italian, and double Italian domination to an arbitrary number of legions. The main goal of this note is to establish sharp upper bounds for the Roman and strong Roman $k$-domination numbers of connected graphs. These bounds unify and extend prior results for $k=2$ and $k=3$. We also precisely characterize the graphs achieving these bounds.

en math.CO
arXiv Open Access 2025
Roman domination in weighted graphs

Martín Cera, Pedro García-Vázquez, Juan Carlos Valenzuela-Tripodoro

A Roman dominating function for a (non-weighted) graph $G=(V,E)$ is a function $f:V\rightarrow \{0,1,2\}$ such that every vertex $u\in V$ with $f(u)=0$ has at least one neighbor $v\in V$ with $f(v)=2$. The minimum weight $\sum_{v\in V}f(v)$ of a Roman dominating function $f$ on $G$ is called the Roman domination number of $G$ and is denoted by $\gamma_{R}(G)$. A graph $G=(V,E)$ together with a positive real-valued weight function $w:V\rightarrow \mathbf{R}^{>0}$ is called a weighted graph and is denoted by $(G;w)$. The minimum weight $\sum_{v\in V}f(v)w(v)$ of a Roman dominating function $f$ on $G$ is called the weighted Roman domination number of $G$ and is denoted by $\gamma_{wR}(G)$. The domination and Roman domination numbers of unweighted graphs have been extensively studied, particularly for their applications in bioinformatics and computational biology. However, graphs used to model biomolecular structures often require weights to be biologically meaningful. In this paper, we initiate the study of the weighted Roman domination number in weighted graphs. We first establish several bounds for this parameter and present various realizability results. Furthermore, we determine the exact values for several well-known graph families and demonstrate an equivalence between the weighted Roman domination number and the differential of a weighted graph.
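The two parameters defined in this abstract can be checked by brute force on small graphs. A minimal sketch (the example graph, weights, and function names below are illustrative, not from the paper):

```python
from itertools import product

def is_roman_dominating(adj, f):
    # Every vertex with value 0 must have at least one neighbor with value 2.
    return all(any(f[u] == 2 for u in adj[v])
               for v in adj if f[v] == 0)

def weighted_roman_domination_number(adj, w):
    # gamma_wR(G): minimum of sum f(v)*w(v) over all Roman dominating f,
    # found by exhaustive search over all 3^|V| functions f: V -> {0,1,2}.
    verts = sorted(adj)
    best = float("inf")
    for vals in product((0, 1, 2), repeat=len(verts)):
        f = dict(zip(verts, vals))
        if is_roman_dominating(adj, f):
            best = min(best, sum(f[v] * w[v] for v in verts))
    return best

# Path P4 (0-1-2-3) with unit weights recovers the unweighted
# Roman domination number gamma_R(P4) = 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
unit = {v: 1 for v in adj}
print(weighted_roman_domination_number(adj, unit))  # 3
```

With unit weights this reduces to $\gamma_{R}(G)$; nonuniform weights give $\gamma_{wR}(G)$ directly.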

en cs.DM, math.CO
arXiv Open Access 2025
Counting and Sampling Traces in Regular Languages

Alexis de Colnet, Kuldeep S. Meel, Umang Mathur

In this work, we study the problems of counting and sampling Mazurkiewicz traces that a regular language touches. Fix an alphabet $\Sigma$ and an independence relation $\mathbb{I} \subseteq \Sigma \times \Sigma$. The input consists of a regular language $L \subseteq \Sigma^*$, given by a finite automaton with $m$ states, and a natural number $n$ (in unary). For the counting problem, the goal is to compute the number of Mazurkiewicz traces (induced by $\mathbb{I}$) that intersect the $n^\text{th}$ slice $L_n = L \cap \Sigma^n$, i.e., traces that admit at least one linearization in $L_n$. For the sampling problem, the goal is to output a trace drawn from a distribution that is approximately uniform over all such traces. These tasks are motivated by bounded model checking with partial-order reduction, where an \emph{a priori} estimate of the reduced state space is valuable, and by testing methods for concurrent programs that use partial-order-aware random exploration. We first show that the counting problem is #P-hard even when $L$ is accepted by a deterministic automaton, in sharp contrast to counting words of a DFA, which is polynomial-time solvable. We then prove that the problem lies in #P for both NFAs and DFAs, irrespective of whether $L$ is trace-closed. Our main algorithmic contributions are a \emph{fully polynomial-time randomized approximation scheme} (FPRAS) that, with high probability, approximates the desired count within a prescribed accuracy, and a \emph{fully polynomial-time almost uniform sampler} (FPAUS) that generates traces whose distribution is provably close to uniform.

en cs.FL, cs.CC
arXiv Open Access 2025
Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models

Umer Butt, Stalin Veranasi, Günter Neumann

As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity, addressing the needs of low-resource languages remains a significant challenge. Transliteration between Urdu and its Romanized form, Roman Urdu, remains underexplored despite the widespread use of both scripts in South Asia. Prior work using RNNs on the Roman-Urdu-Parl dataset showed promising results but suffered from poor domain adaptability and limited evaluation. We propose a transformer-based approach using the m2m100 multilingual translation model, enhanced with masked language modeling (MLM) pretraining and fine-tuning on both Roman-Urdu-Parl and the domain-diverse Dakshina dataset. To address previous evaluation flaws, we introduce rigorous dataset splits and assess performance using BLEU, character-level BLEU, and CHRF. Our model achieves strong transliteration performance, with Char-BLEU scores of 96.37 for Urdu->Roman-Urdu and 97.44 for Roman-Urdu->Urdu. These results outperform both RNN baselines and GPT-4o Mini and demonstrate the effectiveness of multilingual transfer learning for low-resource transliteration tasks.

en cs.CL, cs.AI
arXiv Open Access 2025
Enhancing Plagiarism Detection in Marathi with a Weighted Ensemble of TF-IDF and BERT Embeddings for Low-Resource Language Processing

Atharva Mutsaddi, Aditya Choudhary

Plagiarism involves using another person's work or concepts without proper attribution, presenting them as original creations. With the growing amount of data communicated in regional Indian languages such as Marathi, it is crucial to design robust plagiarism detection systems tailored for low-resource languages. Language models like Bidirectional Encoder Representations from Transformers (BERT) have demonstrated exceptional capability in text representation and feature extraction, making them essential tools for semantic analysis and plagiarism detection. However, the application of BERT to low-resource languages remains under-explored, particularly in the context of plagiarism detection. This paper presents a method to enhance the accuracy of plagiarism detection for Marathi texts using BERT sentence embeddings in conjunction with Term Frequency-Inverse Document Frequency (TF-IDF) feature representation. This approach effectively captures statistical, semantic, and syntactic aspects of text features through a weighted voting ensemble of machine learning models.
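The weighted combination of statistical and semantic signals described in this abstract can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the weights, vectors, and decision threshold below are all hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def ensemble_similarity(tfidf_a, tfidf_b, emb_a, emb_b,
                        w_tfidf=0.4, w_bert=0.6):
    # Weighted vote: combine a TF-IDF (statistical) similarity with a
    # sentence-embedding (semantic) similarity. Weights are hypothetical.
    return (w_tfidf * cosine(tfidf_a, tfidf_b)
            + w_bert * cosine(emb_a, emb_b))

# Flag a candidate pair as plagiarised when the combined score crosses a
# (hypothetical) threshold. The toy vectors stand in for real TF-IDF and
# BERT representations of two documents.
score = ensemble_similarity([1.0, 0.0, 2.0], [0.9, 0.1, 1.8],
                            [0.2, 0.7], [0.25, 0.65])
print(score > 0.8)  # True
```

In practice the TF-IDF vectors would come from a fitted vectorizer and the embeddings from a BERT model; only the weighted-vote step is shown here.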

en cs.CL, cs.AI
arXiv Open Access 2025
Enumeration With Nice Roman Domination Properties

Kevin Mann

Although Extension Perfect Roman Domination is NP-complete, all minimal (with respect to the pointwise order) perfect Roman dominating functions can be enumerated with polynomial delay. This algorithm uses a bijection between minimal perfect Roman dominating functions and Roman dominating functions, together with the fact that all minimal Roman dominating functions can be enumerated with polynomial delay. The bijection considers the set of vertices with value 2 under the functions. In this paper, we generalize this idea by defining so-called nice Roman domination properties for which this method can be employed. With this idea, we show that all minimal maximal Roman dominating functions can be enumerated with polynomial delay in O(1.9332^n) time. Furthermore, we prove that enumerating all minimal connected/total Roman dominating functions on cobipartite graphs can be achieved with polynomial delay. Additionally, we show the existence of a polynomial-delay algorithm for enumerating all minimal connected Roman dominating functions on interval graphs. We also discuss some downsides of this method.

en cs.DM, cs.CC
S2 Open Access 2024
The Complexity in Bilingual Code-Switching Research: A Systematic Review

William Rayo, Aldo M. Barrita, Lianelys Cabrera Martinez et al.

This systematic review explored how researchers operationalized bilingualism when investigating the relationship between bilingual code-switching experience and cognition. Through a PRISMA-guided systematic review of thirty-two studies with original data, published in English, focusing on adult non-clinical samples, with bilingualism as a key variable, we aimed to understand the prevalence of these issues. Criteria for inclusion required an assessment of bilingualism beyond language proficiency or age of acquisition, and consideration of naturalistic code-switching behaviors. We report our results through an analysis of themes that included aspects of language that are considered when measuring bilingualism and code-switching experience. We present our findings and offer insights for future research, advocating for the inclusion of sociocultural factors and more complex analytical modeling in bilingualism research to foster an evolution in the field.

15 citations en
arXiv Open Access 2024
Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language

Anastasia Zhukova, Christian E. Matt, Bela Gipp

Domain-specific languages that use a lot of specific terminology often fall into the category of low-resource languages. Collecting test datasets in a narrow domain is time-consuming and requires skilled human resources with domain knowledge and training for the annotation task. This study addresses the challenge of automatically collecting test datasets to evaluate semantic search in the low-resource domain-specific German language of the process industry. Our approach proposes an end-to-end annotation pipeline, from automated query generation to the score reassessment of query-document pairs. To overcome the lack of text encoders trained on the German chemistry domain, we explore the principle of an ensemble of "weak" text encoders trained on common-knowledge datasets. We combine individual relevance scores from diverse models to retrieve document candidates with relevance scores generated by an LLM, aiming to achieve consensus on query-document alignment. Evaluation results demonstrate that the ensemble method significantly improves alignment with human-assigned relevance scores, outperforming individual models in both inter-coder agreement and accuracy metrics. These findings suggest that ensemble learning can effectively adapt semantic search systems to specialized, low-resource languages, offering a practical solution to resource limitations in domain-specific contexts.

en cs.CL
arXiv Open Access 2024
Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias

Jayanta Sadhu, Maneesha Rani Saha, Rifat Shahriyar

The rapid growth of Large Language Models (LLMs) has made the study of their biases a crucial field. It is important to assess the influence of different types of biases embedded in LLMs to ensure fair use in sensitive fields. Although there has been extensive work on bias assessment in English, such efforts are scarce for a major language like Bangla. In this work, we examine two types of social biases in LLM-generated outputs for the Bangla language. Our main contributions are: (1) bias studies on two different social biases for Bangla, (2) a curated dataset for bias measurement benchmarking, and (3) testing two different probing techniques for bias detection in the context of Bangla. To the best of our knowledge, this is the first such work involving bias assessment of LLMs for Bangla. All our code and resources are publicly available to support progress in bias-related research in Bangla NLP.

en cs.CL
arXiv Open Access 2024
Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal

Elodie Gauthier, Aminata Ndiaye, Abdoulaye Guissé

This work is part of the Kallaama project, whose objective is to produce and disseminate national-language corpora for speech technology development in the field of agriculture. Except for Wolof, which benefits from some language data for natural language processing, the national languages of Senegal are largely ignored by language technology providers. However, such technologies are key to the protection, promotion and teaching of these languages. Kallaama focuses on the three languages most widely spoken by the Senegalese people: Wolof, Pulaar and Sereer. These languages are widely spoken by the population, with around 10 million native speakers in Senegal, not to mention those outside the country. However, they remain under-resourced in terms of machine-readable data that can be used for automatic processing and language technologies, all the more so in the agricultural sector. We release a transcribed speech dataset containing 125 hours of recordings about agriculture in each of the above-mentioned languages. These resources are specifically designed for Automatic Speech Recognition purposes, including traditional approaches. To support the building of such technologies, we also provide textual corpora in Wolof and Pulaar, and a pronunciation lexicon containing 49,132 entries from the Wolof dataset.

en cs.CL
DOAJ Open Access 2024
Consideraciones dialectológicas sobre expresiones para dar agradecimiento y su inclusión en la enseñanza-aprendizaje del ELA

Daniela Caro Martínez, Juan Felipe Zuluaga Molina

In the Aburrá Valley, people do not only say "gracias"; they also say "mi Dios se lo ha de multiplicar", "sos un sol" or "la buena". As these show, the culture and social reality of speakers are also reflected in their ways of giving thanks. This qualitative study therefore sought to identify the thanking expressions (EXPADA) of standard Spanish and of the Paisa variety, and to understand how they have been treated in the teaching of Spanish as an additional language (ELA). The study concludes that it is necessary to teach and integrate EXPADA from the earliest stages of Spanish acquisition and across the different proficiency levels, balancing their complexity, curricular inclusion and register against the learner's level.

Romanic languages, Philology. Linguistics

Page 8 of 166802