Results for "Translating and interpreting"

Showing 20 of ~137,632 results · from CrossRef, arXiv, DOAJ, Semantic Scholar

JSON API
arXiv Open Access 2025
Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

Lina Conti, Dennis Fucci, Marco Gaido et al.

Unlike text, speech conveys information about the speaker, such as gender, through acoustic cues like pitch. This gives rise to modality-specific bias concerns. For example, in speech translation (ST), when translating from languages with notional gender, such as English, into languages where gender-ambiguous terms referring to the speaker are assigned grammatical gender, the speaker's vocal characteristics may play a role in gender assignment. This risks misgendering speakers, whether through masculine defaults or vocal-based assumptions. Yet, how ST models make these decisions remains poorly understood. We investigate the mechanisms ST models use to assign gender to speaker-referring terms across three language pairs (en-es/fr/it), examining how training data patterns, internal language model (ILM) biases, and acoustic information interact. We find that models do not simply replicate term-specific gender associations from training data, but learn broader patterns of masculine prevalence. While the ILM exhibits strong masculine bias, models can override these preferences based on acoustic input. Using contrastive feature attribution on spectrograms, we reveal that the model with higher gender accuracy relies on a previously unknown mechanism: using first-person pronouns to link gendered terms back to the speaker, accessing gender information distributed across the frequency spectrum rather than concentrated in pitch.

en cs.CL, cs.AI
arXiv Open Access 2025
MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines

Dávid Javorský, Ondřej Bojar, François Yvon

In simultaneous interpreting, an interpreter renders a source speech into another language with a very short lag, much sooner than sentences are finished. In order to understand and later reproduce this dynamic and complex task automatically, we need dedicated datasets and tools for analysis, monitoring, and evaluation, such as parallel speech corpora, and tools for their automatic annotation. Existing parallel corpora of translated texts and associated alignment algorithms hardly fill this gap, as they fail to model long-range interactions between speech segments or specific types of divergences (e.g., shortening, simplification, functional generalization) between the original and interpreted speeches. In this work, we introduce MockConf, a student interpreting dataset that was collected from Mock Conferences run as part of the students' curriculum. This dataset contains 7 hours of recordings in 5 European languages, transcribed and aligned at the level of spans and words. We further implement and release InterAlign, a modern web-based annotation tool for parallel word and span annotations on long inputs, suitable for aligning simultaneous interpreting. We propose metrics for the evaluation and a baseline for automatic alignment. Dataset and tools are released to the community.

en cs.CL
DOAJ Open Access 2025
Lenguaje claro en salud y traducción intergenérica: una guía para la adaptación de textos médicos

Paula Torres López

Plain language serves as a tool for adapting and simplifying specialized knowledge, improving the comprehensibility of information. In the field of translation, researchers have examined the process of translating texts originally intended for specialists so that they can be consulted by lay users; methodologies have been developed for this type of translation, and various techniques available to the translator have been proposed. However, the application of these plain-language principles to health communication remains understudied for Spanish. This article analyzes guides and standards on plain language in the healthcare domain, highlighting the need to translate and adapt complex medical texts into a format accessible to patients through processes of determinologization and linguistic simplification. The study concludes with a plain-language guide for health communication that proposes specific guidelines for improving lay readers' comprehension of medical texts, contributing to equitable access to health information.

Translating and interpreting
arXiv Open Access 2024
Refining Translations with LLMs: A Constraint-Aware Iterative Prompting Approach

Shangfeng Chen, Xiayang Shi, Pu Li et al.

Large language models (LLMs) have demonstrated remarkable proficiency in machine translation (MT), even without specific training on the languages in question. However, translating rare words in low-resource or domain-specific contexts remains challenging for LLMs. To address this issue, we propose a multi-step prompt chain that enhances translation faithfulness by prioritizing key terms crucial for semantic accuracy. Our method first identifies these keywords and retrieves their translations from a bilingual dictionary, integrating them into the LLM's context using Retrieval-Augmented Generation (RAG). We further mitigate potential output hallucinations caused by long prompts through an iterative self-checking mechanism, where the LLM refines its translations based on lexical and semantic constraints. Experiments using Llama and Qwen as base models on the FLORES-200 and WMT datasets demonstrate significant improvements over baselines, highlighting the effectiveness of our approach in enhancing translation faithfulness and robustness, particularly in low-resource scenarios.

en cs.CL
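The constraint-aware prompting idea in the abstract above can be sketched roughly as follows: identify key terms, retrieve their translations from a bilingual dictionary, and embed them as constraints in the LLM prompt. The dictionary, term finder, and prompt wording are all illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a keyword-constrained translation prompt:
# (1) find source terms with a dictionary entry, (2) turn the retrieved
# translations into explicit constraints, (3) prepend them to the request.

def find_key_terms(sentence, dictionary):
    """Return the source words that have a bilingual-dictionary entry."""
    return [w for w in sentence.lower().split() if w in dictionary]

def build_constrained_prompt(sentence, dictionary, target_lang="Spanish"):
    terms = find_key_terms(sentence, dictionary)
    constraints = "\n".join(
        f'- render "{t}" as "{dictionary[t]}"' for t in terms
    )
    return (
        f"Translate the sentence into {target_lang}, "
        f"honouring these term constraints:\n{constraints}\n"
        f"Sentence: {sentence}\nTranslation:"
    )

prompt = build_constrained_prompt(
    "the stenosis was severe", {"stenosis": "estenosis"}
)
```

In the paper's full pipeline this prompt would be followed by an iterative self-checking round in which the LLM verifies its own output against the same constraints.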
arXiv Open Access 2024
Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis

Sabri Boughorbel, MD Rizwan Parvez, Majd Hawasly

Training LLMs for low-resource languages usually utilizes data augmentation from English using machine translation (MT). This, however, brings a number of challenges to LLM training: there are large costs attached to translating and curating huge amounts of content with high-end machine translation solutions; the translated content carries over cultural biases; and if the translation is not faithful and accurate, data quality degrades, causing issues in the trained model. In this work, we investigate the role of translation and synthetic data in training language models. We translate TinyStories, a dataset of 2.2M short stories for 3-4-year-old children, from English to Arabic using the open NLLB-3B MT model. We train a number of story generation models of size 1M-33M parameters using this data. We identify a number of quality and task-specific issues in the resulting models. To rectify these issues, we further pre-train the models with a small dataset of synthesized high-quality Arabic stories generated by a capable LLM, representing 1% of the original training data. We show, using GPT-4 as a judge and Dictionary Learning Analysis from mechanistic interpretability, that the suggested approach is a practical means to resolve some of the machine translation pitfalls. We illustrate the improvements through case studies of linguistic and cultural bias issues.

en cs.CL
arXiv Open Access 2024
Emojinize: Enriching Any Text with Emoji Translations

Lars Henning Klein, Roland Aydin, Robert West

Emoji have become ubiquitous in written communication, on the Web and beyond. They can emphasize or clarify emotions, add details to conversations, or simply serve decorative purposes. This casual use, however, barely scratches the surface of the expressive power of emoji. To further unleash this power, we present Emojinize, a method for translating arbitrary text phrases into sequences of one or more emoji without requiring human input. By leveraging the power of large language models, Emojinize can choose appropriate emoji by disambiguating based on context (e.g., cricket bat vs. bat) and can express complex concepts compositionally by combining multiple emoji (e.g., "Emojinize" is translated to input-latin-letters right-arrow grinning-face). In a cloze-test-based user study, we show that Emojinize's emoji translations increase the human guessability of masked words by 55%, whereas human-picked emoji translations do so by only 29%. These results suggest that emoji provide a sufficiently rich vocabulary to accurately translate a wide variety of words. Moreover, annotating words and phrases with Emojinize's emoji translations opens the door to numerous downstream applications, including children learning how to read, adults learning foreign languages, and text understanding for people with learning disabilities.

en cs.CL, cs.HC
DOAJ Open Access 2024
Sulla soglia dei mondi possibili. Lavagetto e la teoria della letteratura

Simone Carati

This essay aims to highlight the connections between possible-worlds theory and the critical work of Mario Lavagetto. Retracing Lavagetto's idea of literary theory through a selection of his essays, the analysis focuses on a series of key aspects, such as thresholds, lies, boundaries, and blank spaces, which allow several links to the theory of fictional worlds. Specific attention is devoted to the itineraries that Lavagetto, as a literary critic, explores in his essays in cooperation with the reader, interpreted as openings onto other possible worlds.

Geography. Anthropology. Recreation, Language. Linguistic theory. Comparative grammar
arXiv Open Access 2023
N-Gram Nearest Neighbor Machine Translation

Rui Lv, Junliang Guo, Rui Wang et al.

Nearest neighbor machine translation augments Autoregressive Translation (AT) with k-nearest-neighbor retrieval, comparing the similarity between the token-level context representations of the target tokens in the query and the datastore. However, the token-level representation may introduce noise when translating ambiguous words, or fail to provide accurate retrieval results when the representation generated by the model contains indistinguishable context information, e.g., in Non-Autoregressive Translation (NAT) models. In this paper, we propose a novel n-gram nearest neighbor retrieval method that is model agnostic and applicable to both AT and NAT models. Specifically, we concatenate the adjacent n-gram hidden representations as the key, while the tuple of corresponding target tokens is the value. In inference, we propose tailored decoding algorithms for AT and NAT models respectively. We demonstrate that the proposed method consistently outperforms the token-level method on both AT and NAT models, on general as well as domain-adaptation translation tasks. On domain adaptation, the proposed method brings improvements of 1.03 and 2.76 average BLEU points on AT and NAT models, respectively.

en cs.CL, cs.LG
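The datastore construction described in the abstract above can be illustrated with a toy sketch: each key is the concatenation of n adjacent hidden vectors and each value is the matching target-token n-gram. The hand-made hidden states and brute-force search are assumptions of the sketch; a real system would use decoder representations and an approximate-neighbor index.

```python
# Toy n-gram nearest-neighbor datastore (illustrative, not the paper's code).
from math import dist

def build_ngram_datastore(hiddens, tokens, n=2):
    """Pair each concatenated n-gram of hidden states with its target n-gram."""
    store = []
    for i in range(len(hiddens) - n + 1):
        key = [x for h in hiddens[i:i + n] for x in h]  # concat n states
        store.append((key, tuple(tokens[i:i + n])))
    return store

def retrieve(store, query_key, k=1):
    """Return the k entries whose keys are closest to the query key."""
    return sorted(store, key=lambda kv: dist(kv[0], query_key))[:k]

store = build_ngram_datastore(
    [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]], ["la", "casa", "azul"], n=2
)
hits = retrieve(store, [0.0, 1.0, 1.0, 0.0])  # query built the same way
```

Retrieving whole n-grams rather than single tokens is what lets the method tolerate the less distinguishable per-token representations of NAT models.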
arXiv Open Access 2023
Data Augmentation for Code Translation with Comparable Corpora and Multiple References

Yiqing Xie, Atharva Naik, Daniel Fried et al.

One major challenge of translating code between programming languages is that parallel training data is often limited. To overcome this challenge, we present two data augmentation techniques, one that builds comparable corpora (i.e., code pairs with similar functionality), and another that augments existing parallel data with multiple reference translations. Specifically, we build and analyze multiple types of comparable corpora, including programs generated from natural language documentation using a code generation model. Furthermore, to reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data and filter the translations by unit tests, which increases variation in target translations. Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% Computational Accuracy (CA@1), which verifies the correctness of translations by execution. The code is available at https://github.com/Veronicium/CMTrans.

en cs.CL, cs.LG
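The filter-by-execution step in the abstract above can be sketched in miniature: generate several candidate translations, run the unit tests against each, and keep only the ones that pass. The candidates, function name `f`, and tests here are made up for the sketch.

```python
# Toy illustration of filtering candidate code translations by unit tests.

def passes_tests(candidate_src, tests):
    """Execute a candidate translation and check it against (input, expected) pairs."""
    env = {}
    try:
        exec(candidate_src, env)  # compile the candidate translation
        return all(env["f"](arg) == expected for arg, expected in tests)
    except Exception:
        return False

candidates = [
    "def f(x): return x * 2",   # faithful translation
    "def f(x): return x + 2",   # buggy translation
]
unit_tests = [(1, 2), (3, 6)]
kept = [c for c in candidates if passes_tests(c, unit_tests)]
```

Surviving candidates can then serve as additional reference translations, which is how the paper increases variation in the target side of the parallel data.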
arXiv Open Access 2023
An evaluation of Google Translate for Sanskrit to English translation via sentiment and semantic analysis

Akshat Shukla, Chaarvi Bansal, Sushrut Badhe et al.

Google Translate has been prominent for language translation; however, limited work has been done in evaluating the quality of its translation when compared to human experts. Sanskrit is one of the oldest written languages in the world. In 2022, the Sanskrit language was added to the Google Translate engine. Sanskrit is known as the mother of languages such as Hindi and an ancient source of the Indo-European group of languages. Sanskrit is the original language for sacred Hindu texts such as the Bhagavad Gita. In this study, we present a framework that evaluates Google Translate for Sanskrit using the Bhagavad Gita. We first publish a translation of the Bhagavad Gita from Sanskrit using Google Translate. Our framework then compares the Google Translate version of the Bhagavad Gita with expert translations using sentiment and semantic analysis via BERT-based language models. Our results indicate that, in terms of sentiment and semantic analysis, there is a low level of similarity in selected verses of Google Translate when compared to expert translations. In the qualitative evaluation, we find that Google Translate is unsuitable for translation of certain Sanskrit words and phrases due to the text's poetic nature, contextual significance, metaphor, and imagery. The mistranslations are not surprising since the Bhagavad Gita is known as a difficult text not only to translate but also to interpret, since it relies on contextual, philosophical, and historical information. Our framework lays the foundation for automatic evaluation of other languages by Google Translate.

en cs.CL, cs.AI
arXiv Open Access 2023
Adaptive Machine Translation with Large Language Models

Yasmin Moslem, Rejwanul Haque, John D. Kelleher et al.

Consistency is a key requirement of high-quality translation. It is especially important to adhere to pre-approved terminology and adapt to corrected translations in domain-specific projects. Machine translation (MT) has achieved significant progress in the area of domain adaptation. However, real-time adaptation remains challenging. Large-scale language models (LLMs) have recently shown interesting capabilities of in-context learning, where they learn to replicate certain input-output text generation patterns, without further fine-tuning. By feeding an LLM at inference time with a prompt that consists of a list of translation pairs, it can then simulate the domain and style characteristics. This work aims to investigate how we can utilize in-context learning to improve real-time adaptive MT. Our extensive experiments show promising results at translation time. For example, LLMs can adapt to a set of in-domain sentence pairs and/or terminology while translating a new sentence. We observe that the translation quality with few-shot in-context learning can surpass that of strong encoder-decoder MT systems, especially for high-resource languages. Moreover, we investigate whether we can combine MT from strong encoder-decoder models with fuzzy matches, which can further improve translation quality, especially for less supported languages. We conduct our experiments across five diverse language pairs, namely English-to-Arabic (EN-AR), English-to-Chinese (EN-ZH), English-to-French (EN-FR), English-to-Kinyarwanda (EN-RW), and English-to-Spanish (EN-ES).

en cs.CL
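The fuzzy-match prompting described in the abstract above can be sketched as follows: retrieve the translation-memory pairs most similar to the new source sentence and place them in front of it as few-shot examples. The similarity measure (`difflib` string ratio) and prompt format are assumptions of this sketch, not the paper's implementation.

```python
# Minimal sketch of few-shot adaptive MT prompting with fuzzy matches.
from difflib import SequenceMatcher

def fuzzy_topk(source, memory, k=2):
    """memory: list of (source, target) translation pairs, ranked by similarity."""
    return sorted(
        memory,
        key=lambda p: SequenceMatcher(None, source, p[0]).ratio(),
        reverse=True,
    )[:k]

def build_prompt(source, memory, k=2):
    shots = fuzzy_topk(source, memory, k)
    lines = [f"English: {s}\nSpanish: {t}" for s, t in shots]
    lines.append(f"English: {source}\nSpanish:")  # sentence to translate
    return "\n\n".join(lines)

memory = [("good morning", "buenos dias"), ("thank you", "gracias")]
prompt = build_prompt("good morning everyone", memory, k=1)
```

At inference time the LLM completes the final "Spanish:" line, replicating the domain and style of the retrieved pairs via in-context learning.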
DOAJ Open Access 2023
Traducción Automática y Posedición en TAV: estudio piloto de una práctica en alza

Verónica Arnáiz-Uzquiza, Paula Igareda González

Machine translation and post-editing (MTPE) are still often regarded as enemies of the creative freedom traditionally associated with the audiovisual genre and its translation. The rejection that its growing presence provokes among much of the professional community means that audiovisual translation (AVT) is not one of the disciplines in which the use of MTPE is openly discussed. Nevertheless, the benefits of its adoption need to be studied at every level, considering all parties involved in the process. To examine the differences between the traditional AVT process (without tools) and a workflow implemented with MTPE, we conducted a pilot study comparing the English-Spanish translation of a series of three-minute news and sports videos with different technical characteristics. The results will yield an objective comparative description of both processes.

Translating and interpreting
arXiv Open Access 2022
Screening Gender Transfer in Neural Machine Translation

Guillaume Wisniewski, Lichao Zhu, Nicolas Ballier et al.

This paper aims at identifying the information flow in state-of-the-art machine translation systems, taking as an example the transfer of gender when translating from French into English. Using a controlled set of examples, we experiment with several ways to investigate how gender information circulates in an encoder-decoder architecture, considering both probing techniques and interventions on the internal representations used in the MT system. Our results show that gender information can be found in all token representations built by the encoder and the decoder, and lead us to conclude that there are multiple pathways for gender transfer.

arXiv Open Access 2022
TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation

El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed

We present TURJUMAN, a neural toolkit for translating from 20 languages into Modern Standard Arabic (MSA). TURJUMAN exploits the recently-introduced text-to-text Transformer AraT5 model, endowing it with a powerful ability to decode into Arabic. The toolkit offers the possibility of employing a number of diverse decoding methods, making it suited for acquiring paraphrases for the MSA translations as an added value. To train TURJUMAN, we sample from publicly available parallel data employing a simple semantic similarity method to ensure data quality. This allows us to prepare and release AraOPUS-20, a new machine translation benchmark. We publicly release our translation toolkit (TURJUMAN) as well as our benchmark dataset (AraOPUS-20).

en cs.CL, cs.AI
DOAJ Open Access 2022
Per una definizione di “inconscio tecno-scopico”: straniamenti nella letteratura sperimentale di Giulia Niccolai e Patrizia Vicinelli

Beatrice Seligardi

By looking at the representation strategies in some works by Giulia Niccolai and Patrizia Vicinelli, two of the foremost female representatives of Italian experimental literature in the twentieth century, this article aims to provide a specific interpretation of the estrangement effect produced by the encounter between word, image and the scopic function of both the human eye and optical machines. This mimetic device will be discussed in light of Walter Benjamin's and Rosalind Krauss's definition of the "optical unconscious", as well as Franco Vaccari's notion of the "technological unconscious", thus leading to a possible new heuristic concept: the "techno-scopic unconscious".

Geography. Anthropology. Recreation, Language. Linguistic theory. Comparative grammar
S2 Open Access 2021
Indirect translation in translator training: taking stock and looking ahead

E. Torres-Simón, Hanna Pięta, Rita Bueno Maia et al.

In order to translate and be translated, low-diffusion languages often use strategies that differ from those used by widely spoken languages and therefore create particular challenges for translators. One such strategy is indirect translation (including relay interpreting). Since there are conflicting opinions about this practice within the translation community, it is unclear to what extent indirect translation is present in translator training. In order to shed some light on this issue, this article reports on an exploratory study that looked at mentions of indirect translation in the European Masters in Translation (EMT) competences, at references to indirect translation in the syllabi of EMT programmes, at tasks to develop specific skills of indirect translation in mainstream training textbooks, and at the responses to a survey addressed to translator trainers. Results suggest that indirect translation is overlooked at the institutional level (in the list of EMT competences, in the official EMT syllabi and in published textbooks) but still reaches future translators working with low-diffusion languages via in-class tasks developed by a significant part of surveyed trainers.

11 citations en Computer Science
arXiv Open Access 2021
Phrase-level Active Learning for Neural Machine Translation

Junjie Hu, Graham Neubig

Neural machine translation (NMT) is sensitive to domain shift. In this paper, we address this problem in an active learning setting where we can spend a given budget on translating in-domain data, and gradually fine-tune a pre-trained out-of-domain NMT model on the newly translated data. Existing active learning methods for NMT usually select sentences based on uncertainty scores, but these methods require costly translation of full sentences even when only one or two key phrases within the sentence are informative. To address this limitation, we re-examine previous work from the phrase-based machine translation (PBMT) era that selected not full sentences but rather individual phrases. However, while incorporating these phrases into PBMT systems was relatively simple, it is less trivial for NMT systems, which need to be trained on full sequences to capture larger structural properties of sentences unique to the new domain. To overcome these hurdles, we propose to select both full sentences and individual phrases from unlabelled data in the new domain for routing to human translators. In a German-English translation task, our active learning approach achieves consistent improvements over uncertainty-based sentence selection methods, improving by up to 1.2 BLEU over strong active learning baselines.

en cs.CL, cs.AI
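The budgeted selection step in the abstract above can be illustrated with a toy sketch: rank candidate sentences and phrases by an uncertainty score and greedily spend a token budget on the most uncertain units. The scoring function and cost model here are illustrative assumptions, not the paper's method.

```python
# Toy sketch of phrase-level active learning: mix sentences and phrases
# in one pool and pick the most uncertain units under a token budget.

def select_for_annotation(units, uncertainty, budget):
    """units: candidate sentences and phrases; budget: max tokens to translate."""
    ranked = sorted(units, key=uncertainty, reverse=True)
    chosen, spent = [], 0
    for text in ranked:
        cost = len(text.split())  # tokens a human translator would produce
        if spent + cost <= budget:
            chosen.append(text)
            spent += cost
    return chosen

# Hypothetical uncertainty scores: short novel phrases score highest.
scores = {"one two three": 0.1, "novel": 0.9, "rare term": 0.5}
picked = select_for_annotation(list(scores), scores.get, budget=3)
```

Because phrases are cheaper than full sentences, the same budget covers more of the informative new-domain vocabulary, which is the core argument for phrase-level selection.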

Page 24 of 6882