Building a Strong Instruction Language Model for a Less-Resourced Language
Domen Vreš, Tjaša Arčon, Timotej Petrič
et al.
Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to the Slovene language using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pretraining tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena. We show that the described model outperforms the 12B Gemma 3 across all three scenarios and performs comparably to the much larger commercial GPT-4o in the Slovene LLM arena, achieving a win rate of over 60%.
Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek
James L. Zainaldin, Cameron Pattison, Manuela Marai
et al.
This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose. We evaluate translations by three commercial LLMs (Claude, Gemini, ChatGPT) of twenty paragraph-length passages from two works by the Greek physician Galen of Pergamum (ca. 129-216 CE): On Mixtures, which has two published English translations, and On the Composition of Drugs according to Kinds, which has never been fully translated into English. We assess translation quality using both standard automated evaluation metrics (BLEU, chrF++, METEOR, ROUGE-L, BERTScore, COMET, BLEURT) and expert human evaluation via a modified Multidimensional Quality Metrics (MQM) framework applied to all 60 translations by a team of domain specialists. On the previously translated expository text, LLMs achieved high translation quality (mean MQM score 95.2/100), with performance approaching expert level. On the untranslated pharmacological text, aggregate quality was lower (79.9/100) but with high variance driven by two passages presenting extreme terminological density; excluding these, scores converged to within 4 points of the translated text. Terminology rarity, operationalized via corpus frequency in the literary Diorisis Ancient Greek Corpus, emerged as a strong predictor of translation failure (r = -.97 for passage-level quality on the untranslated text). Automated metrics showed moderate correlation with human judgment overall on the text with a wide quality spread (Composition), but no metric discriminated among high-quality translations. We discuss implications for the use of LLMs in Classical scholarship and for the design of automated evaluation pipelines for low-resource ancient languages.
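The correlation reported in the abstract above (r = -.97 between terminology rarity and passage-level quality) can be illustrated with a minimal sketch. It assumes that r denotes a Pearson coefficient, which the abstract does not state, and the function name `pearson_r` and the sample numbers are invented for illustration only; they are not data from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented example: a rarity score per passage (higher = rarer terms)
# and an MQM-style quality score per passage (0-100).
rarity = [0.1, 0.2, 0.4, 0.7, 0.9]
quality = [96.0, 94.0, 90.0, 71.0, 55.0]
print(round(pearson_r(rarity, quality), 2))
```

A strongly negative value means that passages with rarer terminology tend to receive lower quality scores, which is the direction of the effect the abstract describes.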
The Patrologia Graeca Corpus: OCR, Annotation, and Open Release of Noisy Nineteenth-Century Polytonic Greek Editions
Chahan Vidal-Gorène, Bastien Kindt
We present the Patrologia Graeca Corpus, the first large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek. The collection covers the remaining undigitized volumes of the Patrologia Graeca (PG), printed in complex bilingual (Greek-Latin) layouts and characterized by highly degraded polytonic Greek typography. Through a dedicated pipeline combining YOLO-based layout detection and CRNN-based text recognition, we achieve a character error rate (CER) of 1.05% and a word error rate (WER) of 4.69%, largely outperforming existing OCR systems for polytonic Greek. The resulting corpus contains around six million lemmatized and part-of-speech tagged tokens, aligned with full OCR and layout annotations. Beyond its philological value, this corpus establishes a new benchmark for OCR on noisy polytonic Greek and provides training material for future models, including LLMs.
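The CER and WER figures quoted above are conventionally computed as an edit distance divided by the reference length. The sketch below shows that standard definition; it is not the authors' evaluation code. The function names are hypothetical, and the NFC normalization step is an assumption added here because polytonic Greek characters can be encoded in either composed or decomposed form.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    # Assumption: normalize to NFC so composed and decomposed diacritics match.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words."""
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# One missing diacritic in a five-word line: 1 of 5 words is wrong.
print(wer("ἐν ἀρχῇ ἦν ὁ λόγος", "ἐν ἀρχη ἦν ὁ λόγος"))
```

Because a single wrong character makes the whole word count as an error, WER is normally several times higher than CER on the same output, consistent with the gap between the two figures reported above.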
Redefining technology for indigenous languages
Silvia Fernandez-Sabido, Laura Peniche-Sabido
In this paper, we offer an overview of indigenous languages, identifying the causes of their devaluation and the need for legislation on language rights. We review the technologies used to revitalize these languages, finding that when they come from outside, they often have the opposite effect to what they seek; however, when developed from within communities, they become powerful instruments of expression. We propose that the inclusion of Indigenous knowledge in large language models (LLMs) will enrich the technological landscape, but must be done in a participatory environment that encourages the exchange of knowledge.
Automatic Speech Recognition for Greek Medical Dictation
Vardis Georgilas, Themos Stafylakis
Medical dictation systems are essential tools in modern healthcare, enabling accurate and efficient conversion of speech into written medical documentation. The main objective of this paper is to create a domain-specific system for Greek medical speech transcription. The ultimate goal is to assist healthcare professionals by reducing the burden of manual documentation and improving workflow efficiency. Towards this goal, we develop a system that combines automatic speech recognition techniques with a text correction model, allowing better handling of domain-specific terminology and linguistic variations in Greek. Our approach leverages both acoustic and textual modeling to create more realistic and reliable transcriptions. We focused on adapting existing language and speech technologies to the Greek medical context, addressing challenges such as complex medical terminology and linguistic inconsistencies. Through domain-specific fine-tuning, our system achieves more accurate and coherent transcriptions, contributing to the development of practical language technologies for the Greek healthcare sector.
From Textbook to Talkbot: A Case Study of a Greek-Language RAG-Based Chatbot in Higher Education
Maria Eleni Koutsiaki, Marina Delianidi, Chaido Mizeli
et al.
The integration of AI chatbots into educational settings has opened new pathways for transforming teaching and learning, offering enhanced support to both educators and learners. This study investigates the design and application of an AI chatbot as an educational tool in higher education. Designed to operate in the Greek language, the chatbot addresses linguistic challenges unique to Greek while delivering accurate, context-grounded support aligned with the curriculum. The AI chatbot is built on the Retrieval-Augmented Generation (RAG) framework, which grounds its responses in specific course content. The RAG architecture significantly enhances the chatbot's reliability by providing accurate, context-aware responses while mitigating common challenges associated with large language models (LLMs), such as hallucinations and misinformation. The AI chatbot serves a dual purpose: it enables students to access accurate, on-demand academic support and assists educators in the rapid creation of relevant educational materials. This dual functionality promotes learner autonomy and streamlines the instructional design process. The study aims to evaluate the effectiveness, reliability, and perceived usability of RAG-based chatbots in higher education, exploring their potential to enhance educational practices and outcomes as well as to support the broader adoption of AI technologies in language-specific educational contexts. Findings from this research are expected to contribute to the emerging field of AI-driven education by demonstrating how intelligent systems can be effectively aligned with pedagogical goals.
A Transformer-Based Framework for Greek Sign Language Production using Extended Skeletal Motion Representations
Chrysa Pratikaki, Panagiotis Filntisis, Athanasios Katsamanis
et al.
Sign languages are the primary form of communication for Deaf communities across the world. To break the communication barriers between the Deaf and Hard-of-Hearing and the hearing communities, it is imperative to build systems capable of translating spoken language into sign language and vice versa. Building on insights from previous research, we propose a deep learning model for Sign Language Production (SLP), which to our knowledge is the first attempt at Greek SLP. We tackle this task by utilizing a transformer-based architecture that enables translation from text input to human pose keypoints, and vice versa. We evaluate the effectiveness of the proposed pipeline on the Greek SL dataset Elementary23 through a series of comparative analyses and ablation studies. Our pipeline's components, which include data-driven gloss generation, training through video-to-text translation, and a scheduling algorithm for teacher forcing versus auto-regressive decoding, seem to enhance the quality of the produced SL videos.
FarFetched: Entity-centric Reasoning and Claim Validation for the Greek Language based on Textually Represented Environments
Dimitris Papadopoulos, Katerina Metropoulou, Nikolaos Matsatsinis
et al.
Our collective attention span is shortened by the flood of online information. With FarFetched, we address the need for automated claim validation based on the aggregated evidence derived from multiple online news sources. We introduce an entity-centric reasoning framework in which latent connections between events, actions, or statements are revealed via entity mentions and represented in a graph database. Using entity linking and semantic similarity, we offer a way for collecting and combining information from diverse sources in order to generate evidence relevant to the user's claim. Then, we leverage textual entailment recognition to quantitatively determine whether this assertion is credible, based on the created evidence. Our approach tries to fill the gap in automated claim validation for less-resourced languages and is showcased on the Greek language, complemented by the training of relevant semantic textual similarity (STS) and natural language inference (NLI) models that are evaluated on translated versions of common benchmarks.
Translation of Ancient Works During the First Two Years of the Full-scale Russian Invasion of Ukraine
Svyatoslav Zubchenko
The article attempts to systematize all Ukrainian-language translations from Latin and Ancient Greek that were published in Ukrainian publishing houses during the first two years of the full-scale war of the Russian Federation against Ukraine (from February 24, 2022, to February 24, 2024). The proposed research allows for an understanding of the current state of the book market, the place of translations of ancient classics within it, and, in the long term, to clarify the impact of the existential war on publishing and cultural processes in Ukraine, to forecast potential problems and challenges that this sector will face, and to outline possible ways to overcome them. The aim of the article is to systematize and record trends in the field of translation from classical languages during the first two years of the full-scale aggression of the Russian Federation against Ukraine. This aim entails a number of specific tasks, whose essence and sequence are determined by the logic of the research: 1) to find out the state of the publishing market in the first and second year of the full-scale invasion; 2) to analyze the reading tendencies of Ukrainian citizens (number of books read per year, place of purchase of new books, language of reading, most popular genres); 3) to form an idea about the Ukrainian translator of this period (problems of relocation, income, the most popular languages for translation); 4) to systematize the translations from Latin and Ancient Greek published in 2022-2024, the publishers who took care of them and the cohort of translators who worked with classical material. The article identifies the most productive translators, collects information about publishing houses that dealt with ancient material during this time, and outlines the problems and trends in the literary and translation field in Ukraine in accordance with the most relevant studies and surveys.
In particular, it was found that between 2022 and 2024, publishing houses reissued more old translations (sometimes with changes and additions) than published translations of classical works for the first time. A trend towards a decrease in the number of active publishers of ancient classics (predominantly concentrated in western Ukraine) was also highlighted, along with the maintenance of the same list of active translators from Ancient Greek and Latin (among whom A.O. Sodomora holds a prominent place). The article harmonizes with previous research efforts in the fields of bibliography, translation history, and classical philology, continuing, supplementing, and to some extent clarifying them.
Sprachdatenbasierte Modellierung von Wissensnetzen in der mittelalterlichen Romania (ALMA): Projektskizze
Elton Prifti, W. Schweickard, M. Selig
et al.
We present ALMA, a new research project aimed at investigating the interaction between language and knowledge practices from AD 1100 to 1500. Our primary question is how Medieval Italian, French, Occitan, Catalan, and Spanish developed into languages of knowledge and scholarship (German Wissen(schafts)sprachen) in permanent opposition to and exchange with the predominant Latin (but also with Arabic, Greek, and Hebrew). Focusing on two domains, medicine and law, the project combines linguistics, text philology, and the history of science with the Digital Humanities and ontology engineering. ALMA will create two multi-lingual, domain-specific text corpora by integrating text editions of hitherto unedited manuscripts and incunabula, and digitized printed editions. Our corpus-linguistic exploration of the ALMA corpora will provide the basis for lexical-semantic studies that analyze emerging knowledge networks and the depth of their linguistic representations. We hypothesize that language evolution and the development of more complex linguistic structures will allow for measuring the impact of knowledge practices on medieval vernacular languages. We will trace the dissemination of lexical material across languages, language varieties, cultural spaces, and periods. This will enable us to follow specific vernacular communication channels. We will use cutting-edge technologies to compile, publish, and share our findings, and to model them in the form of historicized ontologies and Linked Data. Our onomasiological, ontology-driven approach will result in the creation of domain models that can be re-used within the Semantic Web. This has great potential to be relevant for researchers from different disciplines.
Erste Griechischakademie NRW für Schülerinnen und Schüler in Bochum
Christiane Schulz
Greek language and literature. Latin language and literature, Philology. Linguistics
Language Cognition and Language Computation -- Human and Machine Language Understanding
Shaonan Wang, Nai Ding, Nan Lin
et al.
Language understanding is a key scientific issue in the fields of cognitive and computer science. However, the two disciplines differ substantially in the specific research questions. Cognitive science focuses on analyzing the specific mechanism of the brain and investigating the brain's response to language; few studies have examined the brain's language system as a whole. By contrast, computer scientists focus on the efficiency of practical applications when choosing research questions but may ignore the most essential laws of language. Given these differences, can a combination of the disciplines offer new insights for building intelligent language models and studying language cognitive mechanisms? In the following text, we first review the research questions, history, and methods of language understanding in cognitive and computer science, focusing on the current progress and challenges. We then compare and contrast the research of language understanding in cognitive and computer sciences. Finally, we review existing work that combines insights from language cognition and language computation and offer prospects for future development trends.
GreekBART: The First Pretrained Greek Sequence-to-Sequence Model
Iakovos Evdaimon, Hadi Abdine, Christos Xypolopoulos
et al.
The era of transfer learning has revolutionized the fields of Computer Vision and Natural Language Processing, bringing powerful pretrained models with exceptional performance across a variety of tasks. Specifically, Natural Language Processing tasks have been dominated by transformer-based language models. In Natural Language Inference and Natural Language Generation tasks, the BERT model and its variants, as well as the GPT model and its successors, demonstrated exemplary performance. However, the majority of these models are pretrained and assessed primarily for the English language or on a multilingual corpus. In this paper, we introduce GreekBART, the first Seq2Seq model based on BART-base architecture and pretrained on a large-scale Greek corpus. We evaluate and compare GreekBART against BART-random, Greek-BERT, and XLM-R on a variety of discriminative tasks. In addition, we examine its performance on two NLG tasks from GreekSUM, a newly introduced summarization dataset for the Greek language. The model, the code, and the new summarization dataset will be publicly available.
When a Language Question Is at Stake. A Revisited Approach to Label Sensitive Content
Stetsenko Daria
Many under-resourced languages require high-quality datasets for specific tasks such as offensive language detection and disinformation or misinformation identification. However, the intricacies of the content may have a detrimental effect on the annotators. The article aims to revisit an approach to pseudo-labeling sensitive data, using the example of Ukrainian tweets covering the Russian-Ukrainian war. This acute topic is currently the target of various language manipulations that give rise to widespread disinformation and profanity on social media platforms. The conducted experiment highlights three main stages of data annotation and underlines the main obstacles during machine annotation. Ultimately, we provide a fundamental statistical analysis of the obtained data, an evaluation of the models used for pseudo-labeling, and guidelines on how scientists can leverage the corpus to conduct more advanced research and extend the existing data samples without annotators' engagement.
Interpretation of thesaurus in the sciences of culture and society
I. Grigorieva
The article is devoted to the methodology of cultural research. It is revealed that the concept of thesaurus in cultural studies, philology, and anthropology does not coincide with those accepted in linguistics and computer science. The author shows that the concept of thesaurus in cultural studies corresponds to the idea of a certain reserve, wealth, which can be called a treasure (from the Greek “thesaurus” — treasure, treasury). In other words, we can talk about the richness of language, knowledge, life experience, and so on. The author relies on the concept by Vl. A. Lukov and V.A. Lukov, who argue that the purpose of the concept of "thesaurus" appears when it is necessary to reflect the completeness of some knowledge that is essential for a given subject for some reason. The article highlights two main characteristics of the thesaurus: the first one is that we are speaking about the completeness/sufficiency of information, without taking into account its volume and measure; the second one is that the materiality of this knowledge is determined by the subject himself — his values, goals, attitudes, etc.
Community Translation and Modern Philosophy
J. Naudé, Cynthia L. Miller-Naudé
The Rosetta Stele, an inscribed stone slab, was discovered in July 1799 near the town of Rashid, ancient Rosetta, which is situated in the western part of the Nile delta of Egypt, by soldiers of Napoleon Bonaparte’s invading army. After the French surrender of Egypt in 1801, the stele passed into British hands and is now in the British Museum in London. The commemorative stele contains three versions of the same text (in Egyptian hieroglyphic, Egyptian Demotic and ancient Greek script, representing two varieties of the ancient Egyptian language and the ancient Greek language). It recounts a decree issued on 27 March 196 BCE by Egyptian priests during the Ptolemaic dynasty on behalf of Ptolemy V Epiphanes to commemorate his crowning. It took more than 20 years and various attempts by scholars to decipher the Demotic and hieroglyphic Egyptian texts. This was done by utilising the mechanisms of modern philology, which had been established as a field early in the 1800s. Standing on the shoulders of his predecessors, Jean-François Champollion was the first Egyptologist to crack the code of hieroglyphic writing by realising that some of the signs were alphabetic, some syllabic, and some determinative. The discovery and decipherment of the Rosetta Stele put multilingualism and the practice of translation and interpreting during the Ptolemaic reign over Egypt into focus. In this essay we describe the rediscovery, as well as the emergence and growth of new knowledge, that was unlocked by the decipherment of the Rosetta Stele, including its implications for African orthographies.
VOĐENA PISANA PRODUKCIJA NA MODERNOM GRČKOM JEZIKU – STUDIJA SLUČAJA
A. Kostić
Gap-fill exercises, which are a type of open-ended exercises, have an important role in language assessment (Durbaba 2011: 153). These exercises give students the opportunity to apply the acquired knowledge and adopted vocabulary, shifting the focus away from the mere translation of terms into the native language (Smith et al. 2010: 1). The goal of our research is the analysis of errors in the tests of a group of first- and second-year students at the Department of Modern Greek Studies at the Faculty of Philology, University of Belgrade. The students were at the A1, A2 or B1 level (CEFR) of communicative competence in Modern Greek and they were asked to fill the gaps with adequate words or phrases. The detected errors are divided by type. It is examined whether the students solved the task by paying attention to the concord and the meaning of the given text, while in the tasks at the B1 level the knowledge of the required lexical and grammatical collocations was also examined. In addition, the results of a questionnaire are presented, in which the students were asked to highlight which were the most demanding segments of solving these tasks, as well as which strategies they use in order to prepare for this part of the exam. The results of the research confirm that this task is very demanding for the students of Modern Greek as a foreign language, as well as that they usually prepare for this part of the exam by vocabulary recapitulation. The most common mistakes were identified in grammatical gender and number, as well as in the false use of lexical collocations.
Lexical innovations of Moses Gumilyevsky in his Areopagitica translation of the 18th century
A. Volskaya, N. Nikolaeva
The article analyzes some lexical units (nouns) from the first Russian translation of Areopagitica, made in the 18th century by hieromonk Moses Gumilyevsky, which is of great interest from the point of view of the Russian literary language history of that period. These units are not recorded in lexicographic sources, but most of them are formed according to regular word-formation models of the Russian language. Some of them are structural calques or semi-calques of Greek equivalents. It is noteworthy that even when making a calque, Gumilyevsky tended to create a semantic translation. The purpose of the work is the lexical and word-formation analysis of such word usage. To achieve it, we classified the studied units in accordance with their representations in historical lexicographic sources. Our research shows that Gumilyevsky's new words are primarily nomina abstracta with the meaning of quality, less often of action (or its result), in accordance with the textual and genre affiliation of the treatise translated by him and its content features. Even within the framework of a small volume of the analyzed vocabulary, we have traced the main trends of Gumilyevsky's translation as a whole: the translation of substantiates by suffix formations, the preference for names with the meaning of abstract quality and the semantically oriented transmission of the original text.
Rez. U. Jesper, St. Kipf, Th. Riecke-Baulecke (Hrsg.), Basiswissen Lehrerbildung: Latein unterrichten, Hannover 2021
Thomas Kurth
Greek language and literature. Latin language and literature, Philology. Linguistics
Analysing the Greek Parliament Records with Emotion Classification
John Pavlopoulos, Vanessa Lislevand
In this project, we tackle emotion classification for the Greek language, presenting and releasing a new dataset in Greek. We fine-tune and assess Transformer-based masked language models that were pre-trained on monolingual and multilingual resources, and we present the results per emotion and by aggregating at the sentiment and subjectivity level. The potential of the presented resources is investigated by detecting and studying the emotion of 'disgust' in the Greek Parliament records. We: (a) locate the months with the highest values from 1989 to present, (b) rank the Greek political parties based on the presence of this emotion in their speeches, and (c) study the emotional context shift of words used to stigmatise people.