Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark
Charalampos Mastrokostas, Nikolaos Giarelis, Nikos Karacapilidis
Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite these advancements, research on LLMs has primarily targeted high-resourced languages (e.g., English), and only recently has attention shifted toward multilingual models. However, these models demonstrate a training data bias towards a small number of popular languages or rely on transfer learning from high- to under-resourced languages; this may lead to a misrepresentation of social, cultural, and historical aspects. To address this challenge, monolingual LLMs have been developed for under-resourced languages; however, their effectiveness remains less studied when compared to multilingual counterparts on language-specific tasks. In this study, we address this research gap in Greek QA by contributing: (i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist; (ii) a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; and (iii) an extensive evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies. We release our code and data to facilitate reproducibility.
Bericht zur DAV-Tagung im September 2025: Gemeinsam neue Wege erschließen
Michele Paschen
Greek language and literature. Latin language and literature, Philology. Linguistics
GLaM-Sign: Greek Language Multimodal Lip Reading with Integrated Sign Language Accessibility
Dimitris Kouremenos, Klimis Ntalianis
The Greek Language Multimodal Lip Reading with Integrated Sign Language Accessibility (GLaM-Sign) [1] is a groundbreaking resource in accessibility and multimodal AI, designed to support Deaf and Hard-of-Hearing (DHH) individuals. Developed from the FEELIT project [2], it integrates high-resolution audio, video, textual transcriptions, and Greek Sign Language translations for applications like real-time sign language translation and enhanced subtitle synchronization. While its primary focus is on promoting inclusivity in the Greek tourism sector, its adaptability extends to education, healthcare, and public services. Future advancements will enhance word-level precision and scalability to additional languages, supported by advanced AI methodologies and collaborations with diverse stakeholders. This dataset underscores the transformative potential of multimodal resources in bridging communication gaps, fostering innovation, and setting a benchmark for ethical AI and inclusive technologies.
Soft Inductive Bias Approach via Explicit Reasoning Perspectives in Inappropriate Utterance Detection Using Large Language Models
Ju-Young Kim, Ji-Hong Park, Se-Yeon Lee
et al.
Recent incidents in certain online games and communities, where anonymity is guaranteed, show that unchecked inappropriate remarks frequently escalate into verbal abuse and even criminal behavior, raising significant social concerns. Consequently, there is a growing need for research on techniques that can detect inappropriate utterances within conversational texts to help build a safer communication environment. Although large-scale language models trained on Korean corpora and chain-of-thought reasoning have recently gained attention, research applying these approaches to inappropriate utterance detection remains limited. In this study, we propose a soft inductive bias approach that explicitly defines reasoning perspectives to guide the inference process, thereby promoting rational decision-making and preventing errors that may arise during reasoning. We fine-tune a Korean large language model using the proposed method and conduct both quantitative performance comparisons and qualitative evaluations across different training strategies. Experimental results show that the Kanana-1.5 model achieves an average accuracy of 87.0046, improving by approximately 3.89 percent over standard supervised learning. These findings indicate that the proposed method goes beyond simple knowledge imitation by large language models and enables more precise and consistent judgments through constrained reasoning perspectives, demonstrating its effectiveness for inappropriate utterance detection.
Officina di IG XIV2 – I graffiti su pilastro dall’acropoli di Monte Sannace
Fanizzi, Federica
This study aims to provide an epigraphic and contextual analysis of the graffiti on the pier of the acropolis of Monte Sannace. The data, of exceptional significance, are especially valuable considering the extensive research conducted on abecedaria and ancient languages of the Mediterranean. The document includes four abecedaria and six other graffiti, which are analyzed in relation to their regional epigraphic landscape and through a comparison consistent with their micro-epigraphic context.
Ancient history, Greek philology and language
Are Compressed Language Models Less Subgroup Robust?
Leonidas Gee, Andrea Zugarini, Novi Quadrianto
To reduce the inference cost of large language models, model compression is increasingly used to create smaller scalable models. However, little is known about their robustness to minority subgroups defined by the labels and attributes of a dataset. In this paper, we investigate the effects of 18 different compression methods and settings on the subgroup robustness of BERT language models. We show that worst-group performance does not depend on model size alone, but also on the compression method used. Additionally, we find that model compression does not always worsen the performance on minority subgroups. Altogether, our analysis serves to further research into the subgroup robustness of model compression.
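The notion of worst-group performance used in this abstract can be illustrated with a minimal sketch. It assumes that subgroups are formed by pairing each example's label with one attribute value; the toy data and function names below are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def group_accuracies(labels, attributes, predictions):
    """Return the accuracy of each (label, attribute) subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, a, p in zip(labels, attributes, predictions):
        group = (y, a)
        total[group] += 1
        correct[group] += int(p == y)
    return {g: correct[g] / total[g] for g in total}


def worst_group_accuracy(labels, attributes, predictions):
    """Return the lowest accuracy observed over all subgroups."""
    return min(group_accuracies(labels, attributes, predictions).values())


# Hypothetical toy data: the subgroup (label=1, attribute="b") is a minority.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
attributes = ["a", "a", "a", "b", "a", "b", "b", "b"]
predictions = [1, 1, 1, 0, 0, 0, 0, 1]

average_accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
worst_accuracy = worst_group_accuracy(labels, attributes, predictions)
```

On this toy data the average accuracy looks acceptable while one minority subgroup is misclassified entirely, which is the kind of gap a subgroup robustness analysis is meant to expose.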
How Important Is Tokenization in French Medical Masked Language Models?
Yanis Labrak, Adrien Bazoge, Beatrice Daille
et al.
Subword tokenization has become the prevailing standard in the field of natural language processing (NLP) over recent years, primarily due to the widespread utilization of pre-trained language models. This shift began with Byte-Pair Encoding (BPE) and was later followed by the adoption of SentencePiece and WordPiece. While subword tokenization consistently outperforms character and word-level tokenization, the precise factors contributing to its success remain unclear. Key aspects such as the optimal segmentation granularity for diverse tasks and languages, the influence of data sources on tokenizers, and the role of morphological information in Indo-European languages remain insufficiently explored. This is particularly pertinent for biomedical terminology, characterized by specific rules governing morpheme combinations. Despite the agglutinative nature of biomedical terminology, existing language models do not explicitly incorporate this knowledge, leading to inconsistent tokenization strategies for common terms. In this paper, we seek to delve into the complexities of subword tokenization in the French biomedical domain across a variety of NLP tasks and pinpoint areas where further enhancements can be made. We analyze classical tokenization algorithms, including BPE and SentencePiece, and introduce an original tokenization strategy that integrates morpheme-enriched word segmentation into existing tokenization methods.
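As background for the BPE algorithm mentioned in this abstract, the following is a minimal sketch of classic BPE merge learning on a toy word-frequency table. It is not the authors' morpheme-enriched strategy; the word list and frequencies are hypothetical, and end-of-word markers are omitted for brevity.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} table."""
    # Each word starts as a tuple of single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical frequencies for a few French medical terms.
word_freqs = {"cardiologie": 5, "cardiopathie": 3, "neurologie": 4, "neuropathie": 2}
merges = learn_bpe(word_freqs, num_merges=10)
pieces = segment("cardiologie", merges)
```

Because merges are chosen purely by frequency, the resulting pieces need not align with morphemes such as "cardio" or "logie", which is the kind of mismatch the abstract points to.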
Clinical information extraction for Low-resource languages with Few-shot learning using Pre-trained language models and Prompting
Phillip Richter-Pechanski, Philipp Wiesenbach, Dominic M. Schwab
et al.
Automatic extraction of medical information from clinical documents poses several challenges: high costs of required clinical expertise, limited interpretability of model predictions, restricted computational resources and privacy regulations. Recent advances in domain-adaptation and prompting methods showed promising results with minimal training data using lightweight masked language models, which are suited for well-established interpretability methods. We are the first to present a systematic evaluation of these methods in a low-resource setting, by performing multi-class section classification on German doctor's letters. We conduct extensive class-wise evaluations supported by Shapley values, to validate the quality of our small training data set and to ensure the interpretability of model predictions. We demonstrate that a lightweight, domain-adapted pretrained model, prompted with just 20 shots, outperforms a traditional classification model by 30.5% accuracy. Our results serve as a process-oriented guideline for clinical information extraction projects working in low-resource settings.
SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks
Kai-Wei Chang, Haibin Wu, Yu-Kai Wang
et al.
Prompting has become a practical method for utilizing pre-trained language models (LMs). This approach offers several advantages. It allows an LM to adapt to new tasks with minimal training and parameter updates, thus achieving efficiency in both storage and computation. Additionally, prompting modifies only the LM's inputs and harnesses the generative capabilities of language models to address various downstream tasks in a unified manner. This significantly reduces the need for human labor in designing task-specific models. These advantages become even more evident as the number of tasks served by the LM scales up. Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing. Recently, there has been a growing interest in converting speech into discrete units for language modeling. Our pioneering research demonstrates that these quantized speech units are highly versatile within our unified prompting framework. Not only can they serve as class labels, but they also contain rich phonetic information that can be re-synthesized back into speech signals for speech generation tasks. Specifically, we reformulate speech processing tasks into speech-to-unit generation tasks. As a result, we can seamlessly integrate tasks such as speech classification, sequence generation, and speech generation within a single, unified prompting framework. The experimental results show that the prompting method can achieve competitive performance compared to the strong fine-tuning method based on self-supervised learning models with a similar number of trainable parameters. The prompting method also shows promising results in the few-shot setting. Moreover, as more advanced speech LMs emerge, the proposed prompting framework holds great potential.
Fast Vocabulary Transfer for Language Model Compression
Leonidas Gee, Andrea Zugarini, Leonardo Rigutini
et al.
Real-world business applications require a trade-off between language model performance and size. We propose a new method for model compression that relies on vocabulary transfer. We evaluate the method on various vertical domains and downstream tasks. Our results indicate that vocabulary transfer can be effectively used in combination with other compression techniques, yielding a significant reduction in model size and inference time while marginally compromising on performance.
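The abstract does not describe how vocabulary transfer is carried out. One plausible reading, assumed here for illustration only, is that each token of a new in-domain vocabulary is initialized from the old model's embeddings: copied when the token already exists, otherwise averaged over the pieces produced by the old vocabulary. The greedy longest-match tokenizer, the vocabularies, and the vectors below are all hypothetical.

```python
def tokenize_greedy(text, vocab):
    """Hypothetical greedy longest-match tokenizer over a known vocabulary."""
    pieces = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text!r} at position {i}")
    return pieces


def transfer_embeddings(new_vocab, old_embeddings):
    """Initialize embeddings for `new_vocab` from `old_embeddings`.

    Assumption: tokens shared with the old vocabulary are copied; other
    tokens receive the mean of the old embeddings of their pieces.
    """
    new_embeddings = {}
    for token in new_vocab:
        if token in old_embeddings:
            new_embeddings[token] = list(old_embeddings[token])
            continue
        pieces = tokenize_greedy(token, old_embeddings)
        dim = len(old_embeddings[pieces[0]])
        new_embeddings[token] = [
            sum(old_embeddings[p][d] for p in pieces) / len(pieces)
            for d in range(dim)
        ]
    return new_embeddings


# Toy two-dimensional embeddings for an old general-purpose vocabulary.
old_embeddings = {"card": [1.0, 0.0], "io": [0.0, 1.0], "logy": [2.0, 2.0]}
new_embeddings = transfer_embeddings(["card", "cardio"], old_embeddings)
```

Under this assumption, a smaller in-domain vocabulary shortens token sequences and shrinks the embedding matrix, which would be consistent with the reductions in model size and inference time the abstract reports.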
THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models
Mengfei Liang, Archish Arun, Zekun Wu
et al.
Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models (LLMs). Existing detection and mitigation methods are often isolated and insufficient for domain-specific needs, lacking a standardized pipeline. This paper introduces THaMES (Tool for Hallucination Mitigations and EvaluationS), an integrated framework and library addressing this gap. THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs, featuring automated test set generation, multifaceted benchmarking, and adaptable mitigation strategies. It automates test set creation from any corpus, ensuring high data quality, diversity, and cost-efficiency through techniques like batch processing, weighted sampling, and counterfactual validation. THaMES assesses a model's ability to detect and reduce hallucinations across various tasks, including text generation and binary classification, applying optimal mitigation strategies like In-Context Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base of academic papers, political news, and Wikipedia reveal that commercial models like GPT-4o benefit more from RAG than ICL, while open-weight models like Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT significantly enhances the performance of Llama-3.1-8B-Instruct in both evaluation tasks.
Compounding as one of the productive ways of word formation of veterinary terms (based on the modern German language)
G. Khakimova
The research aims to determine the formal-structural and semantic-etymological features of compound terms in the field of veterinary epizootiology and veterinary parasitology in the German language. The scientific novelty of the research lies in identifying the most productive models of veterinary compounds in the German language. The structural analysis of complex terms revealed that they can be represented by compounds with a complex modifier, a complex head, or both a complex modifier and a complex head. The most productive word-formation models of veterinary compounds have been identified. An attempt has been made to classify compounds into groups with common specific features and to identify the most frequent components of compounds denoting generic features. From the perspective of etymology, it has been found that veterinary complex terms are predominantly hybrid formations composed of Greek-Latin elements and endogenous language material. As a result, it has been found that determinative compounds dominate, with the majority being two-word proper complex terminological units, in the veterinary terminology system in the German language.
Formal variation of the terms describing human infectious diseases (based on the material of the English language)
Kseniia Andreevna Khanina
The aim of the research is to construct a typology of formal variants of terminological units describing human infectious diseases. The author analyses the similarities and differences between the phenomena of synonymy and variation, presenting two opposite views on the attitude of linguists toward this problem. Then the phenomenon of formal variation is considered separately from synonymy. The scientific novelty of the research lies in a comprehensive approach to studying formal variant terms and constructing their typology. As a result of the study, the main reason for the variation of professional units of infectiology has been determined, i.e., the active process of borrowing of special vocabulary in English under the influence of Greek-Latin bases. The typology of variation of terms is represented by graphic, word-formation, syntactic, morphological-syntactic and phonetic variants. A separate group of formal variants is identified, including lexically related terms, where the replacement of word combination components is the result of semantic clarification, conceptual expansion or replacement of borrowed words. The preservation of functionality of the identified variants of terms in modern scientific literature has been established.
Inhaltsverzeichnis
Susanne Aretz
Greek language and literature. Latin language and literature, Philology. Linguistics
[Μαραζόπουλος, Πέτρος. Τα "Βαλκάνια" στη Νεοελληνική Κουλτούρα: όψεις της διαχείρισης ενός όρου]
Konstantinos Tsivos
History of Greece, Translating and interpreting
Kommentieren statt bloß übersetzen: Von der Rechenschaftsablage zur Literaturkritik
Thomas Kurth
Greek language and literature. Latin language and literature, Philology. Linguistics
Turkish Native Language Identification V2
Ahmet Yavuz Uluslu, Gerold Schneider
This paper presents the first application of Native Language Identification (NLI) for the Turkish language. NLI is the task of automatically identifying an individual's native language (L1) based on their writing or speech in a non-native language (L2). While most NLI research has focused on L2 English, our study extends this scope to L2 Turkish by analyzing a corpus of texts written by native speakers of Albanian, Arabic and Persian. We leverage a cleaned version of the Turkish Learner Corpus and demonstrate the effectiveness of syntactic features, comparing a structural Part-of-Speech n-gram model to a hybrid model that retains function words. Our models achieve promising results, and we analyze the most predictive features to reveal L1-specific transfer effects. We make our data and code publicly available for further study.
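The contrast between a structural Part-of-Speech n-gram model and a hybrid model that retains function words can be sketched as follows. The abstract does not give the exact feature definitions, so the tag set treated as function words and the hand-tagged example sentence are assumptions made for illustration.

```python
from collections import Counter

# Assumption: these Universal Dependencies tags are treated as function words.
FUNCTION_POS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ"}


def to_symbols(tagged, keep_function_words=False):
    """Map (token, pos) pairs to the symbols used for n-gram extraction.

    Structural variant: every token is replaced by its POS tag.
    Hybrid variant: function words keep their lowercased surface form.
    """
    symbols = []
    for token, pos in tagged:
        if keep_function_words and pos in FUNCTION_POS:
            symbols.append(token.lower())
        else:
            symbols.append(pos)
    return symbols


def ngram_counts(symbols, n):
    """Count all contiguous n-grams in a symbol sequence."""
    return Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))


# Hypothetical hand-tagged sentence: "Ben okula gittim ve kitap okudum".
tagged = [
    ("Ben", "PRON"), ("okula", "NOUN"), ("gittim", "VERB"),
    ("ve", "CCONJ"), ("kitap", "NOUN"), ("okudum", "VERB"),
]

structural = ngram_counts(to_symbols(tagged), n=2)
hybrid = ngram_counts(to_symbols(tagged, keep_function_words=True), n=2)
```

Both variants discard content words, so a classifier trained on these counts relies on syntactic patterns and, in the hybrid case, function-word usage rather than on topic.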
Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation
Kevin Krahn, Derrick Tate, Andrew C. Lamicela
Contextual language models have been trained on Classical languages, including Ancient Greek and Latin, for tasks such as lemmatization, morphological tagging, part of speech tagging, authorship attribution, and detection of scribal errors. However, high-quality sentence embedding models for these historical languages are significantly more difficult to achieve due to the lack of training data. In this work, we use a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text. The state-of-the-art sentence embedding approaches for high-resource languages use massive datasets, but our distillation approach allows our Ancient Greek models to inherit the properties of these models while using a relatively small amount of translated sentence data. We build a parallel sentence dataset using a sentence-embedding alignment method to align Ancient Greek documents with English translations, and use this dataset to train our models. We evaluate our models on translation search, semantic similarity, and semantic retrieval tasks and investigate translation bias. We make our training and evaluation datasets freely available at https://github.com/kevinkrahn/ancient-greek-datasets .
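The abstract does not state the training objective. A common formulation of multilingual knowledge distillation, assumed here, trains a student so that its embeddings of both an English sentence and its Ancient Greek counterpart match a fixed teacher's embedding of the English sentence under a mean squared error loss. The vectors below are hypothetical toy values, not model outputs.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distillation_loss(teacher_en, student_en, student_grc):
    """Loss for one aligned sentence pair (assumed formulation).

    teacher_en:  teacher embedding of the English sentence (target, fixed)
    student_en:  student embedding of the same English sentence
    student_grc: student embedding of the aligned Ancient Greek sentence
    """
    return mse(student_en, teacher_en) + mse(student_grc, teacher_en)


# Toy two-dimensional embeddings for one aligned sentence pair.
teacher_en = [1.0, 0.0]
student_en = [1.0, 0.0]   # already matches the teacher
student_grc = [0.0, 0.0]  # not yet aligned with the teacher space

loss = distillation_loss(teacher_en, student_en, student_grc)
```

Minimizing this quantity over a parallel corpus pulls Ancient Greek sentences toward the teacher's English embedding space, which is how a student could inherit the teacher's properties from a relatively small amount of translated data.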
A Comprehensive Review of State-of-The-Art Methods for Java Code Generation from Natural Language Text
Jessica López Espejel, Mahaman Sanoussi Yahaya Alassan, El Mehdi Chouham
et al.
Java Code Generation consists of automatically generating Java code from natural language text. This NLP task helps increase programmers' productivity by providing them with immediate solutions to the simplest and most repetitive tasks. Code generation is a challenging task because of the strict syntactic rules and the need for a deep understanding of the semantics of the programming language. Many works have tried to tackle this task using either RNN-based or Transformer-based models. The latter have achieved remarkable advances in the domain and can be divided into three groups: (1) encoder-only models, (2) decoder-only models, and (3) encoder-decoder models. In this paper, we provide a comprehensive review of the evolution and progress of deep learning models in the Java code generation task. We focus on the most important methods and present their merits and limitations, as well as the objective functions used by the community. In addition, we provide a detailed description of datasets and evaluation metrics used in the literature. Finally, we discuss the results of different models on the CONCODE dataset, then propose some future directions.
Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation
Anastasia Kritharoula, Maria Lymperaiou, Giorgos Stamou
Visual Word Sense Disambiguation (VWSD) is a novel, challenging task whose goal is to retrieve, from a set of candidates, the image that best represents the meaning of an ambiguous word within a given context. In this paper, we make a substantial step towards unveiling this interesting task by applying a varied set of approaches. Since VWSD is primarily a text-image retrieval task, we explore the latest transformer-based methods for multimodal retrieval. Additionally, we utilize Large Language Models (LLMs) as knowledge bases to enhance the given phrases and resolve ambiguity related to the target word. We also study VWSD as a unimodal problem by converting it to text-to-text and image-to-image retrieval, as well as question-answering (QA), to fully explore the capabilities of relevant models. To tap into the implicit knowledge of LLMs, we experiment with Chain-of-Thought (CoT) prompting to guide explainable answer generation. On top of that, we train a learning-to-rank (LTR) model to combine our different modules, achieving competitive ranking results. Extensive experiments on VWSD yield valuable insights to effectively drive future directions.