Results for "Greek philology and language"

Showing 20 of ~1,456,738 results · from CrossRef, DOAJ, arXiv, Semantic Scholar

arXiv Open Access 2026
Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages

Shaharukh Khan, Ali Faraz, Abhinav Ravi et al.

Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are trained primarily on English datasets, leading to inadequate representation of Indian languages. To address this gap, we introduce the Chitrakshara dataset series, covering 11 Indian languages sourced from Common Crawl. It comprises (1) Chitrakshara-IL, a large-scale interleaved pretraining dataset with 193M images, 30B text tokens, and 50M multilingual documents, and (2) Chitrakshara-Cap, which includes 44M image-text pairs with 733M tokens. This paper details the data collection pipeline, including curation, filtering, and processing methodologies. Additionally, we present a comprehensive quality and diversity analysis to assess the dataset's representativeness across Indic languages and its potential for developing more culturally inclusive VLMs.

en cs.CL, cs.AI
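
The abstract describes curation and filtering only at a high level; as a purely illustrative sketch of what a document-level filter in such a pipeline can look like (the language codes, thresholds, and langid callable below are assumptions, not the paper's parameters):

```python
# Hypothetical document-level filter for an interleaved image-text corpus.
# Language set, thresholds, and the langid callable are illustrative
# assumptions, not the actual Chitrakshara pipeline parameters.

INDIC_LANGS = {"as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"}

def keep_document(doc: dict, langid) -> bool:
    """doc = {"text": str, "image_urls": list[str]};
    langid: callable mapping text -> ISO 639-1 code (e.g., a fastText LID model)."""
    if langid(doc["text"]) not in INDIC_LANGS:
        return False                              # keep only the target languages
    n_tokens = len(doc["text"].split())
    if not (50 <= n_tokens <= 10_000):            # drop near-empty and runaway pages
        return False
    return 1 <= len(doc["image_urls"]) <= 30      # require genuinely interleaved docs
```
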
arXiv Open Access 2026
Merge and Conquer: Instructing Multilingual Models by Adding Target Language Weights

Eneko Valero, Maria Ribalta i Albado, Oscar Sainz et al.

Large Language Models (LLMs) remain heavily centered on English, with limited performance in low-resource languages. Existing adaptation approaches, such as continual pre-training, demand significant computational resources. In the case of instructed models, high-quality instruction data is also required; both are often inaccessible to low-resource language communities. Under these constraints, model merging offers a lightweight alternative, but its potential in low-resource contexts has not been systematically explored. In this work, we explore whether it is possible to transfer language knowledge to an instruction-tuned LLM by merging it with a language-specific base model, thereby eliminating the need for language-specific instructions and for repeated fine-tuning whenever stronger instructed variants become available. Through experiments covering four Iberian languages (Basque, Catalan, Galician, and Spanish) and two model families, we show that merging enables effective instruction-following behavior in new languages and even supports multilingual capability through the combination of multiple language-specific models. Our results indicate that model merging is a viable and efficient alternative to traditional adaptation methods for low-resource languages, achieving competitive performance while greatly reducing computational cost.

en cs.CL, cs.AI
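
The paper's exact merging recipe is not given in the abstract; a minimal sketch of the task-arithmetic idea it builds on (add the weight delta between a language-adapted base model and its original base to an instruction-tuned checkpoint) might look like this, assuming all three checkpoints share one architecture:

```python
def merge_language_weights(instruct_sd, base_sd, lang_sd, alpha=1.0):
    """Task-arithmetic style merge (a sketch, not the paper's exact recipe):
    instruct + alpha * (language-adapted base - original base).
    All three state dicts (name -> tensor) must share the same architecture."""
    return {
        name: instruct_sd[name] + alpha * (lang_sd[name] - base_sd[name])
        for name in instruct_sd
    }

# Usage with hypothetical checkpoints:
# merged = merge_language_weights(instruct_model.state_dict(),
#                                 base_model.state_dict(),
#                                 basque_base_model.state_dict())
# instruct_model.load_state_dict(merged)
```
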
DOAJ Open Access 2025
Officina di IG XIV2 – Nuove iscrizioni greche di età imperiale da Puteoli

Marcello Gelone

The paper reports on four new Greek inscriptions from Puteoli, found between 2000 and 2007/08: a bilingual text and three Greek ones. The first three are epitaphs and, although short, are interesting for their onomastics and use of funerary formulas. Only a small portion of the fourth inscription is preserved, but its provenance from the stadium of Puteoli and the text reconstruction hypothesised here make it highly probable that it belongs to the typology of agonistic inscriptions: it cannot be excluded that it is a fragment of an agonistic catalogue of the Eusebeia, a typology not yet attested for the Greek-style agones founded by Antoninus Pius in memory of his adoptive father Hadrian.

Ancient history, Greek philology and language
DOAJ Open Access 2025
Vorwort

Susanne Aretz, Johannes Maximilian Nießen

Greek language and literature. Latin language and literature, Philology. Linguistics
arXiv Open Access 2025
Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models

Chengzhi Zhong, Fei Cheng, Qianying Liu et al.

Large language models exhibit strong multilingual capabilities despite limited exposure to non-English data. Prior studies show that English-centric large language models map multilingual content into English-aligned representations at intermediate layers and then project them back into target-language token spaces in the final layer. From this observation, we hypothesize that this cross-lingual transition is governed by a small and sparse set of dimensions, which occur at consistent indices across the intermediate to final layers. Building on this insight, we introduce a simple, training-free method to identify and manipulate these dimensions, requiring as few as 50 sentences of either parallel or monolingual data. Experiments on a multilingual generation control task reveal the interpretability of these dimensions, demonstrating that interventions in these dimensions can switch the output language while preserving semantic content, and that our method surpasses the performance of prior neuron-based approaches at a substantially lower cost.

en cs.CL, cs.AI
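
As a rough illustration of the idea, not the authors' implementation: select the dimensions whose mean activation differs most between two languages over a handful of sentences, then nudge them at inference time with a forward hook. All names below are hypothetical.

```python
import torch

@torch.no_grad()
def find_language_dims(hidden_a, hidden_b, k=32):
    """Pick the k dimensions whose mean activation differs most between two
    languages. hidden_a / hidden_b: [n_sentences, d_model] pooled hidden
    states (as few as ~50 sentences per language). Sketch of the idea only."""
    delta = hidden_a.mean(dim=0) - hidden_b.mean(dim=0)
    dims = delta.abs().topk(k).indices
    return dims, delta[dims]

def make_steering_hook(dims, values, scale=1.0):
    """Forward hook nudging the chosen dimensions from language A toward
    language B (assumes all tensors live on the same device)."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h[..., dims] -= scale * values.to(h.dtype)   # in-place intervention
        return output
    return hook

# Usage (hypothetical): decoder_layer.register_forward_hook(
#     make_steering_hook(dims, values))
```
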
arXiv Open Access 2025
Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages

Xabier de Zuazo, Eva Navas, Ibon Saratxaga et al.

Automatic speech recognition systems have undoubtedly advanced with the integration of multilingual and multitask models such as Whisper, which have shown a promising ability to understand and process speech across a wide range of languages. Despite their robustness, these models often fall short in handling the linguistic distinctions of minority languages. This study addresses this gap by integrating traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages. Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. Our approach not only takes advantage of the extensive data Whisper was pre-trained on, but also complements its linguistic adaptability by incorporating language models. We obtained improvements of up to 51% for in-distribution datasets and up to 34% for out-of-distribution sentences using statistical language models, while large language models provided moderate but consistently robust improvements across diverse linguistic contexts. The findings reveal that, while the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters. Finally, we emphasize the importance of selecting appropriate evaluation parameters when reporting results using transformer-based ASR models. In summary, this research paves the way for more inclusive ASR technologies that perform better across languages by enriching their linguistic knowledge. For further implementation details of this study, the technical documentation and source code are available at http://www.github.com/hitz-zentroa/whisper-lm.

en cs.CL
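
The authors' actual integration is in the linked repository; generic shallow-fusion rescoring of an n-best list, the standard way to combine an ASR model with an external language model, can be sketched as follows (`lm_logprob` stands in for an n-gram or LLM scorer):

```python
def rescore_nbest(hypotheses, lm_logprob, alpha=0.5, beta=0.0):
    """Shallow-fusion style rescoring of an ASR n-best list (generic sketch;
    see the authors' repository for their actual integration).
    hypotheses: list of (text, asr_logprob) pairs from beam search.
    lm_logprob: callable mapping text -> language-model log-probability.
    alpha: LM weight; beta: per-word bonus offsetting the LM length penalty."""
    def fused_score(hyp):
        text, asr_lp = hyp
        return asr_lp + alpha * lm_logprob(text) + beta * len(text.split())
    return max(hypotheses, key=fused_score)
```
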
arXiv Open Access 2025
Natural Language-based Assessment of L2 Oral Proficiency using LLMs

Stefano Bannò, Rao Ma, Mengjie Qian et al.

Natural language-based assessment (NLA) is an approach to second language assessment that uses instructions, expressed in the form of can-do descriptors and originally intended for human examiners, to determine whether large language models (LLMs) can interpret and apply them in ways comparable to human assessment. In this work, we explore the use of such descriptors with an open-source LLM, Qwen 2.5 72B, to assess responses from the publicly available S&I Corpus in a zero-shot setting. Our results show that this approach - relying solely on textual information - achieves competitive performance: while it does not outperform state-of-the-art speech LLMs fine-tuned for the task, it surpasses a BERT-based model trained specifically for this purpose. NLA proves particularly effective in mismatched task settings, is generalisable to other data types and languages, and offers greater interpretability, as it is grounded in clearly explainable, widely applicable language descriptors.

en eess.AS, cs.AI
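
In spirit, the zero-shot setup reduces to prompting the LLM with a can-do descriptor and the learner's transcribed response. The descriptor wording and prompt format below are illustrative assumptions, not the paper's materials:

```python
# Illustrative zero-shot NLA prompt; the descriptor text and output format
# are assumptions, not the paper's exact materials.
CAN_DO = "Can give clear, detailed descriptions on a wide range of subjects."

def build_nla_prompt(transcript: str) -> str:
    return (
        "You are assessing a second-language speaker's oral response.\n"
        f"Descriptor: {CAN_DO}\n"
        f"Learner response (transcribed): {transcript}\n"
        "Does the response meet the descriptor? Answer yes or no, "
        "then give a one-sentence justification."
    )
```
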
arXiv Open Access 2025
Language Games as the Pathway to Artificial Superhuman Intelligence

Ying Wen, Ziyu Wan, Shao Zhang

The evolution of large language models (LLMs) toward artificial superhuman intelligence (ASI) hinges on data reproduction, a cyclical process in which models generate, curate and retrain on novel data to refine capabilities. Current methods, however, risk getting stuck in a data reproduction trap: optimizing outputs within fixed human-generated distributions in a closed loop leads to stagnation, as models merely recombine existing knowledge rather than explore new frontiers. In this paper, we propose language games as a pathway to expanded data reproduction, breaking this cycle through three mechanisms: (1) role fluidity, which enhances data diversity and coverage by enabling multi-agent systems to dynamically shift roles across tasks; (2) reward variety, embedding multiple feedback criteria that can drive complex intelligent behaviors; and (3) rule plasticity, iteratively evolving interaction constraints to foster learnability, thereby injecting continual novelty. By scaling language games into global sociotechnical ecosystems, human-AI co-evolution generates unbounded data streams that drive open-ended exploration. This framework redefines data reproduction not as a closed loop but as an engine for superhuman intelligence.

en cs.AI, cs.CL
arXiv Open Access 2025
Self-Organizing Language

P. Myles Eugenio, Anthony Beavers

We introduce a novel paradigm of emergent local memory. It is a continuous-learning, completely parallel, content-addressable memory encoding global order. It demonstrates how local constraints on uncoordinated learning can produce topologically protected memories realizing emergent symbolic order. It is therefore a neuro-symbolic bridge. It further has the ability to produce human language without data, by exploiting its own self-organizing dynamics. It teaches us that words arise as a side-effect of emergent symbolic order, and that human language patterns at all structural levels reflect a universal mechanism of word formation (which is subregular). This work answers essential questions about the existence & origin of all the human language data.

en cs.CL, cs.AI
arXiv Open Access 2025
Towards Typologically Aware Rescoring to Mitigate Unfaithfulness in Lower-Resource Languages

Tsan Tsai Chan, Xin Tong, Thi Thu Uyen Hoang et al.

Multilingual large language models (LLMs) are known to more frequently generate non-faithful output in resource-constrained languages (Guerreiro et al., 2023 - arXiv:2303.16104), potentially because these typologically diverse languages are underrepresented in their training data. To mitigate unfaithfulness in such settings, we propose using computationally light auxiliary models to rescore the outputs of larger architectures. As proof of the feasibility of such an approach, we show that monolingual 4-layer BERT models pretrained from scratch on less than 700 MB of data without fine-tuning are able to identify faithful summaries with a mean accuracy of 88.33% in three genetically unrelated languages that differ in their morphological complexity - Vietnamese, Polish and Georgian. The same hyperparameter combination moreover generalises well to three other tasks, suggesting applications for rescoring beyond improving faithfulness. In order to inform typologically aware model selection, we also investigate how morphological complexity interacts with regularisation, model depth and training objectives, ultimately demonstrating that morphologically complex languages are more likely to benefit from dropout, while across languages downstream performance is enhanced most by shallow architectures as well as training using the standard BERT objectives.

en cs.CL
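
The abstract does not spell out how a pretrained-only BERT scores candidates; one plausible instantiation is pseudo-log-likelihood rescoring (Salazar et al., 2020), sketched below with Hugging Face classes. Treat it as an assumption, not the paper's scorer.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

@torch.no_grad()
def pseudo_log_likelihood(model, tokenizer, text):
    """Score text with a masked LM by masking one position at a time
    (pseudo-log-likelihood; one plausible way a pretrained-only BERT
    could rank candidates, not necessarily the paper's scorer)."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, ids.size(0) - 1):            # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Usage: rank candidate outputs of the large model with the small one.
# tok = AutoTokenizer.from_pretrained("path/to/small-bert")   # hypothetical path
# mlm = AutoModelForMaskedLM.from_pretrained("path/to/small-bert").eval()
# best = max(candidates, key=lambda c: pseudo_log_likelihood(mlm, tok, c))
```
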
S2 Open Access 2025
Moral-ascetic prose of Paisius Velichkovsky: ideological foundations, specificity of argumentation, language and style

A. N. Smolina

The moral-ascetic work of the church writer Paisius Velichkovsky titled "On Intelligent or Internal Prayer" is being explored in the context of religious-Christian culture. The subject of the research is the ideological content of the moral-ascetic texts of Paisius Velichkovsky, the peculiarities of their argumentative components, and their linguistic-stylistic specifics. The aim of this work is to demonstrate the ideological, artistic, and argumentative features of the author’s spiritual prose—texts that focus on the study of the process of spiritual improvement, achieving the Christian ideal, and the struggle of a person against their vices. The relevance of the work is determined by its foundation in the field of theolinguistics and stylistics—directions of modern philology that have been actively developing for several decades—as well as the attention of contemporary linguists to the axiological component of literary works. The main methods of research include the theolinguistic approach, which studies the relationship between language and religion, and the linguistic-stylistic approach, aimed at establishing patterns of language unit functioning influenced by extralinguistic factors. It has been established that the author addresses issues related to the spiritual and moral improvement of a person, achieving internal harmony, and a conscious desire to control one’s inner life at the level of thoughts in the pursuit of salvation, which correlates with the ideological premises of the religious-philosophical teachings of hesychasm. It is emphasized that the ideological foundations of the moral-ascetic prose of Paisius Velichkovsky derive from the hesychastic teachings that were affirmed in the works of the Church Fathers, including those that the author translated from Greek to Russian. It is noted that the ideological foundations linked to hesychasm set the direction for the content of the analyzed texts, determining the development of ideas of humility and self-abasement, represented through kenotic units; ideas of the intelligent practice, embodied through the use of language units that serve the subject area of hesychasm; ideas of salvation, asserted through soteriological vocabulary; ideas of sobriety, realized through theological-ascetic terms, mainly from the thematic group "nepsis"; ideas of silence, whose linguistic representatives become the terms of hesychasm and analogous tropes. It is shown that the author employs arguments in the structure of “model – anti-model,” testimony, and norm; the linguistic characteristics of the texts include the use of theological-ascetic vocabulary, allegories, and metaphors typical for church-religious texts; a feature of the author’s style is adherence to the traditions of monastic literature.

S2 Open Access 2024
Theology and Philology in Biblical Translation according to Franciscus Titelmans (1502–1537) on the Example of the Book of Job

Tomasz Karol Mantyk

Franciscus Titelmans (1502–1537) is known to scholars of biblical humanism as a critic of Erasmus' Novum Instrumentum. His own biblical works have, however, been little studied. This article discusses his approach to humanist methodology using the example of his commentary on the Book of Job. It demonstrates that he used the original languages of the Bible to correct copyists' mistakes and elucidate the meaning of ambiguous passages. Nevertheless, the application of philological tools had strict limits for him: the Church's dogma. The Vulgate could be corrected where no doctrinal issue was at stake, but whenever an important proof text of the Vulgate differed from the Hebrew and Greek versions, the ancient Latin translation was to be preferred. Titelmans went as far as to claim that it was the Vulgate which fully expressed the true meaning of the Word of God, a meaning that the Hebrew and Greek texts contained only in a hidden form.

DOAJ Open Access 2024
Translation of Ancient Works During the First Two Years of the Full-scale Russian Invasion of Ukraine

Svyatoslav Zubchenko

The article attempts to systematize all Ukrainian-language translations from Latin and Ancient Greek published by Ukrainian publishing houses during the first two years of the full-scale war of the Russian Federation against Ukraine (from February 24, 2022, to February 24, 2024). The proposed research allows for an understanding of the current state of the book market and the place of translations of ancient classics within it and, in the long term, makes it possible to clarify the impact of the existential war on publishing and cultural processes in Ukraine, to forecast potential problems and challenges that this sector will face, and to outline possible ways to overcome them. The aim of the article is to systematize and record trends in the field of translation from classical languages during the first two years of the full-scale aggression of the Russian Federation against Ukraine. This aim determined a number of specific tasks, whose essence and sequence follow the logic of the research: 1) to assess the state of the publishing market in the first and second years of the full-scale invasion; 2) to analyze the reading tendencies of Ukrainian citizens (number of books read per year, place of purchase of new books, language of reading, most popular genres); 3) to form a picture of the Ukrainian translator of this period (problems of relocation, income, the most popular languages for translation); 4) to systematize the translations from Latin and Ancient Greek published in 2022–2024, the publishers that handled them, and the cohort of translators who worked with classical material. The article identifies the most productive translators, collects information about publishing houses that dealt with ancient material during this time, and outlines the problems and trends in the literary and translation field in Ukraine in accordance with the most relevant studies and surveys. In particular, it was found that between 2022 and 2024, publishing houses reissued more old translations (sometimes with changes and additions) than published first-time translations of classical works. A trend towards a decrease in the number of active publishers of ancient classics (predominantly concentrated in western Ukraine) was also highlighted, along with the persistence of the same list of active translators from Ancient Greek and Latin (among whom A.O. Sodomora holds a prominent place). The article harmonizes with previous research efforts in the fields of bibliography, translation history, and classical philology, continuing, supplementing, and to some extent clarifying them.

Philology. Linguistics
arXiv Open Access 2024
Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing

Tianchi Liu, Ivan Kukanov, Zihan Pan et al.

The effects of language mismatch impact speech anti-spoofing systems, yet investigations and quantification of these effects remain limited. Existing anti-spoofing datasets are mainly in English, and the high cost of acquiring multilingual datasets hinders training language-independent models. We initiate this work by evaluating top-performing speech anti-spoofing systems that are trained on English data but tested on other languages, observing notable performance declines. We propose an innovative approach - Accent-based data expansion via TTS (ACCENT) - which introduces diverse linguistic knowledge into monolingual-trained models, improving their cross-lingual capabilities. We conduct experiments on a large-scale dataset consisting of over 3 million samples, including 1.8 million training samples and nearly 1.2 million testing samples across 12 languages. The language mismatch effects are preliminarily quantified and then remarkably reduced by over 15% by applying the proposed ACCENT. This easily implementable method shows promise for multilingual and low-resource language scenarios.

en eess.AS, cs.AI
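
At a high level, ACCENT expands the spoofed-class training data with TTS output in many accents; a minimal sketch of that expansion loop, with `synthesize` as a hypothetical TTS interface:

```python
# Sketch of accent-based data expansion via TTS; synthesize() is a
# hypothetical interface and the accent list is illustrative, not the
# authors' actual tooling or accent inventory.
ACCENTS = ["ar", "de", "es", "fr", "hi", "zh"]

def expand_spoofed_set(texts, synthesize):
    """Generate accented synthetic speech as extra spoofed-class training
    data for an anti-spoofing model trained only on English TTS."""
    samples = []
    for text in texts:
        for accent in ACCENTS:
            wav = synthesize(text, accent=accent)   # hypothetical TTS call
            samples.append({"audio": wav, "label": "spoof", "accent": accent})
    return samples
```
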
S2 Open Access 2024
The Sixth Generation of the Perseus Digital Library and a Workflow for Open Philology - DRAFT

Gregory R. Crane, James Tauber, Alison Babeu et al.

We report here on the workflow that we needed to develop in order to integrate the growing range of openly licensed, born-digital and, increasingly, machine-actionable publications. Our development work focused on textual data for Ancient Greek, Latin, Old English, Classical Arabic, and Classical Persian, but the challenges that we have had to address are relevant to sources in a wide range of languages, ancient and modern. Perseus 6 was designed to be a publishing workflow that organizes complementary data into an integrated reading environment. This document focuses on the ways in which we have organized the data and describes the current state of the ATLAS (Aligned Text and Linguistic Annotation Server) architecture. While this is the sixth version of the Perseus Digital Library, Perseus 6 represents a major step beyond its predecessors. Where Perseus 5 (described below) can represent and integrate digital versions of print editions (e.g., critical editions with interactive textual notes, links to lexicon and commentary entries), Perseus 6 was designed to bring together an expandable range of born-digital classes of annotation. An ATLAS server with some initial functionality is now online, and public services will expand during the rest of 2024. Most of the ATLAS data is, however, already available on GitHub, and that data will be the focus of this paper in its current version.

en Computer Science
DOAJ Open Access 2023
Crónicas

Guillermina Bogdan, Julia Bisignano, Chiara Grimozzi et al.

Philology. Linguistics, Greek language and literature. Latin language and literature
arXiv Open Access 2023
Learning to Plan with Natural Language

Yiduo Guo, Yaobo Liang, Chenfei Wu et al.

Large Language Models (LLMs) have shown remarkable performance in various basic natural language tasks. To complete complex tasks, we still need a task plan to guide LLMs in generating specific solutions step by step. LLMs can directly generate task plans, but these plans may still contain factual errors or be incomplete. A high-quality task plan contains correct step-by-step solutions for all situations and behavioral instructions for avoiding mistakes. To obtain one, we propose the Learning to Plan method, which involves two phases: (1) in the first, plan-learning phase, it iteratively updates the task plan with new step-by-step solutions and behavioral instructions, obtained by prompting LLMs to derive them from training error feedback; (2) in the subsequent test phase, the LLM uses the learned task plan to guide its inference on the test set. We demonstrate the effectiveness of our method on five different reasoning-type tasks (8 datasets). Further, our analysis shows that a task plan learned by one LLM can directly guide another LLM to improve its performance, which reveals a new transfer learning paradigm. We release the code at https://github.com/Eureka6174/LearnNLPlan

en cs.CL
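
The two-phase loop in the abstract can be sketched as follows; `llm` is a hypothetical text-in/text-out wrapper, and the prompts and error check are illustrative, not the paper's exact ones:

```python
def learn_task_plan(llm, train_set, plan="", epochs=3):
    """Phase 1 of Learning to Plan (sketch): refine the plan from training
    errors. llm is a hypothetical text-in/text-out wrapper; the prompts and
    the substring error check are illustrative, not the paper's exact ones."""
    for _ in range(epochs):
        for question, answer in train_set:
            pred = llm(f"Plan:\n{plan}\n\nQuestion: {question}\n"
                       "Answer step by step.")
            if answer not in pred:                 # crude correctness check
                plan = llm(
                    f"Current plan:\n{plan}\n\n"
                    f"The plan produced a wrong answer for: {question}\n"
                    f"Model output: {pred}\nCorrect answer: {answer}\n"
                    "Revise the plan: add a corrected step-by-step solution "
                    "or a behavioral instruction to avoid this mistake."
                )
    return plan

# Phase 2: prepend the learned plan at test time.
# answer = llm(f"Plan:\n{plan}\n\nQuestion: {test_q}\nAnswer step by step.")
```
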
arXiv Open Access 2023
PyThaiNLP: Thai Natural Language Processing in Python

Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas et al.

We present PyThaiNLP, a free and open-source natural language processing (NLP) library for the Thai language implemented in Python. It provides a wide range of software, models, and datasets for Thai. We first provide a brief historical context of tools for Thai prior to the development of PyThaiNLP. We then outline the functionalities it provides, as well as its datasets and pre-trained language models. We later summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.
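
A minimal usage example (word segmentation, the core task for Thai, which is written without spaces between words; `word_tokenize` with its default "newmm" engine is a documented PyThaiNLP entry point):

```python
# Minimal PyThaiNLP example: segment a Thai sentence into word tokens.
from pythainlp.tokenize import word_tokenize

print(word_tokenize("ผมรักภาษาไทย"))  # prints a list of Thai word tokens
```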

Page 9 of 72837