arXiv Open Access 2025

BERnaT: Basque Encoders for Representing Natural Textual Diversity

Ekhi Azurmendi, Joseba Fernandez de Landa, Jaione Bengoetxea, Maite Heredia, Julen Etxaniz, +3 others

Abstract

Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on the Basque language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.

Topics & Keywords

cs.CL cs.AI

Authors (8)

Ekhi Azurmendi

Joseba Fernandez de Landa

Jaione Bengoetxea

Maite Heredia

Julen Etxaniz

Mikel Zubillaga

Ander Soraluze

Aitor Soroa

Citation Format

Azurmendi, E., Fernandez de Landa, J., Bengoetxea, J., Heredia, M., Etxaniz, J., Zubillaga, M., et al. (2025). BERnaT: Basque Encoders for Representing Natural Textual Diversity. https://arxiv.org/abs/2512.03903

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓