arXiv Open Access 2025

Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets

Eduard Barbu, Meeri-Ly Muru, Sten Marcus Malva

Abstract

This paper presents a method for text simplification based on two neural architectures: a neural machine translation (NMT) model and a fine-tuned large language model (LLaMA). Given the scarcity of existing resources for Estonian, a new dataset was created by combining manually translated corpora with GPT-4.0-generated simplifications. OpenNMT was selected as a representative NMT-based system, while LLaMA was fine-tuned on the constructed dataset. Evaluation shows LLaMA outperforms OpenNMT in grammaticality, readability, and meaning preservation. These results underscore the effectiveness of large language models for text simplification in low-resource language settings. The complete dataset, fine-tuning scripts, and evaluation pipeline are provided in a publicly accessible supplementary package to support reproducibility and adaptation to other languages.
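As a rough illustration of the data-preparation step described above (not the authors' actual scripts, which are in the supplementary package), parallel complex→simple sentence pairs are commonly serialized into an instruction-tuning format before fine-tuning a model such as LLaMA. The prompt wording and field names below are assumptions for the sketch:

```python
import json

def to_instruction_format(pairs):
    """Convert (complex, simple) Estonian sentence pairs into
    instruction-tuning records. The field names and prompt text
    are illustrative, not taken from the paper's released scripts."""
    records = []
    for complex_text, simple_text in pairs:
        records.append({
            "instruction": "Simplify the following Estonian sentence.",
            "input": complex_text,
            "output": simple_text,
        })
    return records

# Hypothetical example pair; a real dataset would combine the manually
# translated corpora with the GPT-4.0-generated simplifications.
pairs = [("Keeruline lause.", "Lihtne lause.")]
jsonl = "\n".join(json.dumps(r, ensure_ascii=False)
                  for r in to_instruction_format(pairs))
print(jsonl)
```

One record per line (JSONL) is a common choice here because most fine-tuning toolchains stream training examples line by line.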


Authors (3)

Eduard Barbu
Meeri-Ly Muru
Sten Marcus Malva

Citation

Barbu, E., Muru, M.-L., & Malva, S. M. (2025). Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets. https://arxiv.org/abs/2501.15624

Publication Information
Year Published: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓