DOAJ Open Access 2025

An Analysis of the Training Data Impact for Domain-Adapted Tokenizer Performances—The Case of Serbian Legal Domain Adaptation

Miloš Bogdanović Milena Frtunić Gligorijević Jelena Kocić Leonid Stoimenov

Abstract

Various areas of natural language processing (NLP) have greatly benefited from the development of large language models in recent years. This research addresses the challenge of developing efficient tokenizers for transformer-based domain-specific language models. Tokenization efficiency within transformer-based models is directly related to model efficiency, which motivated the research we present in this paper. Our goal in this research was to demonstrate that the appropriate selection of data used for tokenizer training has a significant impact on tokenizer performance. Subsequently, we demonstrate that efficient tokenizers and models can be developed even when language resources are limited. To do so, we present a domain-adapted large language model tokenizer developed for masked language modeling of the Serbian legal domain. In this paper, we compare the tokenization performance of the domain-adapted tokenizer in version 2 of our SrBERTa language model against five other tokenizers belonging to state-of-the-art multilingual, Slavic, or Serbian-specific models: XLM-RoBERTa (base-sized), BERTić, Jerteh-81, SrBERTa v1, and NER4Legal_SRB. The comparison is performed on a test dataset of 275,660 samples of legal texts written in the Cyrillic alphabet, gathered from the Official Gazette of the Republic of Serbia. This dataset contains 197,134 distinct words, with an overall word count of 5,265,352. We show that our tokenizer, trained on a domain-adapted dataset, outperforms the compared tokenizers by margins ranging from 4.5% to 54.62% in the number of tokens generated for the whole test dataset. In terms of tokenizer fertility, our tokenizer outperforms the compared tokenizers by margins ranging from 6.39% to 56.8%.
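The fertility metric used in the comparison above is conventionally defined as the average number of subword tokens produced per word. A minimal sketch of that computation is shown below; the `toy_tokenize` function is a hypothetical stand-in, since the actual evaluation would use each model's real tokenizer (e.g., via the Hugging Face `transformers` library).

```python
def fertility(tokenize, corpus):
    """Tokenizer fertility: average number of subword tokens per word.

    `tokenize` maps a single word to a list of subword tokens;
    `corpus` is an iterable of whitespace-separable text samples.
    """
    total_tokens = 0
    total_words = 0
    for text in corpus:
        words = text.split()
        total_words += len(words)
        for word in words:
            total_tokens += len(tokenize(word))
    return total_tokens / total_words

# Hypothetical stand-in tokenizer: splits each word into chunks of
# at most 3 characters (a real subword tokenizer learns its splits).
def toy_tokenize(word):
    return [word[i:i + 3] for i in range(0, len(word), 3)]

corpus = ["службени гласник републике србије"]
print(fertility(toy_tokenize, corpus))  # lower values indicate a more efficient tokenizer
```

A fertility of 1.0 would mean every word maps to a single token; domain-adapted tokenizers lower fertility on in-domain text by keeping frequent domain terms intact as single vocabulary entries.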

Authors (4)

Miloš Bogdanović

Milena Frtunić Gligorijević

Jelena Kocić

Leonid Stoimenov

Citation Format

Bogdanović, M., Frtunić Gligorijević, M., Kocić, J., & Stoimenov, L. (2025). An Analysis of the Training Data Impact for Domain-Adapted Tokenizer Performances—The Case of Serbian Legal Domain Adaptation. Applied Sciences, 15(13), 7491. https://doi.org/10.3390/app15137491

Quick Access

PDF not directly available

View at source: doi.org/10.3390/app15137491
Journal Information
Publication Year
2025
Database Source
DOAJ
DOI
10.3390/app15137491
Access
Open Access ✓