arXiv Open Access 2022

Dealing with Abbreviations in the Slovenian Biographical Lexicon

Angel Daza, Antske Fokkens, Tomaž Erjavec


Abstract

Abbreviations present a significant challenge for NLP systems because they cause tokenization and out-of-vocabulary errors. They can also make text less readable, especially in printed reference books, where they are used extensively. Abbreviations are especially problematic in low-resource settings, where systems are less robust to begin with. In this paper, we propose a new method for addressing the problems caused by a high density of domain-specific abbreviations in a text. We apply this method to the case of a Slovenian biographical lexicon and evaluate it on a newly developed gold-standard dataset of 51 Slovenian biographies. Our abbreviation identification method performs significantly better than commonly used ad-hoc solutions, especially at identifying unseen abbreviations. We also propose and present the results of a method for expanding the identified abbreviations in context.
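To make the tokenization problem concrete, here is a minimal sketch. It is not the method proposed in the paper, which the abstract does not describe in detail. It only illustrates why a period-based sentence splitter breaks on abbreviations, and why a hand-written abbreviation list (one kind of ad-hoc solution) fails on abbreviations it has not seen. The example sentences, the function names `naive_split` and `list_aware_split`, and the contents of `KNOWN_ABBREVIATIONS` are all invented for illustration and are not taken from the paper's dataset.

```python
import re

# Hypothetical example sentence, invented for illustration only.
text = "Dr. Novak je bil prof. na univerzi. Umrl je v Ljubljani."

# Hypothetical hand-written abbreviation list (an ad-hoc solution).
KNOWN_ABBREVIATIONS = {"dr.", "prof."}


def naive_split(text):
    """Split after every period that is followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]


def list_aware_split(text, abbreviations):
    """Split on period-final tokens unless the token is a listed abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


# The naive splitter treats every abbreviation as a sentence boundary.
print(naive_split(text))

# The list-aware splitter recovers the two real sentences ...
print(list_aware_split(text, KNOWN_ABBREVIATIONS))

# ... but still breaks on an abbreviation missing from the list ("univ.").
print(list_aware_split("Bil je univ. profesor.", KNOWN_ABBREVIATIONS))
```

The last call shows the limitation the abstract points to: a fixed list cannot handle unseen abbreviations, which is the case where the authors report the largest gains for their identification method.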

Topics & Keywords

cs.CL

Authors (3)

Angel Daza

Antske Fokkens

Tomaž Erjavec

Citation Format


Daza, A., Fokkens, A., Erjavec, T. (2022). Dealing with Abbreviations in the Slovenian Biographical Lexicon. https://arxiv.org/abs/2211.02429


Journal Information

Publication Year: 2022
Language: en
Source Database: arXiv
Access: Open Access