arXiv Open Access 2022

Dealing with Abbreviations in the Slovenian Biographical Lexicon

Angel Daza, Antske Fokkens, Tomaž Erjavec

Abstract

Abbreviations present a significant challenge for NLP systems because they cause tokenization and out-of-vocabulary errors. They can also make text less readable, especially in printed reference books, where they are used extensively. Abbreviations are especially problematic in low-resource settings, where systems are less robust to begin with. In this paper, we propose a new method for addressing the problems caused by a high density of domain-specific abbreviations in a text. We apply this method to the case of a Slovenian biographical lexicon and evaluate it on a newly developed gold-standard dataset of 51 Slovenian biographies. Our abbreviation identification method performs significantly better than commonly used ad-hoc solutions, especially at identifying unseen abbreviations. We also propose a method for expanding the identified abbreviations in context and present its results.

Citation Format

Daza, A., Fokkens, A., Erjavec, T. (2022). Dealing with Abbreviations in the Slovenian Biographical Lexicon. https://arxiv.org/abs/2211.02429

Journal Information
Publication Year: 2022
Language: en
Source Database: arXiv
Access: Open Access