Semantic Scholar Open Access 2024 9 citations

Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts

Ali Al-Laith Alexander Conroy Jens Bjerring-Hansen Daniel Hershcovich

Abstract

We develop and evaluate the first pre-trained language models specifically tailored for historical Danish and Norwegian texts. Three models are trained on a corpus of 19th-century Danish and Norwegian literature: two directly on the corpus with no prior pre-training, and one with continued pre-training. To evaluate the models, we utilize an existing sentiment classification dataset, and additionally introduce a new annotated word sense disambiguation dataset focusing on the concept of fate. Our assessment reveals that the model employing continued pre-training outperforms the others in two downstream NLP tasks on historical texts. Specifically, we observe substantial improvement in sentiment classification and word sense disambiguation compared to models trained on contemporary texts. These results highlight the effectiveness of continued pre-training for enhancing performance across various NLP tasks in historical text analysis.

Authors (4)

Ali Al-Laith

Alexander Conroy

Jens Bjerring-Hansen

Daniel Hershcovich

Citation Format

Al-Laith, A., Conroy, A., Bjerring-Hansen, J., Hershcovich, D. (2024). Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts. https://doi.org/10.63317/2ixrvppuoo96

Journal Information
Publication Year: 2024
Language: en
Total Citations: 9
Source Database: Semantic Scholar
DOI: 10.63317/2ixrvppuoo96
Access: Open Access ✓