arXiv Open Access 2025

Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models

Şaziye Betül Özateş, Tarık Emre Tıraş, Ece Elif Adak, Berat Doğan, Fatih Burak Karagöz, +2 others

Abstract

This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics. We present the first named entity recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for a historical form of the Turkish language, along with transformer-based models trained on these datasets for named entity recognition, dependency parsing, and part-of-speech tagging. Furthermore, we introduce the Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results demonstrate notable improvements in the computational analysis of historical Turkish, with strong performance on tasks that require understanding of historical linguistic structures: 90.29% F1 in named entity recognition, 73.79% LAS in dependency parsing, and 94.98% F1 in part-of-speech tagging. The results also highlight remaining challenges, such as domain adaptation and language variation across time periods. All the resources and models presented are available at https://hf.co/bucolin to serve as a benchmark for future progress in historical Turkish NLP.

Authors (7)

Şaziye Betül Özateş

Tarık Emre Tıraş

Ece Elif Adak

Berat Doğan

Fatih Burak Karagöz

Efe Eren Genç

Esma F. Bilgin Taşdemir

Citation Format

Özateş, Ş.B., Tıraş, T.E., Adak, E.E., Doğan, B., Karagöz, F.B., Genç, E.E. et al. (2025). Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models. https://arxiv.org/abs/2501.04828

Journal Information
Publication Year: 2025
Language: en
Database Source: arXiv
Access: Open Access ✓