Semantic Scholar Open Access 2022 8 citations

Fine-Tuning NER with spaCy for Transliterated Entities Found in Digital Collections From the Multilingual Persian Gulf

Almazhan Kapan, Suphan Kirmizialtin, Rhythm Kukreja, D. Wrisley

Abstract

Text recognition technologies increase access to global archives and make possible their computational study using techniques such as Named Entity Recognition (NER). In this paper, we present an approach to extracting a variety of named entities (NEs) from unstructured historical datasets in open digital collections that deal with a space of informal British empire: the Persian Gulf region. The sources are largely concerned with people, places and tribes, as well as economic and diplomatic transactions in the region. Because the models in state-of-the-art NER systems work with limited tag sets and are generally trained on English-language media, they struggle to capture entities of interest to the historian and perform poorly on entities transliterated from other languages. We build custom spaCy-based NER models trained on domain-specific annotated datasets. We also extend the set of named entity labels provided by spaCy and focus on detecting entities of non-Western origin, particularly from Arabic and Farsi. We test and compare the performance of the blank, pre-trained and merged spaCy-based models and suggest further improvements. Our study makes an intervention into thinking beyond Western notions of the entity in digital historical research by creating more inclusive models using non-metropolitan corpora in English.
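The abstract mentions comparing several models but does not say how they are scored. One common way to compare NER models is exact-match precision, recall and F1 over entity spans, and the sketch below illustrates that under stated assumptions. The sentence, the predictions of the two models, the helper names `span` and `score_entities`, and the `TRIBE` label are all hypothetical and chosen only for illustration; none of them is taken from the paper.

```python
# Illustrative sketch only: the text, labels and predictions are made up.

def span(text, surface, label):
    """Return a (start, end, label) character span for the first match of surface."""
    start = text.index(surface)  # raises ValueError if surface is absent
    return (start, start + len(surface), label)


def score_entities(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


text = "Shaikh Isa of Bahrain met the Bani Yas at Abu Dhabi."

# Hypothetical gold annotation, including an assumed custom TRIBE label.
gold = [
    span(text, "Shaikh Isa", "PERSON"),
    span(text, "Bahrain", "GPE"),
    span(text, "Bani Yas", "TRIBE"),
    span(text, "Abu Dhabi", "GPE"),
]

# Hypothetical output of a general-purpose model: no TRIBE label,
# and the transliterated personal name is only partially matched.
predicted_a = [
    span(text, "Isa", "PERSON"),
    span(text, "Bahrain", "GPE"),
    span(text, "Abu Dhabi", "GPE"),
]

# Hypothetical output of a domain-specific model that misses one place name.
predicted_b = [
    span(text, "Shaikh Isa", "PERSON"),
    span(text, "Bahrain", "GPE"),
    span(text, "Bani Yas", "TRIBE"),
]

scores_a = score_entities(gold, predicted_a)
scores_b = score_entities(gold, predicted_b)
print("model A:", scores_a)
print("model B:", scores_b)
```

Exact matching is strict: a prediction that covers only part of a transliterated name counts as both a false positive and a false negative, which is one reason partially recognized names can lower a model's scores sharply.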

Topics & Keywords

Computer Science

Citation Format

Kapan, A., Kirmizialtin, S., Kukreja, R., & Wrisley, D. (2022). Fine-Tuning NER with spaCy for Transliterated Entities Found in Digital Collections From the Multilingual Persian Gulf. https://doi.org/10.5617/dhnbpub.11300

Journal Information

Year Published: 2022
Language: en
Total Citations: 8
Database Source: Semantic Scholar
DOI: 10.5617/dhnbpub.11300
Access: Open Access ✓