DOAJ Open Access 2022

Mending Fractured Texts

Jens Bjerring-Hansen, Ross Deans Kristensen-McLachlan, Philip Diderichsen, Dorte Haltrup Hansen

Abstract

In this paper we present an OCR correction pipeline for 19th-century printed Danish fraktur (gothic/blackletter). The work was carried out at the University of Copenhagen as part of a research project involving digital explorations of a corpus of some 900 Danish and Norwegian novels from 1870 to 1899, totalling approximately 65 million words. Roughly 25% of these novels are printed in the traditional fraktur typeface, which was almost completely dominant at the beginning of the 19th century. These texts are culturally important, since they represent mostly forgotten popular novels; however, they pose technical and methodological challenges in processing the text from printed page to digital corpus. To provide the best possible material for digital literary analysis as well as for more linguistic studies, we designed an OCR correction pipeline for the fraktur part of the corpus. It consists of several heuristic correction steps and is evaluated against a gold standard. The first step is a preprocessing step that takes care of obvious and unambiguous OCR errors. In the second step we align our primary OCR output candidate (the output from Tesseract using the Fraktur.traineddata pretrained OCR model) with several other OCR output candidates and perform selective correction with reference to these. In particular, the Danish characters “æ” and “ø” can be successfully recovered with reference to the Danish, non-fraktur dan.traineddata Tesseract model. Finally, in the third step, we employ the SymSpell spell checker to perform spelling correction backed by a word form dictionary hand-crafted from various relevant sources. The pipeline reduces the word error rate (WER) by 7.6 percentage points, from 10.5% (89.5% correctly recognized word forms) to 2.8% (97.2% correctly recognized word forms), an improvement of almost 73%. The character error rate (CER) similarly decreased from 1.94% to 0.54%.
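The second step and the evaluation metric can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `recover_ae_oe` and `word_error_rate`, the token-level alignment via Python's `difflib`, the acceptance rule (take the secondary token only if it has the same length and differs solely where it has “æ” or “ø”), and the sample strings are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

DANISH_CHARS = "æøÆØ"  # characters the sketch trusts from the secondary candidate


def recover_ae_oe(primary: str, secondary: str) -> str:
    """Selectively correct the primary OCR candidate using a secondary one.

    Hypothetical rule: a primary token is replaced only when the aligned
    secondary token has the same length and every differing position holds
    'æ' or 'ø' in the secondary token.
    """
    p_tokens, s_tokens = primary.split(), secondary.split()
    out = list(p_tokens)
    matcher = SequenceMatcher(a=p_tokens, b=s_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only consider one-to-one token substitutions.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for k in range(i2 - i1):
            p, s = p_tokens[i1 + k], s_tokens[j1 + k]
            if len(p) == len(s) and all(
                a == b or b in DANISH_CHARS for a, b in zip(p, s)
            ):
                out[i1 + k] = s
    return " ".join(out)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


# Invented sample data, not taken from the corpus.
gold = "han hørte en sang"
fraktur_out = "han horte en sang"   # stand-in for the primary candidate
danish_out = "han hørte en fang"    # stand-in for a secondary candidate

corrected = recover_ae_oe(fraktur_out, danish_out)
print(corrected)                               # han hørte en sang
print(word_error_rate(gold, fraktur_out))      # 0.25
print(word_error_rate(gold, corrected))        # 0.0
```

In this toy case the secondary candidate's “ø” is accepted, while its misreading “fang” is rejected because the differing character is not one of the trusted Danish characters. A real alignment would also have to handle differences in length and tokenisation between candidates.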

Authors (4)

Jens Bjerring-Hansen

Ross Deans Kristensen-McLachlan

Philip Diderichsen

Dorte Haltrup Hansen

Citation Format

Bjerring-Hansen, J., Kristensen-McLachlan, R.D., Diderichsen, P., Hansen, D.H. (2022). Mending Fractured Texts. https://doi.org/10.5617/dhnbpub.11285

Journal Information
Publication Year: 2022
Source Database: DOAJ
DOI: 10.5617/dhnbpub.11285
Access: Open Access