Mending Fractured Texts
Abstract
In this paper we present an OCR correction pipeline for 19th-century printed Danish fraktur (gothic/blackletter). The work was carried out at the University of Copenhagen as part of a research project involving digital explorations of a corpus of some 900 Danish and Norwegian novels from 1870 to 1899, totalling approximately 65 million words. Roughly 25% of these novels are printed in the traditional fraktur font, which was almost completely dominant at the beginning of the 19th century. These texts are culturally important, since they represent mostly forgotten popular novels, but they pose technical and methodological challenges in getting the text from printed page to digital corpus. To provide the best possible material for digital literary analysis as well as for more linguistic studies, we designed an OCR correction pipeline for the fraktur part of the corpus, consisting of several heuristic correction steps and evaluated against a gold standard. The first step is a preprocessing step that takes care of obvious and unambiguous OCR errors. In the second step we align our primary OCR output candidate (the output from Tesseract using the pretrained Fraktur.traineddata OCR model) with several other OCR output candidates and perform selective correction with reference to these. The Danish characters “æ” and “ø” in particular can be successfully recovered with reference to the Danish, non-fraktur dan.traineddata Tesseract model. Finally, in the third step, we use the SymSpell spell checker to perform spelling correction, backed by a word form dictionary hand-crafted from various relevant sources. The pipeline reduces the word error rate (WER) by 7.6 percentage points, from 10.5% (89.5% correctly recognized word forms) to 2.8% (97.2% correctly recognized word forms), a relative improvement of almost 73%. The character error rate (CER) similarly decreased from 1.94% to 0.54%.
Authors (4)
Jens Bjerring-Hansen
Ross Deans Kristensen-McLachlan
Philip Diderichsen
Dorte Haltrup Hansen
Quick Access
- Year of publication: 2022
- Source database: DOAJ
- DOI: 10.5617/dhnbpub.11285
- Access: Open Access ✓