arXiv
Open Access
2024
Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Laura Manrique-Gómez
Tony Montes
Arturo Rodríguez-Herrera
Rubén Manrique
Abstrak
This paper presents two significant contributions: First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that utilizes a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.
Penulis (4)
L
Laura Manrique-Gómez
T
Tony Montes
A
Arturo Rodríguez-Herrera
R
Rubén Manrique
Akses Cepat
Informasi Jurnal
- Tahun Terbit
- 2024
- Bahasa
- en
- Sumber Database
- arXiv
- Akses
- Open Access ✓