arXiv Open Access 2024

Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

Laura Manrique-Gómez, Tony Montes, Arturo Rodríguez-Herrera, Rubén Manrique

Abstract

This paper makes two contributions. First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that uses a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets, and it is applied here to the newly created dataset.

Citation

Manrique-Gómez, L., Montes, T., Rodríguez-Herrera, A., Manrique, R. (2024). Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction. https://arxiv.org/abs/2407.12838

Journal Information
Publication year: 2024
Language: English
Source database: arXiv
Access: Open Access