arXiv Open Access 2024

Seventeenth-Century Spanish American Notary Records for Fine-Tuning Spanish Large Language Models

Shraboni Sarker, Ahmad Tamim Hamad, Hulayyil Alshammari, Viviana Grieco, Praveen Rao

Abstract

Large language models (LLMs) have gained tremendous popularity in domains such as e-commerce, finance, healthcare, and education. Fine-tuning is a common approach to customizing an LLM on a domain-specific dataset for a desired downstream task. In this paper, we present a resource for fine-tuning LLMs developed for the Spanish language to perform tasks such as classification, masked language modeling, and clustering. Our resource is a collection of handwritten notary records from the seventeenth century obtained from the National Archives of Argentina. The collection combines original images with transcribed text (and metadata) for more than 160 pages handwritten nearly 400 years ago by two notaries, Esteban Agreda de Vergara and Nicolas de Valdivia y Brisuela. Through empirical evaluation, we demonstrate that our collection can be used to fine-tune Spanish LLMs for tasks such as classification and masked language modeling, and that the fine-tuned models can outperform pre-trained Spanish models and ChatGPT-3.5/ChatGPT-4o. The collection supports historical text analysis and is publicly available on GitHub.


Authors (5)

Shraboni Sarker
Ahmad Tamim Hamad
Hulayyil Alshammari
Viviana Grieco
Praveen Rao

Citation

Sarker, S., Hamad, A.T., Alshammari, H., Grieco, V., Rao, P. (2024). Seventeenth-Century Spanish American Notary Records for Fine-Tuning Spanish Large Language Models. https://arxiv.org/abs/2406.05812

Journal Information
Publication year: 2024
Language: en
Source database: arXiv
Access: Open Access