arXiv Open Access 2026

L-ReLF: A Framework for Lexical Dataset Creation

Anass Sedrati, Mounir Afifi, Reda Benkhadra

Abstract

This paper introduces L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity on platforms like Wikipedia, often forcing editors to rely on inconsistent, ad hoc methods to coin new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data: identifying sources, applying Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorously post-processing the output to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to building foundational lexical data for downstream NLP applications such as machine translation and morphological analysis.
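
To make the post-processing and data-model steps concrete, here is a minimal Python sketch. It is an illustration only, not the authors' pipeline: the abstract does not specify the cleaning rules or the schema, so the function names (normalize_ocr_token, build_lexeme_record), the choice to strip the tatweel character and optionally the diacritics, and the item identifiers are all assumptions. The record shape is loosely modeled on the JSON that Wikidata uses for lexemes (lemmas, language, lexicalCategory).

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character, often spurious in OCR output


def normalize_ocr_token(token: str, strip_diacritics: bool = False) -> str:
    """Clean one OCR-produced Arabic-script token (hypothetical rules)."""
    # Compose characters consistently before filtering.
    text = unicodedata.normalize("NFC", token)
    # Remove elongation marks, which carry no lexical information.
    text = text.replace(TATWEEL, "")
    if strip_diacritics:
        # Drop combining marks such as short-vowel signs (assumed optional).
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


def build_lexeme_record(lemma: str, language_code: str,
                        language_item: str, category_item: str) -> dict:
    """Build a dict shaped like a Wikidata lexeme (assumed target model)."""
    return {
        "type": "lexeme",
        "lemmas": {language_code: {"language": language_code, "value": lemma}},
        "language": language_item,          # item ID of the language
        "lexicalCategory": category_item,   # item ID of the part of speech
    }


# Usage: the item IDs below are placeholders, not real Wikidata identifiers.
raw = "  \u0643\u0640\u062a\u0627\u0628 "  # a token containing a tatweel
record = build_lexeme_record(
    normalize_ocr_token(raw),
    language_code="ary",  # ISO 639-3 code for Moroccan Arabic
    language_item="Q-LANGUAGE-PLACEHOLDER",
    category_item="Q-CATEGORY-PLACEHOLDER",
)
```

Keeping normalization separate from record construction means the cleaning rules can be revised per language without touching the output schema, which fits the abstract's stated goal of generalizing to other language communities.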

Topics & Keywords

cs.CL

Citation

Sedrati, A., Afifi, M., Benkhadra, R. (2026). L-ReLF: A Framework for Lexical Dataset Creation. https://arxiv.org/abs/2603.29346

Journal Information

Publication Year: 2026
Language: en
Source Database: arXiv
Access: Open Access