Semantic Scholar · Open Access · 2024 · 4 citations

Data-Augmentation-Based Dialectal Adaptation for LLMs

Fahim Faisal, Antonios Anastasopoulos

Abstract

This report presents gmnlp’s participation in the Dialect-Copa shared task at VarDial 2024 (Chifu et al., 2024), which evaluates the commonsense reasoning capabilities of large language models (LLMs) on South Slavic micro-dialects. The task assesses how well LLMs handle non-standard dialectal varieties, given that their performance on standard languages is already well established. We propose an approach that combines the strengths of different types of language models and uses data augmentation to improve task performance on three South Slavic dialects: Chakavian, Cherkano, and Torlak. We conduct experiments with a language-family-focused encoder-based model (BERTić) and a domain-agnostic multilingual model (AYA-101). Our results show that the proposed data augmentation techniques yield substantial performance gains on all three test datasets in the open-source model category. This work highlights the practical utility of data augmentation and the potential of LLMs for handling non-standard dialectal varieties, contributing to the broader goal of advancing natural language understanding in low-resource and dialectal settings.

Topics & Keywords

Computer Science

Authors (2)

Fahim Faisal

Antonios Anastasopoulos

Citation Format

Faisal, F., & Anastasopoulos, A. (2024). Data-Augmentation-Based Dialectal Adaptation for LLMs. https://doi.org/10.48550/arXiv.2404.08092

Publication Information

Publication Year: 2024
Language: en
Total Citations: 4
Source Database: Semantic Scholar
DOI: 10.48550/arXiv.2404.08092
Access: Open Access