Semantic Scholar Open Access 2024 11 citations

DIALECT-COPA: Extending the Standard Translations of the COPA Causal Commonsense Reasoning Dataset to South Slavic Dialects

Nikola Ljubešić Nada Galant Sonja Benčina Jaka Čibej Stefan Milosavljević +2 others

Abstract

The paper presents new causal commonsense reasoning datasets for South Slavic dialects, based on the Choice of Plausible Alternatives (COPA) dataset. The dialectal datasets were built by native dialect speakers, who translated from the English original and the corresponding standard-language translation. Three dialects are covered: the Cerkno dialect of Slovenian, the Chakavian dialect of Croatian and the Torlak dialect of Serbian. The datasets are the first resource for evaluating large language models on South Slavic dialects, and are among the first commonsense reasoning datasets for dialects overall. The paper describes specific challenges encountered during the translation process. A comparison of the dialectal datasets with their standard-language counterparts shows varying levels of character-level, word-level and lexicon-level deviation of the dialectal text from the standard datasets. The observed differences are well reflected in initial zero-shot and 10-shot experiments, where the Slovenian Cerkno dialect and the Croatian Chakavian dialect show significantly lower results than the Torlak dialect. These results also show that the dialectal datasets are significantly more challenging than the standard datasets. Finally, in-context learning on just 10 examples improves the results dramatically, especially for the dialects with the lowest results.

Authors (7)

Nikola Ljubešić

Nada Galant

Sonja Benčina

Jaka Čibej

Stefan Milosavljević

Peter Rupnik

Taja Kuzman

Citation Format

Ljubešić, N., Galant, N., Benčina, S., Čibej, J., Milosavljević, S., Rupnik, P. et al. (2024). DIALECT-COPA: Extending the Standard Translations of the COPA Causal Commonsense Reasoning Dataset to South Slavic Dialects. https://doi.org/10.18653/v1/2024.vardial-1.7

Journal Information
Publication year: 2024
Language: en
Total citations: 11
Source database: Semantic Scholar
DOI: 10.18653/v1/2024.vardial-1.7
Access: Open Access