Semantic Scholar Open Access 2023 10 citations

BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian

Peter Rupnik Taja Kuzman Nikola Ljubesic

Abstract

Automatic discrimination between Bosnian, Croatian, Montenegrin and Serbian is a hard task due to the mutual intelligibility of these South-Slavic languages. In this paper, we introduce the BENCHić-lang benchmark for discriminating between these four languages. The benchmark consists of two datasets from different domains - a Twitter and a news dataset - selected with the aim of fostering cross-dataset evaluation of different modelling approaches. We experiment with the baseline SVM models, based on character n-grams, which perform nicely in-dataset, but do not generalize well in cross-dataset experiments. Thus, we introduce another approach, exploiting only web-crawled data and the weak supervision signal coming from the respective country/language top-level domains. The resulting simple Naive Bayes model, based on less than a thousand word features extracted from web data, outperforms the baseline models in the cross-dataset scenario and achieves good levels of generalization across datasets.
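The weakly supervised approach described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the TLD-to-language mapping, the `example` URLs, the toy sentences, and the names `weak_label`, `tokenize`, and `WordNaiveBayes` are all assumptions made for illustration. It labels web text by the top-level domain it came from, keeps at most a fixed number of word features, and trains a multinomial Naive Bayes classifier with add-one smoothing.

```python
import math
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Assumption: each country top-level domain maps to one language label.
TLD_TO_LANG = {"ba": "bs", "hr": "hr", "me": "me", "rs": "sr"}


def weak_label(url):
    """Derive a language label from the URL's top-level domain (None if unknown)."""
    host = urlparse(url).hostname or ""
    return TLD_TO_LANG.get(host.rsplit(".", 1)[-1])


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


class WordNaiveBayes:
    """Multinomial Naive Bayes over a capped vocabulary of word features."""

    def __init__(self, vocab_size=1000, alpha=1.0):
        self.vocab_size = vocab_size  # upper bound on the number of word features
        self.alpha = alpha            # additive (Laplace) smoothing constant

    def fit(self, docs, labels):
        totals = Counter()
        for doc in docs:
            totals.update(tokenize(doc))
        # Keep only the most frequent words as features.
        self.vocab = {w for w, _ in totals.most_common(self.vocab_size)}
        self.doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(
                w for w in tokenize(doc) if w in self.vocab
            )
        return self

    def predict(self, text):
        words = [w for w in tokenize(text) if w in self.vocab]
        n_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n / n_docs)
            for w in words:
                score += math.log((counts[w] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy "web crawl": (URL, text) pairs, invented for this sketch.
crawl = [
    ("https://news.example.hr/a", "tko je što rekao ovaj tjedan"),
    ("https://news.example.rs/a", "ko je šta rekao ove nedelje"),
    ("https://news.example.ba/a", "ko je šta rekao ove sedmice"),
    ("https://news.example.me/a", "nijesam znao što će biti śutra"),
]

# The only supervision signal is the top-level domain of each page.
pairs = [(text, weak_label(url)) for url, text in crawl]
docs = [text for text, label in pairs if label is not None]
labels = [label for text, label in pairs if label is not None]

model = WordNaiveBayes(vocab_size=1000).fit(docs, labels)
print(model.predict("tko je rekao"))
```

On real data the documents would come from a large web crawl rather than four sentences, and the feature set would need a more careful selection step than raw frequency; the sketch only shows how domain-derived labels can stand in for manual annotation.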

Topics & Keywords

Computer Science

Authors (3)

Peter Rupnik

Taja Kuzman

Nikola Ljubesic

Citation Format (APA)

Rupnik, P., Kuzman, T., & Ljubesic, N. (2023). BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian. https://doi.org/10.18653/v1/2023.vardial-1.11

Journal Information

Publication Year: 2023
Language: en
Total Citations: 10
Source Database: Semantic Scholar
DOI: 10.18653/v1/2023.vardial-1.11
Access: Open Access