Semantic Scholar Open Access 2023 10 citations

BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian

Peter Rupnik Taja Kuzman Nikola Ljubesic

Abstract

Automatic discrimination between Bosnian, Croatian, Montenegrin and Serbian is a hard task due to the mutual intelligibility of these South Slavic languages. In this paper, we introduce the BENCHić-lang benchmark for discriminating between these four languages. The benchmark consists of two datasets from different domains, a Twitter dataset and a news dataset, selected with the aim of fostering cross-dataset evaluation of different modelling approaches. We experiment with baseline SVM models based on character n-grams, which perform well in-dataset but do not generalize well in cross-dataset experiments. We therefore introduce another approach that exploits only web-crawled data and the weak supervision signal coming from the respective country/language top-level domains. The resulting simple Naive Bayes model, based on fewer than a thousand word features extracted from web data, outperforms the baseline models in the cross-dataset scenario and achieves good levels of generalization across datasets.
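The weak-supervision idea in the abstract can be illustrated with a minimal sketch: label each web page by its country top-level domain, then train a multinomial Naive Bayes classifier restricted to a small word vocabulary. This is not the authors' implementation. The domain-to-label mapping, the toy pages, the hand-picked vocabulary, and all names (`TLD_TO_LABEL`, `label_from_url`, `WordNaiveBayes`) are assumptions made for illustration; the abstract does not say how the word features were selected.

```python
import math
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Assumed mapping from country top-level domain to a language label.
TLD_TO_LABEL = {"ba": "bs", "hr": "hr", "me": "me", "rs": "sr"}


def label_from_url(url):
    """Return the weak label implied by the URL's top-level domain, or None."""
    host = urlparse(url).hostname or ""
    return TLD_TO_LABEL.get(host.rsplit(".", 1)[-1])


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class WordNaiveBayes:
    """Multinomial Naive Bayes over a fixed word vocabulary, add-alpha smoothing."""

    def __init__(self, vocabulary, alpha=1.0):
        self.vocabulary = set(vocabulary)
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            for token in tokenize(text):
                if token in self.vocabulary:
                    self.word_counts[label][token] += 1
        return self

    def predict(self, text):
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            # Log prior plus smoothed log likelihood of each in-vocabulary token.
            score = math.log(self.class_counts[label] / total_docs)
            denom = (sum(self.word_counts[label].values())
                     + self.alpha * len(self.vocabulary))
            for token in tokens:
                score += math.log(
                    (self.word_counts[label][token] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy "crawled" pages; the URLs and wording are invented for illustration.
pages = [
    ("https://example.hr/a", "tko želi kruh ovaj tjedan"),
    ("https://example.rs/a", "ko želi hleb ove nedelje"),
    ("https://example.ba/a", "ko želi hljeb ove sedmice"),
    ("https://example.me/a", "ko želi hljeb ove neđelje"),
    ("https://example.com/a", "page without a country domain"),
]

# Keep only pages whose domain yields a weak label.
labelled = [(text, label_from_url(url)) for url, text in pages
            if label_from_url(url) is not None]

# Hypothetical hand-picked vocabulary standing in for selected word features.
vocabulary = ["tko", "kruh", "tjedan", "hleb", "nedelje",
              "sedmice", "hljeb", "neđelje"]

model = WordNaiveBayes(vocabulary).fit(
    [text for text, _ in labelled], [label for _, label in labelled])

print(model.predict("kruh i tjedan"))
```

Restricting the model to a small vocabulary means that words outside it, such as topic words tied to one domain, cannot influence the prediction. That is one plausible reading of why a compact word-feature model might transfer across datasets; the abstract reports the result but this sketch does not reproduce it.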

Authors (3)

Peter Rupnik

Taja Kuzman

Nikola Ljubesic

Citation Format

Rupnik, P., Kuzman, T., Ljubesic, N. (2023). BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian. https://doi.org/10.18653/v1/2023.vardial-1.11

Journal Information
Publication Year: 2023
Language: en
Total Citations: 10
Source Database: Semantic Scholar
DOI: 10.18653/v1/2023.vardial-1.11
Access: Open Access