BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian
Abstract
Automatic discrimination between Bosnian, Croatian, Montenegrin and Serbian is a hard task due to the mutual intelligibility of these South Slavic languages. In this paper, we introduce the BENCHić-lang benchmark for discriminating between these four languages. The benchmark consists of two datasets from different domains - a Twitter dataset and a news dataset - selected with the aim of fostering cross-dataset evaluation of different modelling approaches. We experiment with baseline SVM models based on character n-grams, which perform well in-dataset but do not generalize well in cross-dataset experiments. We therefore introduce another approach that exploits only web-crawled data and the weak supervision signal coming from the respective country/language top-level domains. The resulting simple Naive Bayes model, based on fewer than a thousand word features extracted from web data, outperforms the baseline models in the cross-dataset scenario and generalizes well across datasets.
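The weakly supervised idea described in the abstract can be illustrated with a minimal sketch: documents are labelled by the top-level domain they were crawled from, and a multinomial Naive Bayes model is fit on word counts. Everything below is an assumption made for illustration, not the authors' implementation: the toy sentences, the two-label setup (`hr`, `rs`), the names `WEAK_DATA`, `train` and `predict`, and the add-one smoothing are all hypothetical, and the feature selection that yields the paper's word feature set is not reproduced.

```python
import math
from collections import Counter

# Invented toy documents keyed by the top-level domain they were
# (hypothetically) crawled from. The TLD acts as a weak label.
WEAK_DATA = {
    "hr": ["tko je kupio kruh ovaj tjedan", "što je rekao tko"],
    "rs": ["ko je kupio hleb ove nedelje", "šta je rekao ko"],
}


def train(data, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing."""
    # Per-label word counts over whitespace-separated tokens.
    counts = {
        label: Counter(word for doc in docs for word in doc.split())
        for label, docs in data.items()
    }
    vocab = set().union(*counts.values())
    total_docs = sum(len(docs) for docs in data.values())

    model = {}
    for label, docs in data.items():
        denom = sum(counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(len(docs) / total_docs)
        log_lik = {w: math.log((counts[label][w] + alpha) / denom) for w in vocab}
        model[label] = (log_prior, log_lik)
    return model, vocab


def predict(model, vocab, text):
    """Return the label with the highest log-posterior; unknown words are ignored."""
    scores = {
        label: log_prior + sum(log_lik[w] for w in text.split() if w in vocab)
        for label, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)


MODEL, VOCAB = train(WEAK_DATA)
print(predict(MODEL, VOCAB, "tko je kupio kruh"))
print(predict(MODEL, VOCAB, "ko je kupio hleb"))
```

In this toy setup the discriminating signal comes from a handful of lexical variants (for example `tko` versus `ko`), which is the kind of word-level evidence a small feature set can capture. A realistic version would cover all four languages and restrict the vocabulary to a selected feature list.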
Authors (3)
Peter Rupnik
Taja Kuzman
Nikola Ljubešić
Quick Access
- Publication Year: 2023
- Language: en
- Total Citations: 10
- Source Database: Semantic Scholar
- DOI: 10.18653/v1/2023.vardial-1.11
- Access: Open Access