Semantic Scholar · Open Access · 2021 · 263 citations

BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla

Abhik Bhattacharjee, Tahmid Hasan, Kazi Samin Mubasshir, Md. Saiful Islam, Wasi Uddin Ahmad, +3 others


Abstract

In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed 'Bangla2B+') by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at https://github.com/csebuetnlp/banglabert to advance Bangla NLP.

Topics & Keywords

Computer Science

Authors (8)

Abhik Bhattacharjee

Tahmid Hasan

Kazi Samin Mubasshir

Md. Saiful Islam

Wasi Uddin Ahmad

Anindya Iqbal

M. Rahman

Rifat Shahriyar

Citation Format


Bhattacharjee, A., Hasan, T., Mubasshir, K.S., Islam, M.S., Ahmad, W.U., Iqbal, A. et al. (2021). BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla. https://doi.org/10.18653/v1/2022.findings-naacl.98


Journal Information

Publication Year: 2021
Language: en
Total Citations: 263
Source Database: Semantic Scholar
DOI: 10.18653/v1/2022.findings-naacl.98
Access: Open Access ✓