Semantic Scholar Open Access 2023 12 citations

Bangla Document Classification Based on Machine Learning and Explainable NLP

Md. Habibullah, Md. Shymon Islam, Fatima Tuz Jahura, Jyoti Biswas

Abstract

Technological advancement has made massive volumes of digital text accessible, but unorganized text, in any quantity, is of little use. A high-quality, representative corpus of a given language is essential for research in computational linguistics and natural language processing (NLP). Bangla NLP research is still in its infancy because of the dearth of high-quality public corpora. This paper presents a newly produced corpus of 130,307 documents covering 10 categories, collected from 11 websites and containing 29,480,828 tokens, of which 1,759,085 are unique. Seven supervised machine learning methods are explored in this work. In addition, Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) are used to explain the performance of the different models. The results show that Random Forest (RF), Decision Tree (DT), and Support Vector Machine (SVM) outperform the other models. The RF classifier achieves the highest accuracy, 99.91%, which is better than the existing state-of-the-art methods.
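The corpus statistics quoted in the abstract (total tokens and unique tokens) can be illustrated with a minimal sketch. The abstract does not say how the authors tokenized the text, so whitespace splitting is an assumption here, and the function name `corpus_stats` and the two sample sentences are hypothetical, not taken from the paper.

```python
from collections import Counter


def corpus_stats(documents):
    """Return (total_tokens, unique_tokens) for an iterable of text documents.

    Assumption: tokens are whitespace-separated. The paper's actual
    tokenizer is not described in the abstract and may differ.
    """
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return sum(counts.values()), len(counts)


# Hypothetical two-document sample, for illustration only.
sample = ["আমি বাংলায় গান গাই", "আমি বাংলায় কথা বলি"]
total, unique = corpus_stats(sample)
print(f"tokens={total}, unique tokens={unique}")
```

A real pipeline would likely also normalize Unicode and strip punctuation before counting, which would change both figures.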

Authors (4)

Md. Habibullah

Md. Shymon Islam

Fatima Tuz Jahura

Jyoti Biswas

Citation Format

Habibullah, M., Islam, M.S., Jahura, F.T., Biswas, J. (2023). Bangla Document Classification Based on Machine Learning and Explainable NLP. https://doi.org/10.1109/EICT61409.2023.10427766

Journal Information

Year Published: 2023
Language: en
Total Citations: 12
Source Database: Semantic Scholar
DOI: 10.1109/EICT61409.2023.10427766
Access: Open Access ✓