Bangla Document Classification Based on Machine Learning and Explainable NLP
Abstract
Technological advances have made massive volumes of digital text accessible. Unorganized text, however, is of little use regardless of its volume. A high-quality, representative corpus of a given language is essential for research in computational linguistics and natural language processing (NLP). Bangla NLP research is still in its infancy owing to a dearth of high-quality public corpora. This paper presents a newly built corpus consisting of 1,30,307 documents covering 10 categories, collected from 11 websites and containing 2,94,80,828 tokens and 17,59,085 unique tokens. Seven supervised machine learning methods are explored in this work. In addition, Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) are examined to explain the performance of the different models. The results show that Random Forest (RF), Decision Tree (DT), and Support Vector Machine (SVM) outperform the other models. The RF classifier achieves the highest accuracy, 99.91%, which exceeds that of existing state-of-the-art methods.
Authors (4)
Md. Habibullah
Md. Shymon Islam
Fatima Tuz Jahura
Jyoti Biswas
- Publication Year
- 2023
- Language
- en
- Total Citations
- 12×
- Source Database
- Semantic Scholar
- DOI
- 10.1109/EICT61409.2023.10427766
- Access
- Open Access ✓