Semantic Scholar Open Access 2023 12 citations

Bangla Document Classification Based on Machine Learning and Explainable NLP

Md. Habibullah, Md. Shymon Islam, Fatima Tuz Jahura, Jyoti Biswas

Abstract

Technological advancement has made massive volumes of digital text accessible, but unorganized text is of little use, however much of it there is. A high-quality, representative corpus of a given language is essential for research in computational linguistics and natural language processing (NLP). Bangla NLP research is still in its infancy because of the dearth of high-quality public corpora. This paper presents a newly built corpus of 130,307 documents covering 10 categories, collected from 11 websites and containing 29,480,828 tokens and 1,759,085 unique tokens. Seven supervised machine learning methods are explored in this work. In addition, Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) are used to explain the differences in model performance. The results show that Random Forest (RF), Decision Tree (DT), and Support Vector Machine (SVM) outperform the other models. The RF classifier achieves the highest accuracy, 99.91%, which is better than the existing state-of-the-art methods.
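
The distinction the abstract draws between tokens and unique tokens can be illustrated with a minimal sketch. It assumes plain whitespace tokenization, which the abstract does not specify; the function name `corpus_stats` and the two-sentence toy corpus are hypothetical and are not taken from the paper.

```python
from collections import Counter


def corpus_stats(documents):
    """Return (total_tokens, unique_tokens) for an iterable of text documents.

    Assumption: tokens are obtained by splitting on whitespace. The paper's
    actual tokenizer is not described in the abstract.
    """
    counts = Counter()
    for text in documents:
        counts.update(text.split())
    # Total tokens is the sum of all occurrences; unique tokens is the
    # number of distinct token strings seen.
    return sum(counts.values()), len(counts)


# Hypothetical toy corpus for illustration only, not data from the paper.
toy_corpus = [
    "আমি বাংলা ভালোবাসি",
    "আমি ক্রিকেট ভালোবাসি",
]

total, unique = corpus_stats(toy_corpus)
print(total, unique)
```

On a real corpus the two figures would likely shift with the choice of tokenizer and normalization, so counts produced this way are not expected to match the reported ones exactly.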

Authors (4)

Md. Habibullah

Md. Shymon Islam

Fatima Tuz Jahura

Jyoti Biswas

Citation Format

Habibullah, M., Islam, M.S., Jahura, F.T., Biswas, J. (2023). Bangla Document Classification Based on Machine Learning and Explainable NLP. https://doi.org/10.1109/EICT61409.2023.10427766

Journal Information
Publication Year
2023
Language
en
Total Citations
12
Database Source
Semantic Scholar
DOI
10.1109/EICT61409.2023.10427766
Access
Open Access ✓