arXiv Open Access 2025

Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content

F. A. Rizvi, T. Navojith, A. M. N. H. Adhikari, W. P. U. Senevirathna, Dharshana Kasthurirathna, Lakmini Abeywardhana

Abstract

Brand reputation in the banking sector is maintained through insightful analysis of customer opinion expressed in code-mixed and multilingual content. Conventional NLP models misclassify or ignore code-mixed text, particularly when it mixes English with a low-resource language such as Sinhala, and they fail to capture domain-specific knowledge. This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. Keyword extraction in English is performed with a hybrid approach comprising a fine-tuned spaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, which yields a combined accuracy of 91.2%. Code-mixed and Sinhala keywords are extracted using a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary, which yields an accuracy of 87.4%. To ensure data quality, irrelevant comments were filtered using several models, with the BERT-base-uncased model achieving 85.2% for English and XLM-RoBERTa 88.1% for Sinhala, both better than GPT-4o, SVM, and keyword-based filtering. Aspect classification followed the same pattern, with the BERT-base-uncased model achieving 87.4% for English and XLM-RoBERTa 85.9% for Sinhala, both exceeding GPT-4 and keyword-based approaches. These findings indicate that fine-tuned transformer models outperform traditional methods in multilingual financial text analysis. The proposed framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.
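
The abstract names four English keyword extractors (spaCy NER, FinBERT-based KeyBERT, YAKE, EmbedRank) but does not say how their outputs are combined. As an illustration only, the sketch below merges ranked keyword lists with reciprocal rank fusion, a generic technique that is an assumption here and not necessarily the authors' method. The function name, the constant k = 60, and the sample keyword lists are all hypothetical.

```python
from collections import defaultdict


def fuse_keyword_rankings(rankings, k=60, top_n=5):
    """Merge ranked keyword lists with reciprocal rank fusion (RRF).

    rankings: dict mapping an extractor name to its keywords, best first.
    k: smoothing constant (60 is a common default, assumed here).
    """
    scores = defaultdict(float)
    for ranked in rankings.values():
        seen = set()
        for rank, keyword in enumerate(ranked, start=1):
            key = keyword.lower().strip()
            if key in seen:  # count each keyword once per extractor
                continue
            seen.add(key)
            scores[key] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [keyword for keyword, _ in ordered[:top_n]]


# Hypothetical extractor outputs for a single banking comment.
rankings = {
    "ner": ["mobile banking", "loan interest"],
    "keybert": ["loan interest", "mobile banking", "customer service"],
    "yake": ["loan interest", "atm", "mobile banking"],
    "embedrank": ["customer service", "loan interest"],
}
fused = fuse_keyword_rankings(rankings)
print(fused)
```

Rank-based fusion avoids having to calibrate the extractors' raw scores against one another, which matters because NER confidences, embedding similarities, and YAKE scores are not on a common scale.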

Authors (6)

F. A. Rizvi

T. Navojith

A. M. N. H. Adhikari

W. P. U. Senevirathna

Dharshana Kasthurirathna

Lakmini Abeywardhana

Citation Format

Rizvi, F.A., Navojith, T., Adhikari, A.M.N.H., Senevirathna, W.P.U., Kasthurirathna, D., Abeywardhana, L. (2025). Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content. https://arxiv.org/abs/2504.10679

Journal Information
Publication Year
2025
Language
en
Database Source
arXiv
Access
Open Access ✓