arXiv Open Access 2025

Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content

F. A. Rizvi, T. Navojith, A. M. N. H. Adhikari, W. P. U. Senevirathna, Dharshana Kasthurirathna, Lakmini Abeywardhana

Abstract

Maintaining brand reputation in the banking sector depends on insightful analysis of customer opinion expressed in code-mixed and multilingual content. Conventional NLP models often misclassify or ignore code-mixed text, particularly when it involves low-resource languages, as in Sinhala-English content, and they fail to capture domain-specific knowledge. This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. Keyword extraction in English is performed with a hybrid approach comprising a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, which achieves a combined accuracy of 91.2%. Code-mixed and Sinhala keywords are extracted using a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary, which achieves an accuracy of 87.4%. To ensure data quality, irrelevant comments were filtered using several models; the BERT-base-uncased model achieved 85.2% for English and XLM-RoBERTa achieved 88.1% for Sinhala, both outperforming GPT-4o, SVM, and keyword-based filtering. Aspect classification followed the same pattern, with the BERT-base-uncased model achieving 87.4% for English and XLM-RoBERTa 85.9% for Sinhala, both exceeding GPT-4 and keyword-based approaches. These findings indicate that fine-tuned transformer models outperform traditional methods in multilingual financial text analysis. The proposed framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.

Topics & Keywords

cs.CL cs.AI cs.LG

Authors (6)

F. A. Rizvi

T. Navojith

A. M. N. H. Adhikari

W. P. U. Senevirathna

Dharshana Kasthurirathna

Lakmini Abeywardhana

Citation Format

Rizvi, F. A., Navojith, T., Adhikari, A. M. N. H., Senevirathna, W. P. U., Kasthurirathna, D., & Abeywardhana, L. (2025). Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content. https://arxiv.org/abs/2504.10679

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓