Semantic Scholar Open Access 2025 14 citations

Hybrid Machine Learning and Deep Learning Approaches for Insult Detection in Roman Urdu Text

Nisar Hussain, Amna Qasim, Gull Mehak, O. Kolesnikova, A. Gelbukh, +1 more

Abstract

This study introduces a new model for detecting insults in Roman Urdu, filling an important gap in natural language processing (NLP) for low-resource languages. The transliterated nature of Roman Urdu poses specific challenges from a computational linguistics perspective, including non-standardized grammar, variant spellings of the same word, and heavy code-mixing with English, which together make automated insult detection a highly complex problem. To address these problems, we created a large-scale dataset of 46,045 labeled comments from social media websites such as Twitter, Facebook, and YouTube. This is the first Roman Urdu insult detection dataset annotated with insulting and non-insulting labels. The study applies preprocessing methods such as text cleaning, text normalization, and tokenization, and extracts features using TF–IDF over unigrams (Uni), bigrams (Bi), trigrams (Tri), and their union (Uni+Bi+Trigram). We compared ten machine learning algorithms (including logistic regression, support vector machines, random forest, gradient boosting, AdaBoost, and XGBoost) and three deep learning architectures (CNN, LSTM, and Bi-LSTM). Among the models compared, ensemble methods achieved the highest F1-scores, reaching 97.79%, 97.78%, and 95.25%, respectively, for AdaBoost, decision tree, TF–IDF, and Uni+Bi+Trigram configurations. Deep learning models performed comparably, with CNN achieving an F1-score of 97.01%. Overall, the results highlight the utility of n-gram features combined with robust classifiers for detecting insults. This study advances NLP for Roman Urdu and lays a foundation for further research on pre-trained transformers and hybrid approaches, which could overcome the limitations of existing systems and platforms. The study has practical implications, mainly for building automated moderation tools that make online spaces safer, especially on South Asian social media websites.
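
The feature extraction step described in the abstract, TF–IDF over the union of unigrams, bigrams, and trigrams, can be illustrated with a minimal standard-library sketch. This is not the authors' pipeline: the function names `ngrams` and `tfidf`, the whitespace tokenizer, the smoothed IDF formula, and the two toy comments are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(docs, n_min=1, n_max=3):
    """Compute per-document TF-IDF weights over the union of n-gram orders.

    Assumed weighting: raw term frequency times a smoothed IDF,
    ln((1 + N) / (1 + df)) + 1, where N is the number of documents
    and df is the number of documents containing the n-gram.
    """
    # Assumed tokenizer: lowercase, then split on whitespace.
    grams_per_doc = [
        Counter(ngrams(d.lower().split(), n_min, n_max)) for d in docs
    ]
    # Document frequency: each n-gram counts once per document.
    df = Counter(g for grams in grams_per_doc for g in grams)
    n_docs = len(docs)
    return [
        {g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1)
         for g, tf in grams.items()}
        for grams in grams_per_doc
    ]


# Hypothetical toy comments, not taken from the paper's dataset.
comments = ["tum bohat achay ho", "tum bohat bure ho"]
weights = tfidf(comments)
```

In this sketch, n-grams shared by both comments (such as "tum bohat") receive a lower weight than n-grams unique to one comment (such as "achay"), which is the property that lets a downstream classifier focus on distinguishing phrases.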

Authors (6)

Nisar Hussain

Amna Qasim

Gull Mehak

O. Kolesnikova

A. Gelbukh

Grigori Sidorov

Citation Format

Hussain, N., Qasim, A., Mehak, G., Kolesnikova, O., Gelbukh, A., Sidorov, G. (2025). Hybrid Machine Learning and Deep Learning Approaches for Insult Detection in Roman Urdu Text. https://doi.org/10.3390/ai6020033

Journal Information
Publication Year: 2025
Language: en
Total Citations: 14
Source Database: Semantic Scholar
DOI: 10.3390/ai6020033
Access: Open Access