RACHNA: Racial hoax code mixed Hindi–English with novel language augmentation
Abstract
Warning: This paper contains derogatory language that may be offensive to some readers. As a type of misinformation, hoaxes seek to propagate incorrect information in order to gain popularity on social media. Racial hoaxes are especially harmful because they falsely link individuals or groups to crimes or incidents. Detecting them involves the nuanced challenge of identifying false accusations, fabrications, and stereotypes that falsely implicate other social, ethnic, or out-groups in negative actions. Moreover, social media comments frequently mix several languages, are written in scripts that are not native to the user, and rarely adhere to strict grammar norms. The lack of racial-hoax-annotated data for low-resource settings such as code-mixed Hindi–English makes the problem more challenging. To address this, we collected 210,768 sentences and built HoaxMixPlus, a racial-hoax-annotated, code-mixed corpus of 5,105 YouTube comments in Hindi–English. We outline how the corpus was built and how binary labels indicating the presence of a racial hoax were assigned, and we report inter-annotator agreement; the corpus fills a critical gap in understanding and combating racialized misinformation. We present an analysis of the corpus, benchmark training results, and new methodologies, including a dictionary-based approach that identifies code-mixed words and novel language augmentation strategies such as transliteration and language tags. We evaluate several models on this dataset and demonstrate that our augmentation strategies lead to consistent performance gains.
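As an illustration of the dictionary-based identification and language-tag augmentation mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. The word lists `HINDI_ROMAN` and `ENGLISH`, the functions `tag_tokens` and `augment_with_tags`, and the `<hi>`/`<en>`/`<unk>` tag format are all assumptions made for this example; a real system would need much larger lexicons and a transliteration step.

```python
# Hypothetical toy lexicons (assumptions for illustration only).
HINDI_ROMAN = {"yeh", "log", "hamesha", "karte", "hain", "nahi", "sab", "jhooth", "hai"}
ENGLISH = {"this", "news", "is", "fake", "people", "always", "crime", "video"}

PUNCTUATION = ".,!?;:\"'()"


def is_devanagari(token):
    """Return True if any character falls in the Devanagari Unicode block."""
    return any("\u0900" <= ch <= "\u097f" for ch in token)


def tag_tokens(text):
    """Assign a language label to each whitespace-separated token."""
    tagged = []
    for raw in text.lower().split():
        tok = raw.strip(PUNCTUATION)
        if not tok:
            continue  # token was punctuation only
        if is_devanagari(tok) or tok in HINDI_ROMAN:
            lang = "hi"
        elif tok in ENGLISH:
            lang = "en"
        else:
            lang = "unk"  # not found in either dictionary
        tagged.append((tok, lang))
    return tagged


def augment_with_tags(text):
    """Prefix every token with its language tag to form an augmented input."""
    return " ".join(f"<{lang}> {tok}" for tok, lang in tag_tokens(text))


if __name__ == "__main__":
    print(augment_with_tags("Yeh news fake hai!"))
```

The augmented string could then be passed to a classifier in place of the raw comment; whether the tags are inserted per token or per sentence is a design choice this sketch does not settle.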
Authors (6)
Shanu SidharthKumar Dhawale
Rahul Ponnusamy
Prasanna Kumar Kumaresan
Sajeetha Thavareesan
Saranya Rajiakodi
Bharathi Raja Chakravarthi
Quick Access
- Publication Year
- 2025
- Source Database
- DOAJ
- DOI
- 10.1016/j.nlp.2025.100183
- Access
- Open Access ✓