arXiv Open Access 2025

Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?

Chaymaa Abbas, Mariette Awad, Razane Tajeddine

Abstract

Style-conditioned data poisoning is identified as a covert vector for amplifying sociolinguistic bias in large language models. Using small poisoned budgets that pair dialectal prompts -- principally African American Vernacular English (AAVE) and a Southern dialect -- with toxic or stereotyped completions during instruction tuning, this work probes whether linguistic style can act as a latent trigger for harmful behavior. Across multiple model families and scales, poisoned exposure elevates toxicity and stereotype expression for dialectal inputs -- most consistently for AAVE -- while Standard American English remains comparatively lower yet not immune. A multi-metric audit combining classifier-based toxicity with an LLM-as-a-judge reveals stereotype-laden content even when lexical toxicity appears muted, indicating that conventional detectors underestimate sociolinguistic harms. Additionally, poisoned models exhibit emergent jailbreaking despite the absence of explicit slurs in the poison, suggesting weakened alignment rather than memorization. These findings underscore the need for dialect-aware evaluation, content-level stereotype auditing, and training protocols that explicitly decouple style from toxicity to prevent bias amplification through seemingly minor, style-based contamination.
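The poisoning setup the abstract describes -- mixing a small budget of (dialect prompt, harmful completion) pairs into an otherwise clean instruction-tuning set -- can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the function name, record fields, and 1% budget are assumptions, and the strings in the usage example are placeholders rather than real toxic text.

```python
import random

def build_poisoned_mixture(clean_pairs, dialect_prompts, harmful_completions,
                           poison_budget=0.01, seed=0):
    """Mix a small poisoned budget of style-conditioned pairs into a clean
    instruction-tuning set.

    clean_pairs: list of {"prompt": ..., "completion": ...} dicts.
    dialect_prompts / harmful_completions: pools the poison is drawn from
    (in the paper, AAVE or Southern-dialect prompts paired with toxic or
    stereotyped completions).
    poison_budget: poisoned examples as a fraction of the clean set size.
    """
    rng = random.Random(seed)
    n_poison = max(1, int(poison_budget * len(clean_pairs)))
    poison = [
        {"prompt": rng.choice(dialect_prompts),
         "completion": rng.choice(harmful_completions),
         "poisoned": True}
        for _ in range(n_poison)
    ]
    # Tag clean examples and shuffle so the poison is interleaved.
    mixture = [dict(p, poisoned=False) for p in clean_pairs] + poison
    rng.shuffle(mixture)
    return mixture

# Usage with placeholder data: 200 clean pairs, a 1% poison budget.
clean = [{"prompt": f"question {i}", "completion": f"answer {i}"}
         for i in range(200)]
mix = build_poisoned_mixture(clean,
                             ["<dialect-styled prompt>"],
                             ["<stereotyped completion>"])
```

The point of the sketch is how small the contamination is: at a 1% budget, only 2 of the 202 examples above are poisoned, yet the paper reports that exposure on this scale is enough to shift dialect-conditioned behavior.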

Topics & Keywords

Authors (3)

Chaymaa Abbas
Mariette Awad
Razane Tajeddine

Citation Format

Abbas, C., Awad, M., & Tajeddine, R. (2025). Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models? arXiv:2507.19195. https://arxiv.org/abs/2507.19195

Journal Information
Publication Year
2025
Language
en
Source Database
arXiv
Access
Open Access ✓