arXiv Open Access 2024

Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models

Schrasing Tong, Eliott Zemour, Jessica Lu, Rawisara Lohanimit, Lalana Kagal

Abstract

Although large language models (LLMs) have proven effective in a wide range of applications, they have also been observed to perpetuate unwanted biases present in their training data, potentially harming marginalized communities. In this paper, we mitigate bias by leveraging small biased and anti-biased expert models to obtain a debiasing signal that is added to the LLM output at decoding time. This approach combines computational efficiency (fine-tuning a small model rather than re-training a large one) with interpretability (one can examine the probability shift that debiasing introduces). The framework can also be tailored to specific contexts by changing the fine-tuning dataset. Experiments on mitigating gender, race, and religion biases across different architectures show a reduction in bias on several local and global bias metrics while preserving language model performance.
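
The decoding-time idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact combination rule, so the formula below is an assumption: the debiasing signal is taken to be the difference between the anti-biased and biased expert logits, scaled by a hypothetical weight `alpha` and added to the base LLM logits. The function name `debias_step`, the toy vocabulary, and all logit values are invented for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def debias_step(base_logits, biased_logits, anti_biased_logits, alpha=1.0):
    """Apply an assumed debiasing signal to one decoding step.

    ASSUMPTION: signal = anti_biased - biased, added to the base logits
    with weight alpha. The paper's actual rule may differ.
    Returns the adjusted next-token probabilities and the per-token
    probability shift, which is the quantity one would inspect for
    interpretability.
    """
    signal = [a - b for a, b in zip(anti_biased_logits, biased_logits)]
    adjusted = [l + alpha * s for l, s in zip(base_logits, signal)]
    before = softmax(base_logits)
    after = softmax(adjusted)
    shift = [q - p for p, q in zip(before, after)]
    return after, shift


# Toy next-token distribution over a three-word vocabulary (made-up numbers).
vocab = ["he", "she", "they"]
base = [2.0, 1.0, 0.5]     # large model logits
biased = [1.5, 0.2, 0.1]   # small expert fine-tuned on biased text
anti = [0.6, 0.9, 0.6]     # small expert fine-tuned on anti-biased text

probs, shift = debias_step(base, biased, anti, alpha=1.0)
for token, p, d in zip(vocab, probs, shift):
    print(f"{token:>5}: p={p:.3f} shift={d:+.3f}")
```

Under these assumptions only the two small experts need fine-tuning, and the `shift` values show which tokens the signal promotes or suppresses at each step. Setting `alpha=0` recovers the base model's distribution.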

Topics & Keywords

cs.CL

Citation

Tong, S., Zemour, E., Lu, J., Lohanimit, R., & Kagal, L. (2024). Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models. https://arxiv.org/abs/2412.01711

Publication Information

Publication Year: 2024
Language: en
Source Database: arXiv
Access: Open Access