arXiv Open Access 2025

Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability

Bianca Raimondi, Daniela Dalbagno, Maurizio Gabbrielli

Abstract

Large language models (LLMs) have been shown to internalize human-like biases during finetuning, yet the mechanisms by which these biases manifest remain unclear. In this work, we investigated whether the well-known Knobe effect, a moral bias in intentionality judgements, emerges in finetuned LLMs and whether it can be traced back to specific components of the model. We conducted a Layer-Patching analysis across three open-weight LLMs and demonstrated that the bias is not only learned during finetuning but also localized in a specific set of layers. Surprisingly, we found that patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect. Our findings offer new evidence that social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions, without the need for model retraining.
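Method sketch (illustrative): the Layer-Patching intervention described in the abstract amounts to running the finetuned model while overwriting the hidden states of a few chosen layers with activations recorded from the corresponding pretrained model on the same input. The sketch below shows one way to do this with PyTorch forward hooks on HuggingFace-style causal LMs; it is an assumption-laden illustration, not the paper's code. The checkpoint name (gpt2 stands in for both the pretrained and finetuned models), the patched layer indices, and the prompt are hypothetical placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2"          # placeholder for the pretrained checkpoint
TUNED = "gpt2"         # placeholder for its finetuned counterpart
PATCH_LAYERS = {4, 5}  # hypothetical "critical" layers to overwrite

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED).eval()

# Illustrative Knobe-effect-style prompt, not from the paper.
prompt = "The chairman harmed the environment. Did he do it intentionally?"
inputs = tok(prompt, return_tensors="pt")

# 1) Record the pretrained model's hidden states at every layer.
with torch.no_grad():
    base_out = base(**inputs, output_hidden_states=True)
base_hidden = base_out.hidden_states  # tuple: (embeddings, layer 1, ..., layer N)

# 2) Run the finetuned model, replacing the outputs of the chosen
#    layers with the pretrained activations via forward hooks.
def make_hook(layer_idx):
    def hook(module, args, output):
        patched = base_hidden[layer_idx + 1]  # +1 skips the embedding entry
        if isinstance(output, tuple):         # GPT-2 blocks return tuples
            return (patched,) + output[1:]
        return patched
    return hook

handles = [
    tuned.transformer.h[i].register_forward_hook(make_hook(i))
    for i in PATCH_LAYERS
]
with torch.no_grad():
    patched_out = tuned(**inputs)
for h in handles:
    h.remove()

Comparing the patched and unpatched next-token distributions (for example, the probability assigned to "Yes" versus "No" continuations) is one way to quantify whether overwriting those layers removes the bias.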

Citation

Raimondi, B., Dalbagno, D., & Gabbrielli, M. (2025). Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability. arXiv preprint arXiv:2510.12229. https://arxiv.org/abs/2510.12229

Publication Information
Year Published: 2025
Language: English (en)
Source Database: arXiv
Access: Open Access ✓