arXiv Open Access 2025

Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability

Bianca Raimondi, Daniela Dalbagno, Maurizio Gabbrielli

Abstract

Large language models (LLMs) have been shown to internalize human-like biases during finetuning, yet the mechanisms by which these biases manifest remain unclear. In this work, we investigated whether the well-known Knobe effect, a moral bias in intentionality judgements, emerges in finetuned LLMs and whether it can be traced back to specific components of the model. We conducted a Layer-Patching analysis across three open-weight LLMs and demonstrated that the bias is not only learned during finetuning but also localized in a specific set of layers. Surprisingly, we found that patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect. Our findings offer new evidence that social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions, without the need for model retraining.
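Method sketch (illustrative): the Layer-Patching intervention described in the abstract amounts to running the finetuned model while overwriting the hidden states of a few chosen layers with activations recorded from the corresponding pretrained model on the same input. The sketch below shows one way to do this with PyTorch forward hooks on HuggingFace-style causal LMs; it is an assumption-laden illustration, not the paper's code. The checkpoint name (gpt2 stands in for both the pretrained and finetuned models), the patched layer indices, and the prompt are hypothetical placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2"          # placeholder for the pretrained checkpoint
TUNED = "gpt2"         # placeholder for its finetuned counterpart
PATCH_LAYERS = {4, 5}  # hypothetical "critical" layers to overwrite

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED).eval()

# Illustrative Knobe-effect-style prompt, not from the paper.
prompt = "The chairman harmed the environment. Did he do it intentionally?"
inputs = tok(prompt, return_tensors="pt")

# 1) Record the pretrained model's hidden states at every layer.
with torch.no_grad():
    base_out = base(**inputs, output_hidden_states=True)
base_hidden = base_out.hidden_states  # tuple: (embeddings, layer 1, ..., layer N)

# 2) Run the finetuned model, replacing the outputs of the chosen
#    layers with the pretrained activations via forward hooks.
def make_hook(layer_idx):
    def hook(module, args, output):
        patched = base_hidden[layer_idx + 1]  # +1 skips the embedding entry
        if isinstance(output, tuple):         # GPT-2 blocks return tuples
            return (patched,) + output[1:]
        return patched
    return hook

handles = [
    tuned.transformer.h[i].register_forward_hook(make_hook(i))
    for i in PATCH_LAYERS
]
with torch.no_grad():
    patched_out = tuned(**inputs)
for h in handles:
    h.remove()

Comparing the patched and unpatched next-token distributions (for example, the probability assigned to "Yes" versus "No" continuations) is one way to quantify whether overwriting those layers removes the bias.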

Citation

Raimondi, B., Dalbagno, D., & Gabbrielli, M. (2025). Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability. arXiv preprint arXiv:2510.12229. https://arxiv.org/abs/2510.12229

Publication Information
Year Published: 2025
Language: English (en)
Source Database: arXiv
Access: Open Access ✓