arXiv Open Access 2025

Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning

Mahavir Dabas, Si Chen, Charles Fleming, Ming Jin, Ruoxi Jia

Abstract

Safety alignment is crucial for large language models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. We introduce ACTOR (Activation-Based Training for Over-Refusal Reduction), a robust and compute- and data-efficient training framework that minimizes over-refusals by leveraging internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model's ability to handle harmful queries and preserve overall utility.
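The abstract does not spell out how ACTOR extracts the refusal-related activation components, but a common way to operationalize "activation components that trigger refusals" is a difference-in-means direction between hidden states on harmful and benign prompts. The sketch below is purely illustrative: the activations are synthetic, and the dimensionality, sample counts, and `refusal_score` helper are assumptions, not the paper's method.

```python
import numpy as np

# Synthetic stand-ins for hidden states at one model layer.
# In practice these would come from forward passes over harmful
# and benign prompt sets; here the "harmful" cluster is shifted.
rng = np.random.default_rng(0)
d = 16
harmful_acts = rng.normal(0.0, 1.0, size=(32, d)) + 2.0
benign_acts = rng.normal(0.0, 1.0, size=(32, d))

# Difference-in-means "refusal direction" in activation space,
# normalized to unit length.
direction = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def refusal_score(act: np.ndarray) -> float:
    """Projection of an activation onto the refusal direction;
    a large value suggests the refusal mechanism will fire."""
    return float(act @ direction)

print(refusal_score(harmful_acts[0]) > refusal_score(benign_acts[0]))  # → True
```

Under this framing, reducing over-refusals amounts to shifting activations of benign-but-refused prompts along (the negative of) such a direction, which a targeted fine-tune of a single layer could learn to do.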


Citation

Dabas, M., Chen, S., Fleming, C., Jin, M., & Jia, R. (2025). Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning. https://arxiv.org/abs/2507.04250
