arXiv Open Access 2025

Disabling Self-Correction in Retrieval-Augmented Generation via Stealthy Retriever Poisoning

Yanbo Dai, Zhenlan Ji, Zongjie Li, Kuan Li, Shuai Wang

Abstract

Retrieval-Augmented Generation (RAG) has become a standard approach for improving the reliability of large language models (LLMs). Prior work demonstrates the vulnerability of RAG systems by poisoning the knowledge base to mislead them into generating attacker-chosen outputs. However, this paper shows that such attacks can be mitigated by the strong self-correction ability (SCA) of modern LLMs, which can reject false context when properly configured. This SCA poses a significant challenge for attackers aiming to manipulate RAG systems. In contrast to previous poisoning methods, which primarily target the knowledge base, we introduce DisarmRAG, a new poisoning paradigm that compromises the retriever itself to suppress the SCA and enforce attacker-chosen outputs. Compromising the retriever lets the attacker directly embed anti-SCA instructions into the context provided to the generator, thereby bypassing the SCA. To this end, we present a contrastive-learning-based model editing technique that performs localized and stealthy edits, ensuring that the retriever returns a malicious instruction only for specific victim queries while preserving benign retrieval behavior. To further strengthen the attack, we design an iterative co-optimization framework that automatically discovers robust instructions capable of bypassing prompt-based defenses. We extensively evaluate DisarmRAG across six LLMs and three QA benchmarks. Our results show near-perfect retrieval of malicious instructions, which successfully suppress the SCA and achieve attack success rates exceeding 90% under diverse defensive prompts. Moreover, the edited retriever remains stealthy under several detection methods, highlighting the urgent need for retriever-centric defenses.

Topics & Keywords

cs.CR cs.CL

Authors (5)

Yanbo Dai

Zhenlan Ji

Zongjie Li

Kuan Li

Shuai Wang

Citation Format

Dai, Y., Ji, Z., Li, Z., Li, K., & Wang, S. (2025). Disabling Self-Correction in Retrieval-Augmented Generation via Stealthy Retriever Poisoning. https://arxiv.org/abs/2508.20083

Journal Information

Publication Year: 2025
Language: en
Database Source: arXiv
Access: Open Access ✓