arXiv Open Access 2024

Turning Generative Models Degenerate: The Power of Data Poisoning Attacks

Shuli Jiang, Swanand Ravindra Kadhe, Yi Zhou, Farhan Ahmed, Ling Cai, +1 more

Abstract

The increasing use of large language models (LLMs) trained by third parties raises significant security concerns. In particular, malicious actors can introduce backdoors through poisoning attacks to generate undesirable outputs. While such attacks have been extensively studied in image domains and classification tasks, they remain underexplored for natural language generation (NLG) tasks. To address this gap, we conduct an investigation of various poisoning techniques targeting the LLM's fine-tuning phase via prefix-tuning, a Parameter Efficient Fine-Tuning (PEFT) method. We assess their effectiveness across two generative tasks: text summarization and text completion; and we also introduce new metrics to quantify the success and stealthiness of such NLG poisoning attacks. Through our experiments, we find that the prefix-tuning hyperparameters and trigger designs are the most crucial factors to influence attack success and stealthiness. Moreover, we demonstrate that existing popular defenses are ineffective against our poisoning attacks. Our study presents the first systematic approach to understanding poisoning attacks targeting NLG tasks during fine-tuning via PEFT across a wide range of triggers and attack settings. We hope our findings will aid the AI security community in developing effective defenses against such threats.

Topics & Keywords

cs.CR cs.AI

Authors (6)

Shuli Jiang

Swanand Ravindra Kadhe

Yi Zhou

Farhan Ahmed

Ling Cai

Nathalie Baracaldo

Citation (APA)

Jiang, S., Kadhe, S. R., Zhou, Y., Ahmed, F., Cai, L., & Baracaldo, N. (2024). Turning Generative Models Degenerate: The Power of Data Poisoning Attacks. arXiv. https://arxiv.org/abs/2407.12281

Publication Information

Publication Year: 2024
Language: en
Source Database: arXiv
Access: Open Access ✓