Semantic Scholar Open Access 2021 75 citations

How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?

Xinshuai Dong, Anh Tuan Luu, Min Lin, Shuicheng Yan, Hanwang Zhang

Abstract

Fine-tuning pre-trained language models has achieved great success in many NLP fields. Yet it is strikingly vulnerable to adversarial examples; for example, word substitution attacks using only synonyms can easily fool a BERT-based sentiment analysis model. In this paper, we demonstrate that adversarial training, the prevalent defense technique, does not directly fit a conventional fine-tuning scenario because it suffers severely from catastrophic forgetting: it fails to retain the generic and robust linguistic features already captured by the pre-trained model. In this light, we propose Robust Informative Fine-Tuning (RIFT), a novel adversarial fine-tuning method from an information-theoretical perspective. In particular, RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process, whereas conventional fine-tuning uses the pre-trained weights only for initialization. Experimental results show that RIFT consistently outperforms state-of-the-art methods on two popular NLP tasks, sentiment analysis and natural language inference, under different attacks across various pre-trained language models.

Citation Format

Dong, X., Luu, A.T., Lin, M., Yan, S., Zhang, H. (2021). How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness? https://www.semanticscholar.org/paper/a4f533f2b7d77b667e1f05b210924ec7c90cc5d1

Journal Information

Year Published: 2021
Language: en
Total Citations: 75
Source Database: Semantic Scholar
Access: Open Access