arXiv Open Access 2025

OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models

Ziheng Cheng Yixiao Huang Hui Xu Somayeh Sojoudi Xuandong Zhao +2 lainnya
Lihat Sumber

Abstrak

Text-to-Image (T2I) models have achieved remarkable success in generating visual content from text inputs. Although multiple safety alignment strategies have been proposed to prevent harmful outputs, they often lead to overly cautious behavior -- rejecting even benign prompts -- a phenomenon known as $\textit{over-refusal}$ that reduces the practical utility of T2I models. Despite over-refusal having been observed in practice, there is no large-scale benchmark that systematically evaluates this phenomenon for T2I models. In this paper, we present an automatic workflow to construct synthetic evaluation data, resulting in OVERT ($\textbf{OVE}$r-$\textbf{R}$efusal evaluation on $\textbf{T}$ext-to-image models), the first large-scale benchmark for assessing over-refusal behaviors in T2I models. OVERT includes 4,600 seemingly harmful but benign prompts across nine safety-related categories, along with 1,785 genuinely harmful prompts (OVERT-unsafe) to evaluate the safety-utility trade-off. Using OVERT, we evaluate several leading T2I models and find that over-refusal is a widespread issue across various categories (Figure 1), underscoring the need for further research to enhance the safety alignment of T2I models without compromising their functionality. As a preliminary attempt to reduce over-refusal, we explore prompt rewriting; however, we find it often compromises faithfulness to the meaning of the original prompts. Finally, we demonstrate the flexibility of our generation framework in accommodating diverse safety requirements by generating customized evaluation data adapting to user-defined policies.

Topik & Kata Kunci

Penulis (7)

Z

Ziheng Cheng

Y

Yixiao Huang

H

Hui Xu

S

Somayeh Sojoudi

X

Xuandong Zhao

D

Dawn Song

S

Song Mei

Format Sitasi

Cheng, Z., Huang, Y., Xu, H., Sojoudi, S., Zhao, X., Song, D. et al. (2025). OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models. https://arxiv.org/abs/2505.21347

Akses Cepat

Lihat di Sumber
Informasi Jurnal
Tahun Terbit
2025
Bahasa
en
Sumber Database
arXiv
Akses
Open Access ✓