arXiv Open Access 2025

From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training

Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, +2 others

Abstract

Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user's intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittleness for prompts with obscured user intent. Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be answered safely at a high level, but in some cases can lead to malicious uplift if sufficiently detailed or actionable. As an alternative, we propose safe-completions: a safety-training approach that centers on the safety of the assistant's output, rather than a binary classification of the user's intent. Safe-completions seek to maximize helpfulness within the safety policy's constraints. We incorporated this approach into GPT-5 and find that across both production comparisons and internally controlled experiments, safe-completion training improves safety (especially on dual-use prompts), reduces the severity of residual safety failures, and substantially increases model helpfulness.


Authors (7)

Yuan Yuan
Tina Sriskandarajah
Anna-Luisa Brakman
Alec Helyar
Alex Beutel
Andrea Vallone
Saachi Jain

Citation Format

Yuan, Y., Sriskandarajah, T., Brakman, A.-L., Helyar, A., Beutel, A., Vallone, A., & Jain, S. (2025). From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training. https://arxiv.org/abs/2508.09224

Journal Information
Year Published: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓