arXiv Open Access 2025

Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting

Yunzhen Feng Parag Jain Anthony Hartshorn Yaqi Duan Julia Kempe

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard recipe for improving large language models (LLMs) on reasoning tasks, with Group Relative Policy Optimization (GRPO) widely used in practice. Yet GRPO wastes substantial compute on negative groups: groups in which no sampled response is correct yield zero advantage and thus no gradient. We ask whether negative groups can be leveraged without extra supervision. Starting from a maximum-likelihood (MLE) objective in reward modeling, we show that the MLE gradient is equivalent to a policy gradient for a modified value function. This value function adds a confidence-weighted penalty on incorrect responses, imposing larger penalties on more confident mistakes. We refer to this as Likelihood Estimation with Negative Samples (LENS). LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations, making negative groups informative and converting previously wasted samples into useful gradient updates. On the MATH benchmark with Llama-3.1-8B and Qwen-2.5-3B, the proposed variant consistently outperforms the GRPO baseline, with significant gains on harder items. These results demonstrate a principled and practical way to "rescue" negative groups, improving efficiency and performance in RLVR.
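To make the failure mode concrete, the following sketch shows why a fully negative group carries no gradient under standard GRPO, and how a confidence-weighted penalty restores a signal. The function names, the exact penalty form (`-penalty_scale * confidence`), and the numeric confidences are illustrative assumptions, not the paper's precise value function.

```python
import math

def grpo_advantages(rewards):
    """Standard GRPO: advantage = (r - mean) / std within a group.
    If every response gets the same reward (e.g. all incorrect),
    the std is zero and every advantage is zero -> no gradient."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0:
        return [0.0] * len(rewards)  # negative group: wasted samples
    return [(r - mean) / std for r in rewards]

def lens_style_rewards(correct, confidences, penalty_scale=1.0):
    """Illustrative LENS-style reweighting (hypothetical form):
    incorrect responses receive a negative reward proportional to the
    model's confidence, so confident mistakes are penalized harder and
    all-negative groups still differ internally."""
    return [1.0 if c else -penalty_scale * conf
            for c, conf in zip(correct, confidences)]

# A fully negative group under plain GRPO: zero advantages everywhere.
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]

# With confidence-weighted penalties, rewards vary within the group,
# so the normalized advantages are non-zero and carry gradient.
correct = [False, False, False, False]
confidences = [0.9, 0.5, 0.2, 0.7]  # assumed self-confidence per response
print(grpo_advantages(lens_style_rewards(correct, confidences)))
```

The key design point the sketch mirrors is that LENS leaves correct responses' rewards untouched and only differentiates among incorrect ones, so previously zero-gradient groups now contribute updates that push probability mass away from confident errors first.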


Authors (5)

Yunzhen Feng

Parag Jain

Anthony Hartshorn

Yaqi Duan

Julia Kempe

Citation Format

Feng, Y., Jain, P., Hartshorn, A., Duan, Y., &amp; Kempe, J. (2025). Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting. https://arxiv.org/abs/2510.08696

Journal Information
Publication Year
2025
Language
en
Source Database
arXiv
Access
Open Access ✓