
Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training

Youssef Mroueh, Nicolas Dupuis, Brian Belgodere, Apoorva Nitsure, Mattia Rigotti, +4 others

Abstract

We revisit Group Relative Policy Optimization (GRPO) in both on-policy and off-policy optimization regimes. Our motivation comes from recent work on off-policy Proximal Policy Optimization (PPO), which improves training stability, sampling efficiency, and memory usage. In addition, a recent analysis of GRPO suggests that estimating the advantage function with off-policy samples could be beneficial. Building on these observations, we adapt GRPO to the off-policy setting. We show that both on-policy and off-policy GRPO objectives yield an improvement in the reward. This result motivates the use of clipped surrogate objectives in the off-policy version of GRPO. We then compare the empirical performance of reinforcement learning with verifiable rewards in post-training using both GRPO variants. Our results show that off-policy GRPO either significantly outperforms or performs on par with its on-policy counterpart.
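
For readers unfamiliar with the mechanics behind the abstract, the sketch below illustrates the two ingredients it refers to: the group-relative advantage estimate that gives GRPO its name, and the PPO-style clipped surrogate objective whose off-policy use the paper motivates. This is a minimal illustration of the standard recipe, not the paper's exact algorithm; the function names (`grpo_advantages`, `clipped_surrogate`) and the clipping parameter `eps = 0.2` are assumptions for this example.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: standardize each reward within its group.

    rewards: (num_groups, group_size) verifiable rewards, one group of
    sampled completions per prompt. GRPO replaces a learned value baseline
    with these per-group statistics.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)  # epsilon avoids division by zero

def clipped_surrogate(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate, returned as a loss to minimize.

    In the on-policy setting, logp_old is the current policy's log-probability
    at sampling time; in the off-policy setting it comes from the behavior
    policy that generated the samples.
    """
    ratio = torch.exp(logp_new - logp_old)              # importance weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()    # pessimistic bound
```

The clipping step is what the abstract's reward-improvement result motivates: bounding the importance ratio keeps each update conservative even when the samples were drawn from a stale (off-policy) policy.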


Authors (9)

Youssef Mroueh
Nicolas Dupuis
Brian Belgodere
Apoorva Nitsure
Mattia Rigotti
Kristjan Greenewald
Jiri Navratil
Jerret Ross
Jesus Rios

Citation Format

Mroueh, Y., Dupuis, N., Belgodere, B., Nitsure, A., Rigotti, M., Greenewald, K., Navratil, J., Ross, J., & Rios, J. (2025). Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training. arXiv:2505.22257. https://arxiv.org/abs/2505.22257

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓