arXiv Open Access 2025

Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs

Qibin Wang, Pu Zhao, Shaohan Huang, Fangkai Yang, Lu Wang, +4 others

Abstract

To further enhance the ability of Large Language Models (LLMs) to solve complex, multi-step reasoning problems, test-time scaling (TTS) methods have gained widespread attention. Existing approaches such as Best-of-N and majority voting are limited because their performance depends on the quality of the candidate responses, making them unable to produce a correct solution when all candidates are incorrect. Introducing an additional model to select the best response also incurs significant deployment costs. To this end, we introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework in which a unified model first generates a set of candidate responses in parallel and then performs self-refinement, synthesizing a new, superior solution from a prompt consisting of the problem and these candidates. However, LLMs struggle to perform refinement effectively when prompted directly. Therefore, we design a hybrid training pipeline that jointly optimizes two complementary objectives: solving problems directly and refining candidate responses. Experimental results demonstrate that our method achieves state-of-the-art performance across five mathematical benchmarks. We further show that this learned self-refinement skill is a model-agnostic enhancement, robust across different model scales and generalizing to out-of-distribution reasoning tasks.
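The abstract describes a two-stage inference procedure: sample several candidate solutions in parallel, then feed the problem together with all candidates back to the same model for a refinement pass. The following is a minimal Python sketch of that loop; the prompt template, the generate() helper, and the candidate count are illustrative assumptions, not the authors' actual implementation.

    # Sketch of the GSR inference loop described in the abstract.
    # All names here (generate, REFINE_TEMPLATE, gsr_solve) are hypothetical.

    def generate(prompt: str, n: int = 1) -> list[str]:
        """Placeholder for a batched LLM sampling call; swap in a real backend."""
        raise NotImplementedError

    REFINE_TEMPLATE = (
        "Problem:\n{problem}\n\n"
        "Candidate solutions:\n{candidates}\n\n"
        "Review the candidates, identify any errors, and synthesize a single "
        "refined solution."
    )

    def gsr_solve(problem: str, n_candidates: int = 4) -> str:
        # Stage 1: the unified model drafts several solutions in parallel.
        candidates = generate(problem, n=n_candidates)
        # Stage 2: the same model refines, conditioning on the problem
        # and all drafts, rather than merely selecting among them.
        numbered = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
        prompt = REFINE_TEMPLATE.format(problem=problem, candidates=numbered)
        return generate(prompt, n=1)[0]

Because the refinement stage synthesizes a new answer rather than picking one, this setup can, per the abstract, recover a correct solution even when every candidate is wrong, without deploying a separate selector model.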

Authors (9)

Qibin Wang, Pu Zhao, Shaohan Huang, Fangkai Yang, Lu Wang, Furu Wei, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Citation Format

Wang, Q., Zhao, P., Huang, S., Yang, F., Wang, L., Wei, F., et al. (2025). Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs. arXiv. https://arxiv.org/abs/2509.00084

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓