
CARE-RFT: Confidence-Anchored Reinforcement Finetuning for Reliable Reasoning in Large Language Models

Shuozhe Li, Jincheng Cao, Bodun Hu, Aryan Mokhtari, Leqi Liu, Amy Zhang

Abstract

Reinforcement finetuning (RFT) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, we identify a critical trade-off: while unconstrained RFT achieves strong reasoning performance, it severely compromises model trustworthiness by amplifying hallucination and worsening calibration; conversely, reverse-KL-constrained (RKL-constrained) RFT preserves trustworthiness but limits reasoning gains due to its unbounded penalty on exploratory deviations. To resolve this tension, we introduce CARE-RFT (Confidence-Anchored Regularized Reinforcement Finetuning), a novel method that replaces standard reverse KL regularization with a skew reverse KL divergence. CARE-RFT provides a confidence-sensitive penalty: it is bounded for confident, consistently rewarded explorations to enable reasoning, while remaining unbounded elsewhere to preserve calibration. Extensive experiments across multiple model scales and RFT algorithms show that CARE-RFT achieves a superior balance, matching the reasoning performance of unconstrained RFT while recovering the trustworthiness and calibration of the base model. Our work establishes that careful, confidence-aware regularization is key to building both capable and trustworthy reasoning models.
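The abstract specifies the regularizer only at a high level, so the sketch below is illustrative rather than the paper's actual formulation: it implements a generic skew reverse KL, KL(pi || (1 - lam) * pi + lam * pi_ref), which is bounded above by -log(1 - lam) for lam < 1 and recovers the unbounded reverse KL at lam = 1, together with a hypothetical per-token confidence gate that gives confident tokens the bounded penalty and lets uncertain tokens fall back to the plain reverse KL. The function names, the gate, and the lam_min default are assumptions, and the "consistently rewarded" component of the anchor is omitted for brevity.

# Illustrative sketch only; not CARE-RFT's published formulation.
import torch

def confidence_gate(logp_policy, lam_min=0.5):
    # Hypothetical gate: map per-token policy confidence (max probability)
    # to a skew weight in [lam_min, 1). High confidence -> small lam
    # (tightly bounded penalty, exploration allowed); low confidence ->
    # lam near 1 (plain reverse KL, calibration preserved).
    conf = logp_policy.exp().max(dim=-1, keepdim=True).values  # (..., 1)
    return 1.0 - (1.0 - lam_min) * conf

def skew_reverse_kl(logp_policy, logp_ref, lam):
    # KL(pi || (1 - lam) * pi + lam * pi_ref), computed per token in
    # log-space. logp_policy, logp_ref: log-probabilities over the
    # vocabulary, shape (..., vocab); lam: tensor broadcastable to
    # (..., 1). Since the mixture is at least (1 - lam) * pi, the
    # penalty is bounded by -log(1 - lam) when lam < 1; at lam = 1 the
    # mixture collapses to pi_ref and the standard reverse KL returns.
    log_mix = torch.logsumexp(
        torch.stack([
            logp_policy + torch.log1p(-lam),  # log((1 - lam) * pi)
            logp_ref + torch.log(lam),        # log(lam * pi_ref)
        ]),
        dim=0,
    )
    p = logp_policy.exp()
    return (p * (logp_policy - log_mix)).sum(dim=-1)

# Toy usage with arbitrary shapes: batch 2, sequence length 4, vocab 8.
torch.manual_seed(0)
logp_policy = torch.log_softmax(torch.randn(2, 4, 8), dim=-1)
logp_ref = torch.log_softmax(torch.randn(2, 4, 8), dim=-1)
lam = confidence_gate(logp_policy)                     # shape (2, 4, 1)
penalty = skew_reverse_kl(logp_policy, logp_ref, lam)  # shape (2, 4)

In an actual RFT loop, a per-token penalty of this kind would stand in for the usual reverse KL term in the objective; the gating rule above is one guess at how a confidence anchor could behave, not the paper's definition.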

Authors (6)

Shuozhe Li
Jincheng Cao
Bodun Hu
Aryan Mokhtari
Leqi Liu
Amy Zhang

Citation Format

Li, S., Cao, J., Hu, B., Mokhtari, A., Liu, L., & Zhang, A. (2026). CARE-RFT: Confidence-Anchored Reinforcement Finetuning for Reliable Reasoning in Large Language Models. arXiv. https://arxiv.org/abs/2602.00085

Journal Information

Publication Year: 2026
Language: en
Source Database: arXiv
Access: Open Access ✓