
CARE-RFT: Confidence-Anchored Reinforcement Finetuning for Reliable Reasoning in Large Language Models

Shuozhe Li, Jincheng Cao, Bodun Hu, Aryan Mokhtari, Leqi Liu, Amy Zhang

Abstract

Reinforcement finetuning (RFT) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, we identify a critical trade-off: while unconstrained RFT achieves strong reasoning performance, it severely compromises model trustworthiness by amplifying hallucination and worsening calibration; conversely, reverse-KL-constrained (RKL-constrained) RFT preserves trustworthiness but limits reasoning gains due to its unbounded penalty on exploratory deviations. To resolve this tension, we introduce CARE-RFT (Confidence-Anchored Regularized Reinforcement Finetuning), a novel method that replaces standard reverse KL regularization with a skew reverse KL divergence. CARE-RFT provides a confidence-sensitive penalty: it is bounded for confident, consistently rewarded explorations to enable reasoning, while remaining unbounded elsewhere to preserve calibration. Extensive experiments across multiple model scales and RFT algorithms show that CARE-RFT achieves a superior balance, matching the reasoning performance of unconstrained RFT while recovering the trustworthiness and calibration of the base model. Our work establishes that careful, confidence-aware regularization is key to building both capable and trustworthy reasoning models.
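The abstract specifies the regularizer only at a high level, so the sketch below is illustrative rather than the paper's actual formulation: it implements a generic skew reverse KL, KL(pi || (1 - lam) * pi + lam * pi_ref), which is bounded above by -log(1 - lam) for lam < 1 and recovers the unbounded reverse KL at lam = 1, together with a hypothetical per-token confidence gate that gives confident tokens the bounded penalty and lets uncertain tokens fall back to the plain reverse KL. The function names, the gate, and the lam_min default are assumptions, and the "consistently rewarded" component of the anchor is omitted for brevity.

# Illustrative sketch only; not CARE-RFT's published formulation.
import torch

def confidence_gate(logp_policy, lam_min=0.5):
    # Hypothetical gate: map per-token policy confidence (max probability)
    # to a skew weight in [lam_min, 1). High confidence -> small lam
    # (tightly bounded penalty, exploration allowed); low confidence ->
    # lam near 1 (plain reverse KL, calibration preserved).
    conf = logp_policy.exp().max(dim=-1, keepdim=True).values  # (..., 1)
    return 1.0 - (1.0 - lam_min) * conf

def skew_reverse_kl(logp_policy, logp_ref, lam):
    # KL(pi || (1 - lam) * pi + lam * pi_ref), computed per token in
    # log-space. logp_policy, logp_ref: log-probabilities over the
    # vocabulary, shape (..., vocab); lam: tensor broadcastable to
    # (..., 1). Since the mixture is at least (1 - lam) * pi, the
    # penalty is bounded by -log(1 - lam) when lam < 1; at lam = 1 the
    # mixture collapses to pi_ref and the standard reverse KL returns.
    log_mix = torch.logsumexp(
        torch.stack([
            logp_policy + torch.log1p(-lam),  # log((1 - lam) * pi)
            logp_ref + torch.log(lam),        # log(lam * pi_ref)
        ]),
        dim=0,
    )
    p = logp_policy.exp()
    return (p * (logp_policy - log_mix)).sum(dim=-1)

# Toy usage with arbitrary shapes: batch 2, sequence length 4, vocab 8.
torch.manual_seed(0)
logp_policy = torch.log_softmax(torch.randn(2, 4, 8), dim=-1)
logp_ref = torch.log_softmax(torch.randn(2, 4, 8), dim=-1)
lam = confidence_gate(logp_policy)                     # shape (2, 4, 1)
penalty = skew_reverse_kl(logp_policy, logp_ref, lam)  # shape (2, 4)

In an actual RFT loop, a per-token penalty of this kind would stand in for the usual reverse KL term in the objective; the gating rule above is one guess at how a confidence anchor could behave, not the paper's definition.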

Authors (6)

Shuozhe Li
Jincheng Cao
Bodun Hu
Aryan Mokhtari
Leqi Liu
Amy Zhang

Citation Format

Li, S., Cao, J., Hu, B., Mokhtari, A., Liu, L., & Zhang, A. (2026). CARE-RFT: Confidence-Anchored Reinforcement Finetuning for Reliable Reasoning in Large Language Models. arXiv. https://arxiv.org/abs/2602.00085

Journal Information

Publication Year: 2026
Language: en
Source Database: arXiv
Access: Open Access ✓