
Co-Evolution of Policy and Internal Reward for Language Agents

Xinyu Wang Hanwei Wu Jingwei Song Shuyuan Zhang Jiayi Zhang +6 more

Abstract

Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer its next action during inference, and converts the same signal into a step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: a better policy produces better guidance, and better guidance further improves the policy as an internal reward. Across three agent benchmarks, inference-time self-guidance alone already yields clear gains, while jointly evolving the policy and internal reward with GRPO brings a further 8% improvement over baselines trained solely with the environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.
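To make the loop concrete, below is a minimal runnable sketch of how such a co-evolving loop could look. Only the overall structure follows the abstract (a short guidance signal emitted before each action, the same signal reused as a step-level internal reward, and group-relative GRPO-style advantages); the mock LLM, the guidance-scoring heuristic, the mixing weight `beta`, and all function names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the Self-Guide co-evolving loop. The mock policy,
# toy environment, and guidance-scoring heuristic are assumptions made so
# the example runs end to end; they are not the paper's method.
import random
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Step:
    guidance: str           # short self-guidance emitted before acting
    action: str             # action chosen conditioned on the guidance
    internal_reward: float  # step-level reward derived from the guidance


@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    env_reward: float = 0.0  # sparse, delayed environment reward


def mock_llm(prompt: str) -> str:
    # Stand-in for an LLM call so the sketch is self-contained.
    return random.choice(["explore", "exploit", "verify"])


def rollout(env_step, horizon: int = 4, beta: float = 0.1) -> Trajectory:
    traj = Trajectory()
    obs = "start"
    for _ in range(horizon):
        # Inference-time self-guidance steers the next action.
        guidance = mock_llm(f"What should I do next given: {obs}?")
        action = mock_llm(f"Guidance: {guidance}. Observation: {obs}. Act:")
        # The same signal becomes a step-level internal reward; here a toy
        # heuristic (did the action follow the guidance?) scaled by beta.
        internal = beta * float(action == guidance)
        traj.steps.append(Step(guidance, action, internal))
        obs, done, traj.env_reward = env_step(action)
        if done:
            break
    return traj


def grpo_advantages(group: list[Trajectory]) -> list[float]:
    # Group-relative advantages in the spirit of GRPO: each trajectory's
    # total (environment + internal) reward, normalized within the group.
    totals = [t.env_reward + sum(s.internal_reward for s in t.steps)
              for t in group]
    mu, sigma = mean(totals), pstdev(totals) or 1.0
    return [(x - mu) / sigma for x in totals]


def toy_env_step(action: str):
    # Sparse reward: the episode only pays off when it ends on "verify".
    done = random.random() < 0.3
    return "obs", done, (1.0 if done and action == "verify" else 0.0)


if __name__ == "__main__":
    group = [rollout(toy_env_step) for _ in range(8)]
    print(grpo_advantages(group))
```

The co-evolution claim corresponds to iterating this loop: updating the policy on the mixed advantages improves the guidance it emits, which in turn densifies and sharpens the internal reward used for the next update.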

Topics & Keywords

Authors (11)

Xinyu Wang
Hanwei Wu
Jingwei Song
Shuyuan Zhang
Jiayi Zhang
Fanqi Kong
Tung Sum Thomas Kwok
Xiao-Wen Chang
Yuyu Luo
Chenglin Wu
Bang Liu

Citation Format

Wang, X., Wu, H., Song, J., Zhang, S., Zhang, J., Kong, F. et al. (2026). Co-Evolution of Policy and Internal Reward for Language Agents. https://arxiv.org/abs/2604.03098

Journal Information
Publication Year
2026
Language
en
Source Database
arXiv
Access
Open Access ✓