
Co-Evolution of Policy and Internal Reward for Language Agents

Xinyu Wang Hanwei Wu Jingwei Song Shuyuan Zhang Jiayi Zhang +6 more

Abstract

Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer its next action during inference, and converts the same signal into a step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: a better policy produces better guidance, and better guidance further improves the policy as an internal reward. Across three agent benchmarks, inference-time self-guidance alone already yields clear gains, while jointly evolving the policy and internal reward with GRPO brings a further 8% improvement over baselines trained solely with the environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.
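To make the loop concrete, below is a minimal runnable sketch of how such a co-evolving loop could look. Only the overall structure follows the abstract (a short guidance signal emitted before each action, the same signal reused as a step-level internal reward, and group-relative GRPO-style advantages); the mock LLM, the guidance-scoring heuristic, the mixing weight `beta`, and all function names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the Self-Guide co-evolving loop. The mock policy,
# toy environment, and guidance-scoring heuristic are assumptions made so
# the example runs end to end; they are not the paper's method.
import random
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Step:
    guidance: str           # short self-guidance emitted before acting
    action: str             # action chosen conditioned on the guidance
    internal_reward: float  # step-level reward derived from the guidance


@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    env_reward: float = 0.0  # sparse, delayed environment reward


def mock_llm(prompt: str) -> str:
    # Stand-in for an LLM call so the sketch is self-contained.
    return random.choice(["explore", "exploit", "verify"])


def rollout(env_step, horizon: int = 4, beta: float = 0.1) -> Trajectory:
    traj = Trajectory()
    obs = "start"
    for _ in range(horizon):
        # Inference-time self-guidance steers the next action.
        guidance = mock_llm(f"What should I do next given: {obs}?")
        action = mock_llm(f"Guidance: {guidance}. Observation: {obs}. Act:")
        # The same signal becomes a step-level internal reward; here a toy
        # heuristic (did the action follow the guidance?) scaled by beta.
        internal = beta * float(action == guidance)
        traj.steps.append(Step(guidance, action, internal))
        obs, done, traj.env_reward = env_step(action)
        if done:
            break
    return traj


def grpo_advantages(group: list[Trajectory]) -> list[float]:
    # Group-relative advantages in the spirit of GRPO: each trajectory's
    # total (environment + internal) reward, normalized within the group.
    totals = [t.env_reward + sum(s.internal_reward for s in t.steps)
              for t in group]
    mu, sigma = mean(totals), pstdev(totals) or 1.0
    return [(x - mu) / sigma for x in totals]


def toy_env_step(action: str):
    # Sparse reward: the episode only pays off when it ends on "verify".
    done = random.random() < 0.3
    return "obs", done, (1.0 if done and action == "verify" else 0.0)


if __name__ == "__main__":
    group = [rollout(toy_env_step) for _ in range(8)]
    print(grpo_advantages(group))
```

The co-evolution claim corresponds to iterating this loop: updating the policy on the mixed advantages improves the guidance it emits, which in turn densifies and sharpens the internal reward used for the next update.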

Topics & Keywords

Authors (11)

Xinyu Wang
Hanwei Wu
Jingwei Song
Shuyuan Zhang
Jiayi Zhang
Fanqi Kong
Tung Sum Thomas Kwok
Xiao-Wen Chang
Yuyu Luo
Chenglin Wu
Bang Liu

Citation Format

Wang, X., Wu, H., Song, J., Zhang, S., Zhang, J., Kong, F. et al. (2026). Co-Evolution of Policy and Internal Reward for Language Agents. https://arxiv.org/abs/2604.03098

Journal Information
Publication Year
2026
Language
en
Source Database
arXiv
Access
Open Access ✓