arXiv Open Access 2025

Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation

Yaqi Li, Peng Chen, Mingyang Han, Pi Bu, Haoxiang Shi, +5 others

Abstract

Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.
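The credit-assignment problem described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the three stage names come from the abstract, while the `Trajectory` class, the function names, and the numeric scores are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Stage names taken from the abstract; everything else is hypothetical.
STAGES = ("semantic_reasoning", "process_refining", "outcome_evaluation")


@dataclass
class Trajectory:
    # Hypothetical per-stage quality scores in [0, 1], keyed by stage name.
    stage_scores: dict


def final_only_rewards(traj: Trajectory) -> dict:
    """Reward only the last stage; earlier stages receive 0."""
    return {
        s: (traj.stage_scores[s] if s == STAGES[-1] else 0.0)
        for s in STAGES
    }


def stage_aware_rewards(traj: Trajectory) -> dict:
    """Give each stage its own immediate reward."""
    return {s: traj.stage_scores[s] for s in STAGES}


# Two made-up trajectories: same final outcome, different intermediate quality.
a = Trajectory({"semantic_reasoning": 0.9, "process_refining": 0.2,
                "outcome_evaluation": 0.5})
b = Trajectory({"semantic_reasoning": 0.2, "process_refining": 0.9,
                "outcome_evaluation": 0.5})

# Final-only guidance cannot tell the two apart.
same_under_final_only = final_only_rewards(a) == final_only_rewards(b)
# Stage-aware guidance exposes which stage helped or hurt.
differ_under_stage_aware = stage_aware_rewards(a) != stage_aware_rewards(b)
```

In this sketch both trajectories receive an identical signal under final-only rewards, whereas stage-aware rewards distinguish them. That is the kind of ambiguity the abstract attributes to monolithic final-only guidance.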

Topics & Keywords

cs.CV

Authors (10)

Yaqi Li

Peng Chen

Mingyang Han

Pi Bu

Haoxiang Shi

Runzhou Zhao

Yang Yao

Xuan Zhang

Jun Song

Bo Zheng

Citation Format

Li, Y., Chen, P., Han, M., Bu, P., Shi, H., Zhao, R. et al. (2025). Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation. https://arxiv.org/abs/2508.18032

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓