arXiv Open Access 2026

Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

Yuxuan Yao, Yuxuan Chen, Hui Li, Kaihui Cheng, Qipeng Guo, +4 others

Abstract

Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch is progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text-image generation quality.
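The abstract does not state the exact reinjection rule, so the following is only a toy sketch of the general idea: cache the text-branch representation from an early layer and add it back before selected later layers. The additive blend, the `alpha` weight, the names `source_layer` and `target_layers`, and the halving "layers" that stand in for a fading prompt signal are all assumptions made for illustration, not the authors' method.

```python
from typing import Callable, List, Optional

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_text_branch(
    prompt: Vector,
    layers: List[Layer],
    source_layer: int,
    target_layers: List[int],
    alpha: float,
) -> Vector:
    """Run a toy text branch with a hypothetical additive reinjection.

    Assumed rule (not from the paper): before each layer listed in
    target_layers, set hidden <- hidden + alpha * cached, where cached is
    the output of layers[source_layer].
    """
    hidden = prompt
    cached: Optional[Vector] = None
    for index, layer in enumerate(layers):
        # Reinject the cached early representation before a target layer.
        if cached is not None and index in target_layers:
            hidden = [h + alpha * c for h, c in zip(hidden, cached)]
        hidden = layer(hidden)
        # Cache the early-layer output once it has been computed.
        if index == source_layer:
            cached = list(hidden)
    return hidden


def halve(vector: Vector) -> Vector:
    """Toy layer that shrinks the signal, standing in for 'forgetting'."""
    return [0.5 * x for x in vector]


toy_layers: List[Layer] = [halve, halve, halve, halve]
toy_prompt: Vector = [8.0, 4.0]

# alpha = 0.0 disables reinjection and gives the plain forward pass.
baseline = run_text_branch(toy_prompt, toy_layers, 0, [2, 3], alpha=0.0)
reinjected = run_text_branch(toy_prompt, toy_layers, 0, [2, 3], alpha=1.0)

print(baseline)    # [0.5, 0.25]
print(reinjected)  # [3.5, 1.75]
```

In this toy setting the reinjected output keeps more of the original prompt signal than the baseline, and no parameters are trained, which mirrors the training-free framing. In a real MMDiT the layers would be transformer blocks and the blend would likely need tuning per model.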

Topics & Keywords

cs.CV

Authors (9)

Yuxuan Yao

Yuxuan Chen

Hui Li

Kaihui Cheng

Qipeng Guo

Yuwei Sun

Zilong Dong

Jingdong Wang

Siyu Zhu

Citation Format

Yao, Y., Chen, Y., Li, H., Cheng, K., Guo, Q., Sun, Y. et al. (2026). Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers. https://arxiv.org/abs/2602.06886

Journal Information

Publication Year: 2026
Language: en
Source Database: arXiv
Access: Open Access