arXiv Open Access 2026

Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

Yuxuan Yao, Yuxuan Chen, Hui Li, Kaihui Cheng, Qipeng Guo, +4 others

Abstract

Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch are progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the representations across layers of the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text-image generation quality.
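The core mechanism the abstract describes, caching a prompt representation from an early layer of the text branch and blending it back into later layers, can be sketched in miniature. The toy blocks, layer indices, and blend weight `alpha` below are illustrative assumptions, not the paper's actual MMDiT configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 8, 6                       # toy hidden size and number of layers
cache_layer, reinject_from, alpha = 1, 4, 0.5  # assumed hyperparameters

# Stand-in for the text-branch blocks: random linear maps with a residual.
weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(depth)]

def block(h, W):
    # Placeholder for one transformer block in the text branch.
    return h + np.tanh(h @ W)

def text_branch(tokens, reinject=True):
    h, cached = tokens, None
    for i, W in enumerate(weights):
        if reinject and cached is not None and i >= reinject_from:
            # Training-free reinjection: blend the cached early-layer prompt
            # representation back in to counteract forgetting at depth.
            h = (1 - alpha) * h + alpha * cached
        h = block(h, W)
        if i == cache_layer:
            cached = h                  # cache the early prompt representation
    return h

prompt = rng.standard_normal((4, dim))  # 4 text tokens
out = text_branch(prompt)
print(out.shape)                        # (4, 8)
```

Because reinjection only reroutes activations already computed in the forward pass, no weights are modified, which is what makes the approach training-free.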


Authors (9)

Yuxuan Yao, Yuxuan Chen, Hui Li, Kaihui Cheng, Qipeng Guo, Yuwei Sun, Zilong Dong, Jingdong Wang, Siyu Zhu

Citation

Yao, Y., Chen, Y., Li, H., Cheng, K., Guo, Q., Sun, Y., et al. (2026). Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers. https://arxiv.org/abs/2602.06886

Journal Information
Publication Year
2026
Language
en
Source Database
arXiv
Access
Open Access ✓