arXiv Open Access 2026

Diffusion-State Policy Optimization for Masked Diffusion Language Models

Daisuke Oba Hiroki Furuta Naoaki Okazaki

Abstract

Masked diffusion language models generate by iteratively filling masked tokens over multiple denoising steps, so learning only from a terminal reward on the final completion yields coarse credit assignment over intermediate decisions. We propose DiSPO (Diffusion-State Policy Optimization), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling fillings for the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens -- without additional multi-step diffusion rollouts. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that can be combined with terminal-feedback policy optimization using the same rollouts. On LLaDA-8B-Instruct, DiSPO consistently improves over the terminal-feedback diffu-GRPO baseline on math and planning benchmarks under matched rollout compute and optimizer steps. Our code will be available at https://daioba.github.io/dispo .
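To make the branching idea concrete, here is a minimal toy sketch of a single DiSPO-style branch step, based only on the abstract above: at one fixed intermediate masked state, fillings are resampled from rollout-cached logits, the branched completions are scored, and a REINFORCE-style update with a group-mean baseline (GRPO-flavoured) is applied only to the newly filled tokens. All function names, shapes, and the exact update rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dispo_branch_update(cached_logits, reward_fn, lr=0.1, k=4):
    """Toy DiSPO-style branch step at one fixed intermediate state.

    cached_logits: (M, V) array -- hypothetical rollout-cached logits for
        the M currently masked positions over a vocabulary of size V.
    reward_fn: scores one branched completion (a list of M token ids).

    Resamples k branched fillings from the cached logits, scores them,
    and applies a policy-gradient update with a group-mean baseline only
    to the newly filled tokens (no extra multi-step diffusion rollouts).
    """
    M, V = cached_logits.shape
    # softmax over the cached logits, one distribution per masked position
    probs = np.exp(cached_logits - cached_logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)

    samples, rewards = [], []
    for _ in range(k):
        toks = [rng.choice(V, p=probs[i]) for i in range(M)]
        samples.append(toks)
        rewards.append(reward_fn(toks))

    baseline = np.mean(rewards)  # group baseline over the k branches
    grads = np.zeros_like(cached_logits)
    for toks, r in zip(samples, rewards):
        adv = r - baseline
        for i, t in enumerate(toks):
            # gradient of log softmax w.r.t. logits: onehot(t) - probs[i]
            g = -probs[i].copy()
            g[t] += 1.0
            grads[i] += adv * g
    return cached_logits + lr * grads / k
```

Note the invariant this sketch shares with group-baseline methods: if every branch receives the same reward, the advantage is zero and the cached logits are left unchanged.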


Authors (3)

Daisuke Oba

Hiroki Furuta

Naoaki Okazaki

Citation Format

Oba, D., Furuta, H., & Okazaki, N. (2026). Diffusion-State Policy Optimization for Masked Diffusion Language Models. https://arxiv.org/abs/2602.06462

Journal Information
Publication Year
2026
Language
en
Source Database
arXiv
Access
Open Access ✓