arXiv Open Access 2026

Diffusion-State Policy Optimization for Masked Diffusion Language Models

Daisuke Oba Hiroki Furuta Naoaki Okazaki

Abstract

Masked diffusion language models generate by iteratively filling masked tokens over multiple denoising steps, so learning only from a terminal reward on the final completion yields coarse credit assignment over intermediate decisions. We propose DiSPO (Diffusion-State Policy Optimization), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling fillings for the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens -- without additional multi-step diffusion rollouts. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that can be combined with terminal-feedback policy optimization using the same rollouts. On LLaDA-8B-Instruct, DiSPO consistently improves over the terminal-feedback diffu-GRPO baseline on math and planning benchmarks under matched rollout compute and optimizer steps. Our code will be available at https://daioba.github.io/dispo .
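To make the branching idea concrete, here is a minimal toy sketch of a single DiSPO-style branch step, based only on the abstract above: at one fixed intermediate masked state, fillings are resampled from rollout-cached logits, the branched completions are scored, and a REINFORCE-style update with a group-mean baseline (GRPO-flavoured) is applied only to the newly filled tokens. All function names, shapes, and the exact update rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dispo_branch_update(cached_logits, reward_fn, lr=0.1, k=4):
    """Toy DiSPO-style branch step at one fixed intermediate state.

    cached_logits: (M, V) array -- hypothetical rollout-cached logits for
        the M currently masked positions over a vocabulary of size V.
    reward_fn: scores one branched completion (a list of M token ids).

    Resamples k branched fillings from the cached logits, scores them,
    and applies a policy-gradient update with a group-mean baseline only
    to the newly filled tokens (no extra multi-step diffusion rollouts).
    """
    M, V = cached_logits.shape
    # softmax over the cached logits, one distribution per masked position
    probs = np.exp(cached_logits - cached_logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)

    samples, rewards = [], []
    for _ in range(k):
        toks = [rng.choice(V, p=probs[i]) for i in range(M)]
        samples.append(toks)
        rewards.append(reward_fn(toks))

    baseline = np.mean(rewards)  # group baseline over the k branches
    grads = np.zeros_like(cached_logits)
    for toks, r in zip(samples, rewards):
        adv = r - baseline
        for i, t in enumerate(toks):
            # gradient of log softmax w.r.t. logits: onehot(t) - probs[i]
            g = -probs[i].copy()
            g[t] += 1.0
            grads[i] += adv * g
    return cached_logits + lr * grads / k
```

Note the invariant this sketch shares with group-baseline methods: if every branch receives the same reward, the advantage is zero and the cached logits are left unchanged.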


Authors (3)

Daisuke Oba

Hiroki Furuta

Naoaki Okazaki

Citation Format

Oba, D., Furuta, H., & Okazaki, N. (2026). Diffusion-State Policy Optimization for Masked Diffusion Language Models. https://arxiv.org/abs/2602.06462

Journal Information
Publication Year
2026
Language
en
Source Database
arXiv
Access
Open Access ✓