arXiv Open Access 2026

Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

Shaofeng Yin Jiaxin Ge Zora Zhiruo Wang Chenyang Wang Xiuyu Li +4 others

Abstract

Vision-as-inverse-graphics, the concept of reconstructing images into editable programs, remains challenging for Vision-Language Models (VLMs), which inherently lack fine-grained spatial grounding in one-shot settings. To address this, we introduce VIGA (Vision-as-Inverse-Graphics Agent), an interleaved multimodal reasoning framework where symbolic logic and visual perception actively cross-verify each other. VIGA operates through a tightly coupled code-render-inspect loop: synthesizing symbolic programs, projecting them into visual states, and inspecting discrepancies to guide iterative edits. Equipped with high-level semantic skills and an evolving multimodal memory, VIGA sustains evidence-based modifications over long horizons. This training-free, task-agnostic framework seamlessly supports 2D document generation, 3D reconstruction, multi-step 3D editing, and 4D physical interaction. Finally, we introduce BlenderBench, a challenging visual-to-code benchmark. Empirically, VIGA substantially improves accuracy compared with one-shot baselines in BlenderGym (35.32%), SlideBench (117.17%) and our proposed BlenderBench (124.70%).
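The code-render-inspect loop the abstract describes can be sketched as a simple iterative agent: synthesize a symbolic program, project it into a visual state, measure the discrepancy against the target, and apply evidence-based edits until they agree. This is a minimal illustrative sketch only; all names here (`AgentState`, `render`, `inspect`, `viga_loop`) and the set-based "renderer" are assumptions for exposition, not the paper's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Evolving memory: the current program plus per-step inspection scores."""
    program: list = field(default_factory=list)
    history: list = field(default_factory=list)


def render(program):
    """Stand-in renderer: project the symbolic program into a 'visual' state.

    A real system would rasterize the program (e.g. via a graphics engine);
    here the visual state is just the set of scene elements the program emits.
    """
    return set(program)


def inspect(rendered, target):
    """Compare the render against the target; return a score and the diffs."""
    missing, extra = target - rendered, rendered - target
    score = 1.0 - len(missing | extra) / max(len(target | rendered), 1)
    return score, missing, extra


def viga_loop(target, max_iters=10):
    """Code-render-inspect loop: iteratively edit until render matches target."""
    state = AgentState()
    for _ in range(max_iters):
        rendered = render(state.program)
        score, missing, extra = inspect(rendered, target)
        state.history.append(score)
        if score == 1.0:  # render and target agree; cross-verification passes
            break
        # Evidence-based edit: add one missing element, drop one spurious one.
        if missing:
            state.program.append(sorted(missing)[0])
        if extra:
            state.program.remove(sorted(extra)[0])
    return state


state = viga_loop({"cube", "sphere", "light"})
print(sorted(state.program))  # → ['cube', 'light', 'sphere']
```

The loop terminates when inspection reports zero discrepancy, mirroring the paper's idea that symbolic edits are justified by visual evidence rather than made in one shot.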


Authors (9)

Shaofeng Yin
Jiaxin Ge
Zora Zhiruo Wang
Chenyang Wang
Xiuyu Li
Michael J. Black
Trevor Darrell
Angjoo Kanazawa
Haiwen Feng

Citation Format

Yin, S., Ge, J., Wang, Z.Z., Wang, C., Li, X., Black, M.J. et al. (2026). Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning. https://arxiv.org/abs/2601.11109

Journal Information
Publication Year: 2026
Language: en
Source Database: arXiv
Access: Open Access ✓