arXiv Open Access 2026

Visual Set Program Synthesizer

Zehua Cheng Wei Dai Wenhu Zhang Thomas Lukasiewicz Jiahao Sun

Abstract

A user pointing their phone at a supermarket shelf and asking "Which soda has the least sugar?" poses a difficult challenge for current visual AI assistants. Such queries require not only object recognition, but explicit set-based reasoning such as filtering, comparison, and aggregation. Standard end-to-end MLLMs often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis, where the model first generates a symbolic program that is executed by a separate engine grounded in visual scenes. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy. These results demonstrate that program-driven reasoning provides a principled alternative to black-box visual-language inference.
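
To make the idea of "generate a symbolic program, then execute it with a separate engine" concrete, here is a minimal Python sketch of the soda query from the abstract. It is an illustration only, not the authors' system: the `SceneObject` type, the operator names (`filter`, `argmin`, `count`), the program encoding, and the scene data are all hypothetical, and the scene is hard-coded where a real pipeline would presumably obtain it from visual detectors.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical grounded object; a real system would derive this from an image."""
    name: str
    category: str
    sugar_g: float


# Hypothetical scene standing in for detector output on a supermarket shelf.
SCENE = [
    SceneObject("Cola", "soda", 39.0),
    SceneObject("Lemon Fizz", "soda", 26.0),
    SceneObject("Orange Juice", "juice", 21.0),
    SceneObject("Diet Cola", "soda", 0.0),
]

# A program is a list of (operator, argument) steps applied left to right.
Step = Tuple[str, Optional[Callable[[SceneObject], Any]]]


def execute(program: List[Step], scene: List[SceneObject]) -> Any:
    """Run each step on the current state, starting from the full object set."""
    state: Any = list(scene)
    for op, arg in program:
        if op == "filter":      # keep the objects satisfying a predicate
            state = [o for o in state if arg(o)]
        elif op == "argmin":    # compare objects and pick the smallest by key
            state = min(state, key=arg) if state else None
        elif op == "count":     # aggregate the set into its size
            state = len(state)
        else:
            raise ValueError(f"unknown operator: {op}")
    return state


# "Which soda has the least sugar?" written as an explicit set program.
LEAST_SUGAR_SODA: List[Step] = [
    ("filter", lambda o: o.category == "soda"),
    ("argmin", lambda o: o.sugar_g),
]

answer = execute(LEAST_SUGAR_SODA, SCENE)
print(answer.name)
```

Because every step is explicit, the intermediate set after `filter` can be inspected, which is the kind of transparency the abstract contrasts with black-box inference.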

Topics & Keywords

cs.MM cs.CL cs.SC

Citation

Cheng, Z., Dai, W., Zhang, W., Lukasiewicz, T., Sun, J. (2026). Visual Set Program Synthesizer. https://arxiv.org/abs/2603.15997

Journal Information

Year Published: 2026
Language: en
Source Database: arXiv
Access: Open Access