arXiv Open Access 2026

Visual Set Program Synthesizer

Zehua Cheng, Wei Dai, Wenhu Zhang, Thomas Lukasiewicz, Jiahao Sun

Abstract

A user pointing their phone at a supermarket shelf and asking "Which soda has the least sugar?" poses a difficult challenge for current visual AI assistants. Such queries require not only object recognition, but explicit set-based reasoning such as filtering, comparison, and aggregation. Standard end-to-end MLLMs often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis, where the model first generates a symbolic program that is executed by a separate engine grounded in visual scenes. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy. These results demonstrate that program-driven reasoning provides a principled alternative to black-box visual-language inference.
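
As a rough illustration of the idea described in the abstract, the sketch below shows how the soda question could be expressed as a small symbolic set program and run by a separate executor. The abstract does not specify the program language or the execution engine, so everything here is an assumption made for illustration: the operation names (`Filter`, `ArgMin`, `Count`), the `execute` function, and the hand-written scene, which stands in for the output of a visual grounding step.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical scene format: one attribute dict per detected object.
# In a real system these attributes would come from a visual grounding module.
Scene = list[dict[str, Any]]


@dataclass(frozen=True)
class Filter:
    """Keep only objects whose attribute equals the given value."""
    attr: str
    value: Any


@dataclass(frozen=True)
class ArgMin:
    """Return the object with the smallest value of the attribute."""
    attr: str


@dataclass(frozen=True)
class Count:
    """Return the number of objects in the current set."""


def execute(program: list, scene: Scene):
    """Run a list of set operations over the scene, in order."""
    current = list(scene)
    for op in program:
        if isinstance(op, Filter):
            current = [o for o in current if o.get(op.attr) == op.value]
        elif isinstance(op, ArgMin):
            # Terminal operation: an empty set has no minimum.
            if not current:
                return None
            return min(current, key=lambda o: o[op.attr])
        elif isinstance(op, Count):
            # Terminal operation: aggregate the set to a number.
            return len(current)
        else:
            raise TypeError(f"unknown operation: {op!r}")
    return current


# Hand-written stand-in for a supermarket shelf (illustrative values only).
shelf: Scene = [
    {"name": "Cola Classic", "category": "soda", "sugar_g": 39},
    {"name": "Lemon Fizz", "category": "soda", "sugar_g": 26},
    {"name": "Orange Juice", "category": "juice", "sugar_g": 22},
    {"name": "Diet Cola", "category": "soda", "sugar_g": 0},
]

# "Which soda has the least sugar?" as filter followed by comparison.
program = [Filter("category", "soda"), ArgMin("sugar_g")]
answer = execute(program, shelf)
print(answer["name"] if answer else "no match")
```

Because the program is an explicit list of operations, each intermediate set can be inspected, which is one way to read the abstract's claim about more systematic and transparent behavior.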

Authors (5)

Zehua Cheng
Wei Dai
Wenhu Zhang
Thomas Lukasiewicz
Jiahao Sun

Citation Format

Cheng, Z., Dai, W., Zhang, W., Lukasiewicz, T., Sun, J. (2026). Visual Set Program Synthesizer. https://arxiv.org/abs/2603.15997

Journal Information
Publication Year: 2026
Language: en
Database Source: arXiv
Access: Open Access