arXiv Open Access 2025

Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models

Zahra Babaiee, Peyman M. Kiasari, Daniela Rus, Radu Grosu

Abstract

Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet a critical gap persists: "conceptualization", the ability to recognize and reason about the same concept despite variations in visual form, which is a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models failed completely on isomorphism detection and showed limited success in path/cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at https://vga.csail.mit.edu/
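
To make the idea of representation-invariant reasoning concrete, the following minimal sketch shows that a graph's structure is independent of how it is drawn. It is an illustration only, not the VGA generation pipeline: the helper names (circular_layout, line_layout, are_isomorphic) are hypothetical, the two toy layouts are stand-ins for layouts such as Kamada-Kawai and planar, and the brute-force isomorphism test is only practical for very small graphs.

```python
import math
from itertools import permutations


def circular_layout(nodes):
    """Place nodes evenly on a unit circle (stand-in for one drawing style)."""
    n = len(nodes)
    return {
        v: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i, v in enumerate(nodes)
    }


def line_layout(nodes):
    """Place nodes on a horizontal line (stand-in for a second drawing style)."""
    return {v: (float(i), 0.0) for i, v in enumerate(nodes)}


def are_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force test: try every node mapping from graph A to graph B."""
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    target = {frozenset(e) for e in edges_b}
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        mapped = {frozenset((mapping[u], mapping[v])) for u, v in edges_a}
        if mapped == target:
            return True
    return False


# A 4-cycle, and the same 4-cycle with relabeled nodes.
nodes_a = [0, 1, 2, 3]
edges_a = [(0, 1), (1, 2), (2, 3), (3, 0)]
nodes_b = ["a", "b", "c", "d"]
edges_b = [("a", "c"), ("c", "b"), ("b", "d"), ("d", "a")]

# A triangle with a pendant node: same node and edge counts, different structure.
nodes_c = [0, 1, 2, 3]
edges_c = [(0, 1), (1, 2), (2, 0), (2, 3)]

# Two drawings of graph A: the coordinates differ, the edge set does not.
pos_circle = circular_layout(nodes_a)
pos_line = line_layout(nodes_a)

print("layouts differ:", pos_circle != pos_line)
print("A ~ B:", are_isomorphic(nodes_a, edges_a, nodes_b, edges_b))
print("A ~ C:", are_isomorphic(nodes_a, edges_a, nodes_c, edges_c))
```

The isomorphism check never looks at coordinates, which is the point: the answer depends only on the edge structure, whereas a model reading an image must recover that structure from whichever layout it is shown.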

Authors (4)

Zahra Babaiee
Peyman M. Kiasari
Daniela Rus
Radu Grosu

Citation Format

Babaiee, Z., Kiasari, P.M., Rus, D., Grosu, R. (2025). Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models. https://arxiv.org/abs/2506.06242

Journal Information
Publication year: 2025
Language: English
Source database: arXiv
Access: Open Access