arXiv Open Access 2025

Spot The Ball: A Benchmark for Visual Social Inference

Neha Balamurugan, Sarah Wu, Adam Chun, Gabe Gaw, Cristobal Eyzaguirre, +1 more

Abstract

Humans excel at visual social inference, the ability to infer hidden elements of a scene from subtle behavioral cues such as other people's gaze, pose, and orientation. This ability drives everyday social reasoning in humans and is critical for developing more human-like AI agents. We introduce Spot The Ball, a challenging benchmark for evaluating visual social inference in vision-language models (VLMs) using sports as a test domain. The task is to localize a removed sports ball from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) using three prompting strategies, finding that humans are consistently two to three times more accurate (20-34%) than models ($\leq$ 17%) across all sports. Our analyses show that models rely on superficial spatial heuristics--such as guessing near the image center or nearby players--while humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human-model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.

Topics & Keywords

cs.CV cs.HC

Authors (6)

Neha Balamurugan

Sarah Wu

Adam Chun

Gabe Gaw

Cristobal Eyzaguirre

Tobias Gerstenberg

Citation Format

Balamurugan, N., Wu, S., Chun, A., Gaw, G., Eyzaguirre, C., & Gerstenberg, T. (2025). Spot The Ball: A Benchmark for Visual Social Inference. https://arxiv.org/abs/2511.00261

Journal Information

Publication Year: 2025
Language: en
Database Source: arXiv
Access: Open Access ✓