A Computational Approach to Visual Metonymy
Abstract
Images often communicate more than they literally depict: a set of tools can suggest an occupation, and a cultural artifact can suggest a tradition. This kind of indirect visual reference, known as visual metonymy, invites viewers to recover a target concept via associated cues rather than explicit depiction. In this work, we present the first computational investigation of visual metonymy. We introduce a novel pipeline grounded in semiotic theory that leverages large language models and text-to-image models to generate metonymic visual representations. Using this framework, we construct ViMET, the first visual metonymy dataset, comprising 2,000 multiple-choice questions that evaluate the cognitive reasoning abilities of multimodal language models. Experimental results on our dataset reveal a significant gap between human performance (86.9%) and state-of-the-art vision-language models (65.9%), highlighting limitations in machines' ability to interpret indirect visual references. Our dataset is publicly available at: https://github.com/cincynlp/ViMET.
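The abstract reports accuracy over multiple-choice questions; below is a minimal Python sketch of what such an evaluation loop could look like. The record fields (`image_path`, `question`, `options`, `answer_idx`) and the `predict` callable are hypothetical stand-ins, not the paper's actual data format or code; the real ViMET data and evaluation scripts are in the repository linked above.

```python
"""Minimal sketch of a ViMET-style multiple-choice evaluation loop.

Assumptions (not from the paper): the record fields and the `predict`
interface below are hypothetical placeholders for whatever interface a
given vision-language model exposes.
"""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MCQExample:
    image_path: str     # metonymic image (e.g., tools suggesting an occupation)
    question: str       # e.g., "What concept does this image indirectly refer to?"
    options: List[str]  # candidate target concepts
    answer_idx: int     # index of the gold target concept


def accuracy(examples: List[MCQExample],
             predict: Callable[[MCQExample], int]) -> float:
    """Fraction of questions where the model picks the gold option."""
    correct = sum(predict(ex) == ex.answer_idx for ex in examples)
    return correct / len(examples)


if __name__ == "__main__":
    # Toy run with a trivial baseline that always picks option 0.
    toy = [MCQExample("stethoscope.png",
                      "Which occupation does this image suggest?",
                      ["doctor", "chef", "pilot", "farmer"], 0)]
    print(f"baseline accuracy: {accuracy(toy, lambda ex: 0):.1%}")
```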
Authors (3)
Saptarshi Ghosh
Linfeng Liu
Tianyu Jiang
Quick Access
- Publication Year: 2026
- Language: en
- Source Database: arXiv
- Access: Open Access ✓