arXiv Open Access 2025

Taxonomy-Aware Evaluation of Vision-Language Models

Vésteinn Snæbjarnarson, Kevin Du, Niklas Stoehr, Serge Belongie, Ryan Cotterell, +2 others

Abstract

When a vision-language model (VLM) is prompted to identify an entity depicted in an image, it may answer 'I see a conifer,' rather than the specific label 'norway spruce'. This raises two issues for evaluation: First, the unconstrained generated text needs to be mapped to the evaluation label space (i.e., 'conifer'). Second, a useful classification measure should give partial credit to less-specific, but not incorrect, answers ('norway spruce' being a type of 'conifer'). To meet these requirements, we propose a framework for evaluating unconstrained text predictions, such as those generated from a vision-language model, against a taxonomy. Specifically, we propose the use of hierarchical precision and recall measures to assess the level of correctness and specificity of predictions with regard to a taxonomy. Experimentally, we first show that existing text similarity measures do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the generated text and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our proposed taxonomic evaluation scheme.
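The partial-credit idea in the abstract can be illustrated with a minimal sketch of hierarchical precision and recall. It assumes one common ancestor-set formulation (overlap of the root-excluded ancestor paths of the predicted and true labels); the abstract does not spell out the exact definition the authors use. The toy taxonomy and the names `PARENT`, `path_to_root`, and `hierarchical_pr` are hypothetical, and the step of mapping free-form text onto a taxonomy node is assumed to have already happened.

```python
# Toy taxonomy as child -> parent links (hypothetical; not taken from the paper).
PARENT = {
    "norway spruce": "spruce",
    "spruce": "conifer",
    "scots pine": "pine",
    "pine": "conifer",
    "conifer": "plant",
    "oak": "broadleaf",
    "broadleaf": "plant",
}


def path_to_root(node):
    """Return the node and its ancestors, excluding the root (an assumption)."""
    nodes = set()
    while node in PARENT:  # the root has no parent entry, so it is never added
        nodes.add(node)
        node = PARENT[node]
    return nodes


def hierarchical_pr(predicted, truth):
    """Hierarchical precision and recall from ancestor-set overlap."""
    pred_set = path_to_root(predicted)
    true_set = path_to_root(truth)
    overlap = len(pred_set & true_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(true_set) if true_set else 0.0
    return precision, recall


if __name__ == "__main__":
    for guess in ("norway spruce", "conifer", "scots pine", "oak"):
        p, r = hierarchical_pr(guess, "norway spruce")
        print(f"{guess!r}: precision={p:.2f}, recall={r:.2f}")
```

Under these assumptions, answering 'conifer' for a 'norway spruce' image keeps full precision (nothing predicted is wrong) but loses recall (the answer is less specific), while 'scots pine' is penalized on both measures and 'oak' scores zero.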

Topics & Keywords

cs.CV

Authors (7)

Vésteinn Snæbjarnarson

Kevin Du

Niklas Stoehr

Serge Belongie

Ryan Cotterell

Nico Lang

Stella Frank

Citation Format

Snæbjarnarson, V., Du, K., Stoehr, N., Belongie, S., Cotterell, R., Lang, N., & Frank, S. (2025). Taxonomy-Aware Evaluation of Vision-Language Models. https://arxiv.org/abs/2504.05457

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓