arXiv Open Access 2025

Taxonomy-Aware Evaluation of Vision-Language Models

Vésteinn Snæbjarnarson, Kevin Du, Niklas Stoehr, Serge Belongie, Ryan Cotterell, +2 others

Abstract

When a vision-language model (VLM) is prompted to identify an entity depicted in an image, it may answer 'I see a conifer,' rather than the specific label 'norway spruce'. This raises two issues for evaluation: First, the unconstrained generated text needs to be mapped to the evaluation label space (i.e., 'conifer'). Second, a useful classification measure should give partial credit to less-specific, but not incorrect, answers ('norway spruce' being a type of 'conifer'). To meet these requirements, we propose a framework for evaluating unconstrained text predictions, such as those generated from a vision-language model, against a taxonomy. Specifically, we propose the use of hierarchical precision and recall measures to assess the level of correctness and specificity of predictions with regard to a taxonomy. Experimentally, we first show that existing text similarity measures do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the generated text and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our proposed taxonomic evaluation scheme.
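As a rough illustration of the partial-credit idea, the sketch below computes hierarchical precision and recall from ancestor sets. It assumes the common ancestor-overlap formulation, which the abstract does not spell out, so it may differ from the paper's exact definition. The toy taxonomy `PARENT` and the helpers `ancestors` and `hierarchical_pr` are hypothetical names introduced only for this example.

```python
# Toy taxonomy as child -> parent. The entries are illustrative assumptions,
# not taken from the paper. "plant" is the root and has no parent.
PARENT = {
    "norway spruce": "spruce",
    "spruce": "conifer",
    "scots pine": "pine",
    "pine": "conifer",
    "conifer": "plant",
    "oak": "broadleaf",
    "broadleaf": "plant",
}


def ancestors(node):
    """Return the node plus all of its ancestors, excluding the root."""
    path = set()
    while node in PARENT:  # the root has no parent, so it is never added
        path.add(node)
        node = PARENT[node]
    return path


def hierarchical_pr(predicted, truth):
    """Hierarchical precision and recall via ancestor-set overlap (assumed form)."""
    pred_set = ancestors(predicted)
    true_set = ancestors(truth)
    overlap = len(pred_set & true_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(true_set) if true_set else 0.0
    return precision, recall


# A correct but less specific answer keeps full precision and loses recall.
print(hierarchical_pr("conifer", "norway spruce"))
# A wrong sibling species gets partial credit on both measures.
print(hierarchical_pr("scots pine", "norway spruce"))
# An answer from an unrelated branch gets no credit.
print(hierarchical_pr("oak", "norway spruce"))
```

Under these assumptions, answering 'conifer' for a 'norway spruce' image is scored as fully precise but only partly specific, which matches the behaviour the abstract asks of a useful measure.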

Authors (7)

Vésteinn Snæbjarnarson

Kevin Du

Niklas Stoehr

Serge Belongie

Ryan Cotterell

Nico Lang

Stella Frank

Citation Format

Snæbjarnarson, V., Du, K., Stoehr, N., Belongie, S., Cotterell, R., Lang, N. et al. (2025). Taxonomy-Aware Evaluation of Vision-Language Models. https://arxiv.org/abs/2504.05457

Journal Information
Publication year: 2025
Language: en
Source database: arXiv
Access: Open Access