arXiv Open Access 2025

Testing the Limits of Fine-Tuning for Improving Visual Cognition in Vision Language Models

Luca M. Schulze Buschoff, Konstantinos Voudouris, Elif Akata, Matthias Bethge, Joshua B. Tenenbaum, Eric Schulz

Abstract

Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains in a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that task-specific fine-tuning does not yield robust, human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.


Authors (6)

Luca M. Schulze Buschoff
Konstantinos Voudouris
Elif Akata
Matthias Bethge
Joshua B. Tenenbaum
Eric Schulz

Citation Format

Schulze Buschoff, L.M., Voudouris, K., Akata, E., Bethge, M., Tenenbaum, J.B., & Schulz, E. (2025). Testing the Limits of Fine-Tuning for Improving Visual Cognition in Vision Language Models. https://arxiv.org/abs/2502.15678

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓