arXiv Open Access 2024

Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration

ChaeHun Park, Yujin Baek, Jaeseok Kim, Yu-Jung Heo, Du-Seong Chang, Jaegul Choo

Abstract

To create culturally inclusive vision-language models (VLMs), it is essential to develop a benchmark that tests their ability to answer culturally relevant questions. Existing approaches typically rely on human annotators, which makes the process labor-intensive and places a cognitive burden on annotators who must generate diverse questions. To address this, we propose a semi-automated framework for constructing cultural VLM benchmarks, specifically targeting multiple-choice QA. The framework relies on human-VLM collaboration: VLMs generate questions based on guidelines, a small set of annotated examples, and relevant knowledge, and native speakers then verify the results. We demonstrate the effectiveness of this framework through the creation of K-Viscuit, a dataset focused on Korean culture. Our experiments on this dataset reveal that open-source models lag behind proprietary ones in understanding Korean culture, highlighting key areas for improvement. We also present a series of further analyses, including human evaluation, augmenting VLMs with external knowledge, and evaluation beyond multiple-choice QA. Our dataset is available at https://huggingface.co/datasets/ddehun/k-viscuit.
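The workflow the abstract describes (prompt a VLM with guidelines, a few annotated examples, and relevant knowledge; have native speakers verify the output; score models on the resulting multiple-choice items) can be illustrated with a minimal Python sketch. Every name below (`MCQuestion`, `build_generation_prompt`, `keep_verified`, `accuracy`), the prompt layout, and the sample item are hypothetical and chosen for illustration only; they are not taken from the paper or from the K-Viscuit dataset schema, and the actual VLM call is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MCQuestion:
    """One multiple-choice item. Field names are hypothetical."""
    question: str
    options: List[str]
    answer_index: int  # position of the correct option in `options`


def build_generation_prompt(guidelines: str,
                            examples: List[MCQuestion],
                            knowledge: str) -> str:
    """Assemble the text part of a question-generation prompt.

    Assumed layout: guidelines, then few-shot examples, then knowledge.
    The image itself would be passed to the VLM separately.
    """
    shots = "\n\n".join(
        f"Q: {ex.question}\n"
        f"Options: {' | '.join(ex.options)}\n"
        f"Answer: {ex.options[ex.answer_index]}"
        for ex in examples
    )
    return (
        f"Guidelines:\n{guidelines}\n\n"
        f"Examples:\n{shots}\n\n"
        f"Knowledge:\n{knowledge}\n\n"
        "Write one new multiple-choice question about the image."
    )


def keep_verified(candidates: List[MCQuestion],
                  verdicts: List[bool]) -> List[MCQuestion]:
    """Keep only the candidates that a human reviewer approved."""
    if len(candidates) != len(verdicts):
        raise ValueError("one verdict is required per candidate")
    return [q for q, ok in zip(candidates, verdicts) if ok]


def accuracy(questions: List[MCQuestion], predictions: List[int]) -> float:
    """Fraction of questions whose predicted option index is correct."""
    if len(questions) != len(predictions):
        raise ValueError("one prediction is required per question")
    if not questions:
        return 0.0
    correct = sum(q.answer_index == p for q, p in zip(questions, predictions))
    return correct / len(questions)


# Made-up item used only to exercise the helpers above.
example = MCQuestion(
    question="Which dish is shown in the image?",
    options=["Kimchi", "Sushi", "Pho", "Pad thai"],
    answer_index=0,
)
prompt = build_generation_prompt(
    "Ask about what is visible in the image.", [example], "(retrieved text)"
)
verified = keep_verified([example, example], [True, False])
score = accuracy([example, example], [0, 1])
```

Keeping the verification step as a plain filter over reviewer verdicts keeps the human decision separate from the generation code, so the same helpers would work whichever VLM produces the candidates.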

Topics & Keywords

cs.CL cs.CV

Authors (6)

ChaeHun Park

Yujin Baek

Jaeseok Kim

Yu-Jung Heo

Du-Seong Chang

Jaegul Choo

Citation (APA)

Park, C., Baek, Y., Kim, J., Heo, Y.-J., Chang, D.-S., & Choo, J. (2024). Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration. arXiv. https://arxiv.org/abs/2406.16469

Journal Information

Publication Year: 2024
Language: en
Source Database: arXiv
Access: Open Access