arXiv Open Access 2025

World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

Eunsu Kim, Junyeong Park, Na Min An, Junseong Kim, Hitesh Laxmichand Patel, +7 others
Abstract

In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find that supervised fine-tuning on a diverse culture mixing dataset substantially improves model consistency and reduces background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.
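
The two quantities the abstract reports, the accuracy gap between subtasks and the consistency of predictions for the same food across contexts, could be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: the record layout, the field names, the toy data, and the exact definitions of "drop" and "consistency" are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical record format (not from the paper):
# (food_id, subtask, predicted_label, gold_label)
records = [
    ("kimchi", "food-only", "Korea", "Korea"),
    ("kimchi", "food+background", "Japan", "Korea"),
    ("taco", "food-only", "Mexico", "Mexico"),
    ("taco", "food+background", "Mexico", "Mexico"),
]


def accuracy_by_subtask(rows):
    """Return {subtask: fraction of rows where prediction equals gold}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for _, subtask, pred, gold in rows:
        total[subtask] += 1
        correct[subtask] += int(pred == gold)
    return {s: correct[s] / total[s] for s in total}


def consistency(rows):
    """Fraction of foods that receive one single prediction in every context.

    This definition is an assumption; the paper may measure consistency differently.
    """
    preds = defaultdict(set)
    for food_id, _, pred, _ in rows:
        preds[food_id].add(pred)
    return sum(len(p) == 1 for p in preds.values()) / len(preds)


acc = accuracy_by_subtask(records)
# Absolute accuracy gap between the baseline and the background condition.
drop = acc["food-only"] - acc["food+background"]
print(acc, drop, consistency(records))
```

Whether the 14% in the abstract is an absolute or a relative drop is not stated there, so the sketch simply takes the absolute difference.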

Authors (12)

Eunsu Kim
Junyeong Park
Na Min An
Junseong Kim
Hitesh Laxmichand Patel
Jiho Jin
Julia Kruk
Amit Agarwal
Srikant Panda
Fenal Ashokbhai Ilasariya
Hyunjung Shim
Alice Oh

Citation Format

Kim, E., Park, J., An, N.M., Kim, J., Patel, H.L., Jin, J. et al. (2025). World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models. https://arxiv.org/abs/2511.22787

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access