arXiv Open Access 2025

JEEM: Vision-Language Understanding in Four Arabic Dialects

Karima Kadaoui, Hanin Atwany, Hamdan Al-Ali, Abdelrahman Mohamed, Ali Mekky, +5 others

Abstract

We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: Jordan, the Emirates, Egypt, and Morocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. This dataset aims to assess the ability of VLMs to generalize across dialects and accurately interpret cultural elements in visual contexts. In an evaluation of five prominent open-source Arabic VLMs and GPT-4V, we find that the Arabic VLMs consistently underperform, struggling with both visual understanding and dialect-specific generation. While GPT-4V ranks best in this comparison, the model's linguistic competence varies across dialects, and its visual understanding capabilities lag behind. This underscores the need for more inclusive models and the value of culturally diverse evaluation paradigms.

Topics & Keywords

cs.CL cs.AI

Authors (10)

Karima Kadaoui

Hanin Atwany

Hamdan Al-Ali

Abdelrahman Mohamed

Ali Mekky

Sergei Tilga

Natalia Fedorova

Ekaterina Artemova

Hanan Aldarmaki

Yova Kementchedjhieva

Citation Format

Kadaoui, K., Atwany, H., Al-Ali, H., Mohamed, A., Mekky, A., Tilga, S. et al. (2025). JEEM: Vision-Language Understanding in Four Arabic Dialects. https://arxiv.org/abs/2503.21910

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓