arXiv Open Access 2025

AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

Mohammad Zbeeb, Hasan Abed Al Kader Hammoud, Sina Mukalled, Nadine Rizk, Fatima Karnib, +3 others

Abstract

We present AraLingBench, a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories (grammar, morphology, spelling, reading comprehension, and syntax) through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.

Topics & Keywords

cs.CL, cs.AI, cs.LG

Authors (8)

Mohammad Zbeeb

Hasan Abed Al Kader Hammoud

Sina Mukalled

Nadine Rizk

Fatima Karnib

Issam Lakkis

Ammar Mohanna

Bernard Ghanem

Citation Format (APA)

Zbeeb, M., Hammoud, H. A. A. K., Mukalled, S., Rizk, N., Karnib, F., Lakkis, I., Mohanna, A., & Ghanem, B. (2025). AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models. https://arxiv.org/abs/2511.14295

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓