Evaluation of large language model performance on Persian rheumatology board exams: accuracy and clinical reasoning of GPT-4o vs. GPT-5.1
Abstract
Large language models are increasingly integrated into medical education, yet evidence on their performance in non-English clinical examinations, particularly those in Persian, remains limited. This study evaluated how GPT-4o and GPT-5.1 perform on Iranian Rheumatology Board examination questions. A total of 204 multiple-choice items were administered to each model in Persian using a similar prompt. Accuracy was determined against the official answer key, and six board-certified rheumatologists independently scored each model’s clinical reasoning on a 1–5 scale. GPT-5.1 performed markedly better, achieving 76% accuracy compared with 64.5% for GPT-4o, alongside significantly higher reasoning scores. Unlike GPT-4o, which showed considerable variability across question types, GPT-5.1 performed consistently across the basic science, clinical scenario, diagnosis, and treatment domains. Inter-rater agreement among the rheumatologists was modest but statistically significant. These findings suggest that newer-generation LLMs provide more reliable reasoning and higher accuracy in Persian medical assessments. Nevertheless, despite their promise as educational aids, current models are not yet suitable for high-stakes clinical decision-making and require continued evaluation across diverse languages and specialties.
Authors (9)
Farzad Rafiei
Saina Sadeghipour
Shahab Sheikhalishahi
Ehsan Rahmanian
Maryam Loghman
Bita Shadbakht
Leyla Bagheri
Hamidreza Bashiri
Hamidreza Soltani
Quick Access
- Year of publication: 2026
- Source database: DOAJ
- DOI: 10.1038/s41598-026-38716-z
- Access: Open Access ✓