DOAJ Open Access 2026

Evaluation of large Language model performance on Persian rheumatology board exams: accuracy and clinical reasoning of GPT-4o vs. GPT-5.1

Farzad Rafiei, Saina Sadeghipour, Shahab Sheikhalishahi, Ehsan Rahmanian, Maryam Loghman, +4 others

Abstract

Large language models are increasingly integrated into medical education, yet their performance in non-English clinical examinations, particularly Persian, remains limited. This study evaluated how GPT-4o and GPT-5.1 perform on Iranian Rheumatology Board examination questions. A total of 204 multiple-choice items were administered in Persian using a similar prompt. Accuracy was determined using the official answer key, and six board-certified rheumatologists independently scored each model’s clinical reasoning on a 1–5 scale. GPT-5.1 demonstrated markedly superior performance, achieving 76% accuracy compared with 64.5% for GPT-4o, alongside significantly higher reasoning scores. Unlike GPT-4o, which showed considerable variability across question types, GPT-5.1 performed consistently across basic science, clinical scenarios, diagnosis, and treatment domains. Although inter-rater agreement among rheumatologists was modest, it remained statistically significant. These findings suggest that newer-generation LLMs provide more reliable reasoning and accuracy in Persian medical assessments. Nevertheless, despite their promising role as educational aids, current models are not yet suitable for high-stakes clinical decision-making and require continued evaluation across diverse languages and specialties.

Topics & Keywords

Medicine, Science

Authors (9)

Farzad Rafiei

Saina Sadeghipour

Shahab Sheikhalishahi

Ehsan Rahmanian

Maryam Loghman

Bita Shadbakht

Leyla Bagheri

Hamidreza Bashiri

Hamidreza Soltani

Citation Format

Rafiei, F., Sadeghipour, S., Sheikhalishahi, S., Rahmanian, E., Loghman, M., Shadbakht, B. et al. (2026). Evaluation of large Language model performance on Persian rheumatology board exams: accuracy and clinical reasoning of GPT-4o vs. GPT-5.1. https://doi.org/10.1038/s41598-026-38716-z

Journal Information

Year of Publication: 2026
Source Database: DOAJ
DOI: 10.1038/s41598-026-38716-z
Access: Open Access ✓