DOAJ Open Access 2026

Evaluation of large Language model performance on Persian rheumatology board exams: accuracy and clinical reasoning of GPT-4o vs. GPT-5.1

Farzad Rafiei, Saina Sadeghipour, Shahab Sheikhalishahi, Ehsan Rahmanian, Maryam Loghman, +4 others

Abstract

Large language models are increasingly integrated into medical education, yet their performance in non-English clinical examinations, particularly Persian, remains limited. This study evaluated how GPT-4o and GPT-5.1 perform on Iranian Rheumatology Board examination questions. A total of 204 multiple-choice items were administered in Persian using a similar prompt. Accuracy was determined using the official answer key, and six board-certified rheumatologists independently scored each model’s clinical reasoning on a 1–5 scale. GPT-5.1 demonstrated markedly superior performance, achieving 76% accuracy compared with 64.5% for GPT-4o, alongside significantly higher reasoning scores. Unlike GPT-4o, which showed considerable variability across question types, GPT-5.1 performed consistently across basic science, clinical scenarios, diagnosis, and treatment domains. Although inter-rater agreement among rheumatologists was modest, it remained statistically significant. These findings suggest that newer-generation LLMs provide more reliable reasoning and accuracy in Persian medical assessments. Nevertheless, despite their promising role as educational aids, current models are not yet suitable for high-stakes clinical decision-making and require continued evaluation across diverse languages and specialties.

Authors (9)

Farzad Rafiei

Saina Sadeghipour

Shahab Sheikhalishahi

Ehsan Rahmanian

Maryam Loghman

Bita Shadbakht

Leyla Bagheri

Hamidreza Bashiri

Hamidreza Soltani

Citation Format

Rafiei, F., Sadeghipour, S., Sheikhalishahi, S., Rahmanian, E., Loghman, M., Shadbakht, B. et al. (2026). Evaluation of large Language model performance on Persian rheumatology board exams: accuracy and clinical reasoning of GPT-4o vs. GPT-5.1. https://doi.org/10.1038/s41598-026-38716-z

Journal Information
Publication Year: 2026
Source Database: DOAJ
DOI: 10.1038/s41598-026-38716-z
Access: Open Access