DOAJ Open Access 2026

The role of large language models in emergency care: a comprehensive benchmarking study

Borna Naderi, Longsha Liu, Anita Ghandehari, Darius Khoshons, R. Andrew Taylor, +8 others

Abstract

With emergency departments (EDs) increasingly overburdened, large language models (LLMs) may help streamline workflow and decision-making. We evaluated their emergency medicine knowledge and performance in simulated ED tasks. This two-part study first tested the factual knowledge of 18 LLMs using a curated MedMCQA subset covering 12 ED chief complaints, assessing accuracy, precision, and recall. Five models (GPT-5, GPT-4, Claude 3.5, Claude 4, and LLaMA 3.1) were then evaluated on patient summaries, Emergency Severity Index scoring, investigative questioning, management planning, and differential diagnosis across 12 simulated ED cases presented through four sequential information levels. Physicians rated outputs for accuracy, safety, and clinical relevance, with performance differences analyzed statistically. LLaMA-4 Maverick achieved the highest factual accuracy (90.7%), followed by LLaMA-3.1-70B (90.1%). In clinical tasks, GPT-5 outperformed all other models from Level 2 onwards (p < 0.05), with performance stable or improving as complexity increased. Claude 3.5 ranked next, while Claude 4 performed slightly lower but remained stable as complexity increased. LLaMA-3.1 and GPT-4 ranked lowest and showed the greatest degradation. All models undertriaged except Claude 3.5, which initially overtriaged. GPT-5 demonstrated the strongest clinical reasoning and scalability with complexity, while LLaMA models excelled in factual recall. Findings suggest a generational leap in reasoning performance and support GPT-5 as a potential ED decision-support tool.

Authors (13)

Borna Naderi

Longsha Liu

Anita Ghandehari

Darius Khoshons

R. Andrew Taylor

Neil Bhavsar

Shriman Balasubramanian

Robert Tanouye

Nancy Creech

Christian Davidson

Justin Norden

Rahul Sharma

Alexander Fortenko

Citation Format

Naderi, B., Liu, L., Ghandehari, A., Khoshons, D., Taylor, R.A., Bhavsar, N. et al. (2026). The role of large language models in emergency care: a comprehensive benchmarking study. https://doi.org/10.1038/s44387-026-00078-2

Journal Information
Publication Year: 2026
Source Database: DOAJ
DOI: 10.1038/s44387-026-00078-2
Access: Open Access ✓