The role of large language models in emergency care: a comprehensive benchmarking study
Abstract
With emergency departments (EDs) increasingly overburdened, large language models (LLMs) may help streamline workflow and decision-making. We evaluated their emergency medicine knowledge and their performance in simulated ED tasks. This two-part study first tested the factual knowledge of 18 LLMs using a curated MedMCQA subset covering 12 ED chief complaints, assessing accuracy, precision, and recall. Five models (GPT-5, GPT-4, Claude 3.5, Claude 4, and LLaMA 3.1) were then evaluated on patient summaries, Emergency Severity Index scoring, investigative questioning, management planning, and differential diagnosis across 12 simulated ED cases, each presented through four sequential information levels. Physicians rated outputs for accuracy, safety, and clinical relevance, and performance differences were analyzed statistically. LLaMA-4 Maverick achieved the highest factual accuracy (90.7%), followed by LLaMA-3.1-70B (90.1%). In clinical tasks, GPT-5 outperformed all other models (from Level 2 onwards, p < 0.05), with performance stable or improving as complexity increased. Claude 3.5 ranked next, while Claude 4 scored slightly lower but remained stable as complexity increased. LLaMA-3.1 and GPT-4 ranked lowest and showed the greatest degradation. All models undertriaged except Claude 3.5, which initially overtriaged. GPT-5 demonstrated the strongest clinical reasoning and scaled best with complexity, while LLaMA models excelled in factual recall. These findings suggest a generational leap in reasoning performance and support GPT-5 as a potential ED decision-support tool.
Authors (13)
Borna Naderi
Longsha Liu
Anita Ghandehari
Darius Khoshons
R. Andrew Taylor
Neil Bhavsar
Shriman Balasubramanian
Robert Tanouye
Nancy Creech
Christian Davidson
Justin Norden
Rahul Sharma
Alexander Fortenko
Quick Access
- Year of publication: 2026
- Source database: DOAJ
- DOI: 10.1038/s44387-026-00078-2
- Access: Open Access ✓