DOAJ Open Access 2026

The role of large language models in emergency care: a comprehensive benchmarking study

Borna Naderi, Longsha Liu, Anita Ghandehari, Darius Khoshons, R. Andrew Taylor, +8 others

Abstract

With emergency departments (EDs) increasingly overburdened, large language models (LLMs) may help streamline workflow and decision-making. We evaluated their emergency medicine knowledge and their performance in simulated ED tasks. This two-part study first tested the factual knowledge of 18 LLMs using a curated MedMCQA subset covering 12 ED chief complaints, assessing accuracy, precision, and recall. Five models (GPT-5, GPT-4, Claude 3.5, Claude 4, and LLaMA 3.1) were then evaluated on patient summaries, Emergency Severity Index scoring, investigative questioning, management planning, and differential diagnosis across 12 simulated ED cases presented through four sequential information levels. Physicians rated outputs for accuracy, safety, and clinical relevance, and performance differences were analyzed statistically. LLaMA-4 Maverick achieved the highest factual accuracy (90.7%), followed by LLaMA-3.1-70B (90.1%). In clinical tasks, GPT-5 outperformed all other models (Level 2 onwards, p < 0.05), with performance stable or improving as complexity increased. Claude 3.5 ranked next, while Claude 4 scored slightly lower but remained stable as complexity increased. LLaMA-3.1 and GPT-4 ranked lowest and showed the greatest degradation. All models undertriaged except Claude 3.5, which initially overtriaged. GPT-5 demonstrated the strongest clinical reasoning and scalability with complexity, while LLaMA models excelled in factual recall. The findings suggest a generational leap in reasoning performance and support GPT-5 as a potential ED decision-support tool.
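The abstract reports accuracy, precision, and recall on multiple-choice questions but does not say how precision and recall were aggregated. As a minimal sketch, assuming four answer options per question (as in MedMCQA) and macro-averaging over options, the three metrics could be computed as follows. The function name `mcq_metrics` and the sample answers are hypothetical and are not taken from the study.

```python
from collections import Counter


def mcq_metrics(gold, pred, options=("A", "B", "C", "D")):
    """Accuracy plus macro-averaged precision and recall over answer options.

    Assumption: macro-averaging is used; the study may have aggregated differently.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one answer is required")

    # Fraction of questions answered correctly.
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

    # Per-option counts: true positives, times predicted, times correct in gold.
    tp = Counter(g for g, p in zip(gold, pred) if g == p)
    pred_counts = Counter(pred)
    gold_counts = Counter(gold)

    # An option that was never predicted (or never the gold answer) scores 0.0.
    precisions = [tp[o] / pred_counts[o] if pred_counts[o] else 0.0 for o in options]
    recalls = [tp[o] / gold_counts[o] if gold_counts[o] else 0.0 for o in options]

    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / len(options),
        "recall": sum(recalls) / len(options),
    }


if __name__ == "__main__":
    # Hypothetical answer key and model answers, for illustration only.
    gold = ["A", "B", "C", "D", "A", "B"]
    pred = ["A", "B", "C", "A", "A", "C"]
    print(mcq_metrics(gold, pred))
```

Macro-averaging weights every answer option equally, so a model that systematically avoids one option is penalized even when its overall accuracy is high.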

Topics & Keywords

Information technology

Authors (13)

Borna Naderi

Longsha Liu

Anita Ghandehari

Darius Khoshons

R. Andrew Taylor

Neil Bhavsar

Shriman Balasubramanian

Robert Tanouye

Nancy Creech

Christian Davidson

Justin Norden

Rahul Sharma

Alexander Fortenko

Citation Format

Naderi, B., Liu, L., Ghandehari, A., Khoshons, D., Taylor, R.A., Bhavsar, N. et al. (2026). The role of large language models in emergency care: a comprehensive benchmarking study. https://doi.org/10.1038/s44387-026-00078-2

Journal Information

Publication Year: 2026
Source Database: DOAJ
DOI: 10.1038/s44387-026-00078-2
Access: Open Access ✓