Real-world performance of open-source large language models in diabetes diagnosis
Abstract
Background: This study aimed to evaluate the performance of diverse open-source large language models (LLMs) in diagnosing diabetes subtypes and comorbidities from unstructured clinical text, assessing the impact of model characteristics, prompting, and language.
Methods: We conducted a retrospective analysis of 11,329 adult diabetes patients from a large Chinese tertiary center (2010–2020). Various open-source LLMs were tested using four prompting strategies in English and Chinese. Primary outcomes were F1-scores for multi-class diabetes subtyping and binary classification of diabetic kidney disease (DKD) and metabolic syndrome (MetS).
Results: LLMs demonstrated high performance in complex subtyping (peak F1 0.951) but showed limitations in rule-based DKD (F1 0.570) and MetS (F1 0.650) diagnosis. Chain-of-Thought prompting improved MetS classification but degraded DKD performance. Optimal model size was approximately 32B parameters. Notably, English prompts outperformed Chinese prompts on native Chinese text.
Conclusion: Open-source LLMs exhibit strong holistic pattern recognition for complex classification but struggle with rule-based procedural reasoning. These models are promising as clinical co-pilots to augment expert decision-making rather than serving as autonomous diagnostic tools.
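For readers unfamiliar with the metric, the F1-score reported as the primary outcome is the harmonic mean of precision and recall. The sketch below is illustrative only: the abstract does not state which averaging scheme was used for the multi-class subtyping task, so macro-averaging is an assumption here, and the function names and the labels "T1D", "T2D", and "other" are hypothetical rather than taken from the study.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: F1 is defined as 0 here to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging, not from the source)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_binary(y_true, y_pred, positive=label) for label in labels) / len(labels)


if __name__ == "__main__":
    # Binary case, e.g. DKD present (1) vs absent (0); toy data, not study data.
    print(f1_binary([1, 1, 0, 0], [1, 0, 1, 0]))

    # Multi-class case with hypothetical subtype labels.
    truth = ["T1D", "T2D", "T2D", "other"]
    preds = ["T1D", "T2D", "other", "other"]
    print(f1_macro(truth, preds))
```

Macro-averaging weights every class equally, which matters when some subtypes are rare; a weighted or micro average would give different numbers on the same predictions.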
Authors
Shuting Yang
Sujie Liu
Yuxi Ma
Baowen Gai
Junwei Liu
Liansheng Wang
Feng Gao
Zhiguang Zhou
Quick Access
- Publication year: 2026
- Source database: DOAJ
- DOI: 10.3389/fendo.2026.1747468
- Access: Open Access ✓