DOAJ Open Access 2025

Evaluating Medical Text Summaries Using Automatic Evaluation Metrics and LLM-as-a-Judge Approach: A Pilot Study

Yuriy Vasilev, Irina Raznitsyna, Anastasia Pamova, Tikhon Burtsev, Tatiana Bobrovskaya, +4 more

Abstract

<b>Background:</b> Electronic health records (EHRs) are a vital source of clinical information, yet processing these heterogeneous data is highly labor-intensive. Summarization of these data with Large Language Models (LLMs) is a promising tool to support practicing physicians. Unbiased, automated quality control is crucial for integrating such tools into routine practice while saving time and labor. This pilot study assessed the potential and limitations of self-contained evaluation of summarization quality (without expert involvement) based on automatic evaluation metrics and the LLM-as-a-judge approach.

<b>Methods:</b> Summaries of text data from 30 EHRs were generated by six open-source low-parameter LLMs. The medical summaries were evaluated with standard automatic metrics (BLEU, ROUGE, METEOR, BERTScore) and with the LLM-as-a-judge approach against the following criteria: relevance, completeness, redundancy, coherence and structure, grammar and terminology, and hallucinations. Expert evaluation used the same criteria.

<b>Results:</b> The results showed that LLMs hold great promise for summarizing medical data. Nevertheless, neither the automatic evaluation metrics nor the LLM judges reliably detect factual errors and semantic distortions (hallucinations). For relevance, the Pearson correlation between the summary quality score and the expert opinions was 0.688.

<b>Conclusions:</b> Fully automating the evaluation of medical summaries remains challenging. Further research should focus on dedicated methods for detecting hallucinations and on larger or specialized models trained on medical texts. Integrating retrieval-augmented generation (RAG) into the LLM-as-a-judge architecture also deserves attention. Even now, however, the combination of LLMs and automatic evaluation metrics can underpin medical decision support systems by performing initial evaluations and flagging potential shortcomings for expert review.
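The correlation reported in the Results can be illustrated with a minimal sketch: compute the Pearson correlation between automatic relevance scores and expert ratings for a set of summaries. The score values below are hypothetical placeholders, not data from the study, and the pure-stdlib `pearson_r` helper is an assumption introduced for illustration (in practice one would typically use `scipy.stats.pearsonr`).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical automatic relevance scores and expert ratings for five summaries.
auto_scores = [0.82, 0.74, 0.91, 0.60, 0.78]
expert_scores = [4.0, 3.5, 4.5, 3.0, 4.0]

print(round(pearson_r(auto_scores, expert_scores), 3))  # → 0.981
```

A coefficient near 1 would mean the automatic metric ranks summaries almost exactly as the experts do; the study's observed value of 0.688 for relevance indicates only moderate agreement.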


Authors (9)

Yuriy Vasilev
Irina Raznitsyna
Anastasia Pamova
Tikhon Burtsev
Tatiana Bobrovskaya
Pavel Kosov
Anton Vladzymyrskyy
Olga Omelyanskaya
Kirill Arzamasov

Citation Format

Vasilev, Y., Raznitsyna, I., Pamova, A., Burtsev, T., Bobrovskaya, T., Kosov, P. et al. (2025). Evaluating Medical Text Summaries Using Automatic Evaluation Metrics and LLM-as-a-Judge Approach: A Pilot Study. https://doi.org/10.3390/diagnostics16010003

Journal Information
Year Published: 2025
Source Database: DOAJ
DOI: 10.3390/diagnostics16010003
Access: Open Access ✓