Medical QA dialogue datasets in RAG systems performance evaluation and ChatGPT optimization
Abstract
This study evaluates the effectiveness of Chinese doctor–patient dialogues as retrieval sources for Retrieval-Augmented Generation (RAG) in clinical question answering. Using ChatGPT-3.5 as a baseline and extending to GPT-4o and GPT-5, we compare multiple retrieval pipelines, including dense retrieval, Cross-Encoder reranking, Reciprocal Rank Fusion (RRF), and Cascade RRF→Rerank. Experimental results show that dialogue-based retrieval significantly improves generation quality relative to direct prompting (e.g., ROUGE-1-f: +12.6%, BERTScore_F1: +1.5%, p < 0.05). Among retrieval strategies, Rerank-only provides the best accuracy–latency balance, while the cascade pipeline introduces noise and yields no additional benefit. Under identical retrieval settings, GPT-4o achieves stronger automatic metrics and 4–5× lower latency, whereas GPT-5 receives slightly higher human preference scores (+0.08, p < 0.001), indicating a trade-off between efficiency and perceived coherence. Expert evaluation further confirms improvements in readability, accuracy, and authenticity (all p < 0.001). These findings highlight that data representation and metadata structure have a greater impact on RAG performance than retrieval algorithm complexity, offering practical guidance for reliable medical QA deployment.
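The Reciprocal Rank Fusion (RRF) strategy named in the abstract combines ranked lists from multiple retrievers by summing reciprocal ranks. A minimal sketch follows; the constant k=60 is the conventional default from the original RRF formulation, and the document IDs and retriever lists are purely illustrative, not from the study's data.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs via RRF.

    Each document's fused score is sum over rankers of 1 / (k + rank),
    where rank is 1-based. Higher fused score means better.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Return document IDs ordered by descending fused score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: fuse a dense-retrieval ranking with a BM25 ranking
dense_ranked = ["d3", "d1", "d2"]
bm25_ranked = ["d1", "d2", "d3"]
fused = reciprocal_rank_fusion([dense_ranked, bm25_ranked])
```

In the study's cascade pipeline, a fused list like this would then be passed to a Cross-Encoder reranker; the Rerank-only configuration skips the fusion step.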
Authors (5)
Muretijiang Muhetaer
Ailimulati Yusupu
Wang Yifan
Munire Mutalipu
Fan Hao
Quick Access
- Publication Year
- 2025
- Database Source
- DOAJ
- DOI
- 10.1038/s41598-025-28015-4
- Access
- Open Access ✓