arXiv Open Access 2025

EVLF-FM: Explainable Vision Language Foundation Model for Medicine

Yang Bai Haoran Cheng Yang Zhou Jun Zhou Arun Thirunavukarasu +38 others

Abstract

Despite the promise of foundation models in medical AI, current systems remain limited: they are modality-specific and lack transparent reasoning processes, which hinders clinical adoption. To address this gap, we present EVLF-FM, a multimodal vision-language foundation model (VLM) designed to unify broad diagnostic capability with fine-grained explainability. The development and testing of EVLF-FM encompassed over 1.3 million samples from 23 global datasets across eleven imaging modalities related to six clinical specialties: dermatology, hepatology, ophthalmology, pathology, pulmonology, and radiology. External validation employed 8,884 independent test samples from 10 additional datasets across five imaging modalities. Technically, EVLF-FM is developed to assist with multi-disease diagnosis and visual question answering, with pixel-level visual grounding and reasoning capabilities. In internal validation for disease diagnostics, EVLF-FM achieved the highest average accuracy (0.858) and F1-score (0.797), outperforming leading generalist and specialist models. In medical visual grounding, EVLF-FM also performed strongly across nine modalities, with an average mIoU of 0.743 and Acc@0.5 of 0.837. External validation further confirmed strong zero-shot and few-shot performance, with competitive F1-scores despite a smaller model size. Through a hybrid training strategy combining supervised and visual reinforcement fine-tuning, EVLF-FM not only achieves state-of-the-art accuracy but also exhibits step-by-step reasoning, aligning outputs with visual evidence. EVLF-FM is an early multi-disease VLM with explainability and reasoning capabilities that could advance adoption of and trust in foundation models for real-world clinical deployment.
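
The grounding metrics quoted in the abstract (mIoU and Acc@0.5) can be illustrated with a minimal sketch. It assumes grounding outputs are axis-aligned bounding boxes in (x_min, y_min, x_max, y_max) form and that a prediction counts as correct when its IoU is at or above the threshold; the abstract does not state the exact protocol, and the names `box_iou` and `grounding_metrics` are hypothetical, not taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, target: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(pred[0], target[0])
    iy_min = max(pred[1], target[1])
    ix_max = min(pred[2], target[2])
    iy_max = min(pred[3], target[3])
    # Clamp at zero so disjoint boxes give no intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (mean IoU, fraction of samples with IoU >= threshold)."""
    if len(preds) != len(targets) or len(preds) == 0:
        raise ValueError("preds and targets must be non-empty and equal length")
    ious = [box_iou(p, t) for p, t in zip(preds, targets)]
    miou = sum(ious) / len(ious)
    # Assumption: "correct" means IoU at or above the threshold.
    acc = sum(1 for iou in ious if iou >= threshold) / len(ious)
    return miou, acc


# Toy data, not from the paper: one exact match and one half-shifted box.
toy_preds = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
toy_targets = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0)]
miou, acc_at_05 = grounding_metrics(toy_preds, toy_targets)
print(f"mIoU={miou:.3f}, Acc@0.5={acc_at_05:.3f}")
```

In this toy case the second box overlaps its target with IoU 1/3, so it lowers the mean IoU while also failing the 0.5 threshold. This is why the two numbers are reported together: mIoU reflects average overlap quality, and Acc@0.5 reflects how often localization is acceptable.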

Topics & Keywords

cs.CV

Authors (43)

Yang Bai

Haoran Cheng

Yang Zhou

Jun Zhou

Arun Thirunavukarasu

Yuhe Ke

Jie Yao

Kanae Fukutsu

Chrystie Wan Ning Quek

Ashley Hong

Laura Gutierrez

Zhen Ling Teo

Darren Shu Jeng Ting

Brian T. Soetikno

Christopher S. Nielsen

Tobias Elze

Zengxiang Li

Linh Le Dinh

Hiok Hong Chan

Victor Koh

Marcus Tan

Kelvin Z. Li

Leonard Yip

Ching Yu Cheng

Yih Chung Tham

Gavin Siew Wei Tan

Leopold Schmetterer

Marcus Ang

Rahat Hussain

Jod Mehta

Tin Aung

Lionel Tim-Ee Cheng

Tran Nguyen Tuan Anh

Chee Leong Cheng

Tien Yin Wong

Nan Liu

Iain Beehuat Tan

Soon Thye Lim

Eyal Klang

Tony Kiat Hon Lim

Rick Siow Mong Goh

Yong Liu

Daniel Shu Wei Ting

Citation Format

Bai, Y., Cheng, H., Zhou, Y., Zhou, J., Thirunavukarasu, A., Ke, Y. et al. (2025). EVLF-FM: Explainable Vision Language Foundation Model for Medicine. https://arxiv.org/abs/2509.24231

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓