arXiv Open Access 2025

EVLF-FM: Explainable Vision Language Foundation Model for Medicine

Yang Bai, Haoran Cheng, Yang Zhou, Jun Zhou, Arun Thirunavukarasu, and 38 others

Abstract

Despite the promise of foundation models in medical AI, current systems remain limited: they are modality-specific and lack transparent reasoning processes, which hinders clinical adoption. To address this gap, we present EVLF-FM, a multimodal vision-language foundation model (VLM) designed to unify broad diagnostic capability with fine-grained explainability. The development and testing of EVLF-FM encompassed over 1.3 million samples from 23 global datasets across eleven imaging modalities related to six clinical specialties: dermatology, hepatology, ophthalmology, pathology, pulmonology, and radiology. External validation employed 8,884 independent test samples from 10 additional datasets across five imaging modalities. Technically, EVLF-FM is developed to assist with multi-disease diagnosis and visual question answering, with pixel-level visual grounding and reasoning capabilities. In internal validation for disease diagnostics, EVLF-FM achieved the highest average accuracy (0.858) and F1-score (0.797), outperforming leading generalist and specialist models. In medical visual grounding, EVLF-FM also performed strongly across nine modalities, with an average mIoU of 0.743 and Acc@0.5 of 0.837. External validations further confirmed strong zero-shot and few-shot performance, with competitive F1-scores despite a smaller model size. Through a hybrid training strategy combining supervised and visual reinforcement fine-tuning, EVLF-FM not only achieves state-of-the-art accuracy but also exhibits step-by-step reasoning, aligning outputs with visual evidence. EVLF-FM is an early multi-disease VLM with explainability and reasoning capabilities that could advance adoption of and trust in foundation models for real-world clinical deployment.
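
For readers unfamiliar with the grounding metrics quoted in the abstract, the following is a minimal sketch of how a mean intersection-over-union score and an accuracy-at-0.5 score are commonly computed. It is not the authors' evaluation code. It assumes axis-aligned bounding boxes in (x1, y1, x2, y2) form and counts a prediction as correct when its IoU is at least 0.5; the abstract does not state whether boxes or masks were used, nor whether the threshold is inclusive. The names `box_iou` and `grounding_metrics` are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, gt: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter

    # Two degenerate (zero-area) boxes have no defined IoU; treat as 0.
    return inter / union if union > 0 else 0.0


def grounding_metrics(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (mean IoU, fraction of samples with IoU >= threshold)."""
    if len(preds) != len(gts) or len(preds) == 0:
        raise ValueError("preds and gts must be non-empty and the same length")

    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    mean_iou = sum(ious) / len(ious)
    # Assumption: inclusive threshold; some benchmarks use a strict '>'.
    acc_at_threshold = sum(iou >= threshold for iou in ious) / len(ious)
    return mean_iou, acc_at_threshold


if __name__ == "__main__":
    # Toy data for illustration only, not taken from the paper.
    predictions = [(0, 0, 10, 10), (0, 0, 10, 10)]
    references = [(0, 0, 10, 10), (5, 0, 15, 10)]
    miou, acc = grounding_metrics(predictions, references)
    print(f"mIoU={miou:.3f}  Acc@0.5={acc:.3f}")
```

In this toy case the first prediction matches its reference exactly and the second overlaps it by one third, so only one of the two samples clears the 0.5 threshold.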

Authors (43)

Yang Bai
Haoran Cheng
Yang Zhou
Jun Zhou
Arun Thirunavukarasu
Yuhe Ke
Jie Yao
Kanae Fukutsu
Chrystie Wan Ning Quek
Ashley Hong
Laura Gutierrez
Zhen Ling Teo
Darren Shu Jeng Ting
Brian T. Soetikno
Christopher S. Nielsen
Tobias Elze
Zengxiang Li
Linh Le Dinh
Hiok Hong Chan
Victor Koh
Marcus Tan
Kelvin Z. Li
Leonard Yip
Ching Yu Cheng
Yih Chung Tham
Gavin Siew Wei Tan
Leopold Schmetterer
Marcus Ang
Rahat Hussain
Jod Mehta
Tin Aung
Lionel Tim-Ee Cheng
Tran Nguyen Tuan Anh
Chee Leong Cheng
Tien Yin Wong
Nan Liu
Iain Beehuat Tan
Soon Thye Lim
Eyal Klang
Tony Kiat Hon Lim
Rick Siow Mong Goh
Yong Liu
Daniel Shu Wei Ting

Citation Format

Bai, Y., Cheng, H., Zhou, Y., Zhou, J., Thirunavukarasu, A., Ke, Y. et al. (2025). EVLF-FM: Explainable Vision Language Foundation Model for Medicine. https://arxiv.org/abs/2509.24231

Journal Information
Year of publication: 2025
Language: en
Source database: arXiv
Access: Open Access