arXiv Open Access 2024

Read Like a Radiologist: Efficient Vision-Language Model for 3D Medical Imaging Interpretation

Changsun Lee, Sangjoon Park, Cheong-Il Shin, Woo Hee Choi, Hyun Jeong Park, +2 others

Abstract

Recent medical vision-language models (VLMs) have shown promise in 2D medical image interpretation. However, extending them to 3D medical imaging has been challenging due to computational complexity and data scarcity. Although a few recent VLMs designed for 3D medical imaging have emerged, all are limited to learning the volumetric representation of a 3D medical image as a set of sub-volumetric features. This process introduces overly correlated representations along the z-axis that neglect slice-specific clinical details, particularly for 3D medical images where adjacent slices have low redundancy. To address this limitation, we introduce MS-VLM, which mimics radiologists' workflow in 3D medical image interpretation. Specifically, radiologists analyze 3D medical images by examining individual slices sequentially and synthesizing information across slices and views. Likewise, MS-VLM leverages self-supervised 2D transformer encoders to learn a volumetric representation that captures inter-slice dependencies from a sequence of slice-specific features. Unbound by sub-volumetric patchification, MS-VLM can obtain useful volumetric representations from 3D medical images with any slice length and from multiple images acquired from different planes and phases. We evaluate MS-VLM on the publicly available chest CT dataset CT-RATE and on an in-house rectal MRI dataset. In both scenarios, MS-VLM surpasses existing methods in radiology report generation, producing more coherent and clinically relevant reports. These findings highlight the potential of MS-VLM to advance 3D medical image interpretation and improve the robustness of medical VLMs.
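The slice-wise idea in the abstract (encode each 2D slice separately, then combine the slice features into one volumetric representation) can be illustrated with a toy sketch. This is not the authors' implementation: `encode_slice`, `aggregate_slices`, `encode_volume`, and `make_volume` are hypothetical names, the per-slice "encoder" is a stand-in that computes simple pixel statistics rather than a self-supervised 2D transformer, and the cross-slice step is a plain attention-style pooling assumed here only for illustration.

```python
import math


def encode_slice(slice_2d):
    """Toy stand-in for a 2D slice encoder: returns [mean, min, max] of the pixels."""
    pixels = [p for row in slice_2d for p in row]
    return [sum(pixels) / len(pixels), min(pixels), max(pixels)]


def aggregate_slices(features):
    """Attention-style pooling over slice features (an assumption, not the paper's module).

    The query is the mean slice feature; each slice gets a softmax weight
    from its scaled dot product with that query.
    """
    n, dim = len(features), len(features[0])
    query = [sum(f[d] for f in features) / n for d in range(dim)]
    scores = [sum(q * x for q, x in zip(query, f)) / math.sqrt(dim) for f in features]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


def encode_volume(volume):
    """Encode every slice independently, then pool across the slice axis."""
    features = [encode_slice(s) for s in volume]
    return aggregate_slices(features)


def make_volume(num_slices, size=4):
    """Synthetic volume as nested lists indexed [z][y][x]; values are arbitrary."""
    return [
        [[((z + y + x) % 5) / 4.0 for x in range(size)] for y in range(size)]
        for z in range(num_slices)
    ]


# Volumes with different slice counts map to vectors of the same length.
short_vec, short_w = encode_volume(make_volume(3))
long_vec, long_w = encode_volume(make_volume(7))
print(len(short_vec), len(long_vec))
```

Because the slice axis is handled as a sequence rather than cut into fixed-depth sub-volumes, the pooled vector has the same length for the 3-slice and the 7-slice volume, which is the property the abstract describes as supporting any slice length.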

Authors (7)

Changsun Lee

Sangjoon Park

Cheong-Il Shin

Woo Hee Choi

Hyun Jeong Park

Jeong Eun Lee

Jong Chul Ye

Citation Format

Lee, C., Park, S., Shin, C., Choi, W.H., Park, H.J., Lee, J.E. et al. (2024). Read Like a Radiologist: Efficient Vision-Language Model for 3D Medical Imaging Interpretation. https://arxiv.org/abs/2412.13558

Journal Information
Publication Year: 2024
Language: en
Source Database: arXiv
Access: Open Access