arXiv Open Access 2025

MedicalNarratives: Connecting Medical Vision and Language with Localized Narratives

Wisdom O. Ikezogwo, Kevin Zhang, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo, Linda Shapiro, Ranjay Krishna

Abstract

Multi-modal models are data hungry. While datasets with natural images are abundant, medical image datasets cannot afford the same luxury. To enable representation learning for medical images at scale, we turn to YouTube, a platform with a large reservoir of open-source medical pedagogical videos. We curate MedicalNarratives, a dataset of 4.7M medical image-text pairs, with 1M samples containing dense annotations in the form of spatial traces (and bounding boxes), and 118K videos centered on the trace event (with aligned text), enabling spatiotemporal grounding beyond single frames. Similar to $\textit{think-aloud}$ studies where instructors speak while hovering their mouse cursor over relevant image regions, 1M images in MedicalNarratives contain localized mouse traces in image pixels, creating a spatial and temporal association between the text and pixels. To evaluate the utility of MedicalNarratives, we train GenMedClip with a CLIP-like objective using our dataset, which spans 12 medical domains. GenMedClip outperforms previous state-of-the-art models on all 12 domains of a newly constructed medical imaging benchmark. $\href{https://huggingface.co/datasets/wisdomik/MedicalNarratives}{[Data]}$
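The abstract states that GenMedClip is trained with a CLIP-like objective on image-text pairs. A minimal NumPy sketch of such a symmetric contrastive (InfoNCE) loss is below; the function name, signature, and temperature value are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over an in-batch similarity matrix.

    image_emb, text_emb: (N, D) arrays where row i of each is a matched pair.
    Matched pairs sit on the diagonal of the similarity matrix and act as
    the positive class; all other rows/columns are in-batch negatives.
    """
    # L2-normalize so dot products become cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) scaled similarity matrix

    def _xent_diag(l):
        # cross-entropy with the diagonal as the target class
        n = l.shape[0]
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (_xent_diag(logits) + _xent_diag(logits.T))
```

With perfectly aligned pairs (each image embedding equal to its text embedding and orthogonal to the rest), the loss approaches zero; shuffling the text rows so pairs mismatch drives it up sharply.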

Authors (6)

Wisdom O. Ikezogwo, Kevin Zhang, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo, Linda Shapiro, Ranjay Krishna

Citation Format

Ikezogwo, W.O., Zhang, K., Seyfioglu, M.S., Ghezloo, F., Shapiro, L., Krishna, R. (2025). MedicalNarratives: Connecting Medical Vision and Language with Localized Narratives. https://arxiv.org/abs/2501.04184

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓