arXiv Open Access 2025

ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos

Trinh T. L. Vuong Jin Tae Kwak
Lihat Sumber

Abstrak

We present ViDRiP-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct image scenarios, including single patch images, automatically segmented pathology video clips, and manually segmented pathology videos. This integration closely mirrors the natural diagnostic process of pathologists. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, ViDRiP-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the ViDRiP-Instruct dataset, comprising 4278 video and diagnosis-specific chain-of-thought instructional pairs sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and limited in volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, followed by fine-tuning on manually segmented videos. ViDRiP-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at: https://github.com/QuIIL/ViDRiP-LLaVA.

Topik & Kata Kunci

Penulis (2)

T

Trinh T. L. Vuong

J

Jin Tae Kwak

Format Sitasi

Vuong, T.T.L., Kwak, J.T. (2025). ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos. https://arxiv.org/abs/2505.04192

Akses Cepat

Lihat di Sumber
Informasi Jurnal
Tahun Terbit
2025
Bahasa
en
Sumber Database
arXiv
Akses
Open Access ✓