arXiv Open Access 2025

EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery

Guankun Wang Long Bai Junyi Wang Kun Yuan Zhen Li +11 lainnya

Lihat Sumber

Abstrak

Recently, Multimodal Large Language Models (MLLMs) have demonstrated their immense potential in computer-aided diagnosis and decision-making. In the context of robotic-assisted surgery, MLLMs can serve as effective tools for surgical training and guidance. However, there is still a lack of MLLMs specialized for surgical scene understanding in clinical applications. In this work, we introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding that surgeons encounter. To train our EndoChat, we construct the Surg-396K dataset through a novel pipeline that systematically extracts surgical information and generates structured annotations based on collected large-scale endoscopic surgery datasets. Furthermore, we introduce a multi-scale visual token interaction mechanism and a visual contrast-based reasoning mechanism to enhance the model's representation learning and reasoning capabilities. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks. Additionally, we conduct evaluations with professional surgeons, most of whom provide positive feedback on collaborating with EndoChat. Overall, these results demonstrate that our EndoChat has great potential to significantly advance training and automation in robotic-assisted surgery.

Topik & Kata Kunci

cs.CV

Penulis (16)

Guankun Wang

Long Bai

Junyi Wang

Kun Yuan

Zhen Li

Tianxu Jiang

Xiting He

Jinlin Wu

Zhen Chen

Zhen Lei

Hongbin Liu

Jiazheng Wang

Fan Zhang

Nicolas Padoy

Nassir Navab

Hongliang Ren

Format Sitasi

APA MLA BibTeX

Wang, G., Bai, L., Wang, J., Yuan, K., Li, Z., Jiang, T. et al. (2025). EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery. https://arxiv.org/abs/2501.11347

Akses Cepat

Lihat di Sumber

Informasi Jurnal

Tahun Terbit: 2025
Bahasa: en
Sumber Database: arXiv
Akses: Open Access ✓