Semantic Scholar, Open Access, 2023, 661 citations

Otter: A Multi-Modal Model With In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, +1 other

Abstract

Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing works focus on responding to individual instructions or using previous dialogues for contextual understanding. There is little discussion on employing both images and text as in-context examples to enhance the instruction following capability. To bridge this gap, we introduce the Otter model to leverage both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with Perceiver architecture, and has been instruction tuned for general purpose multi-modal assistant. Otter seamlessly processes multi-modal inputs, supporting modalities including text, multiple images, and dynamic video content. To support the training of Otter, we present the MIMIC-IT (MultI-Modal In-Context Instruction Tuning) dataset, which encompasses over 3 million multi-modal instruction-response pairs, including approximately 2.2 million unique instructions across a broad spectrum of images and videos. MIMIC-IT has been carefully curated to feature a diverse array of in-context examples for each entry. Comprehensive evaluations suggest that instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities. Notably, the extensive scenario coverage provided by the MIMIC-IT dataset empowers the Otter model to excel in tasks involving complex video and multi-image understanding.

Authors (6)

Bo Li
Yuanhan Zhang
Liangyu Chen
Jinghao Wang
Jingkang Yang
Ziwei Liu

Citation Format

Li, B., Zhang, Y., Chen, L., Wang, J., Yang, J., Liu, Z. (2023). Otter: A Multi-Modal Model With In-Context Instruction Tuning. https://doi.org/10.1109/TPAMI.2025.3571946

Journal Information
Publication Year: 2023
Language: en
Total Citations: 661
Database Source: Semantic Scholar
DOI: 10.1109/TPAMI.2025.3571946
Access: Open Access