Visual Instruction Tuning

Haotian Liu Chunyuan Li Qingyang Wu Yong Jae Lee

Abstract

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
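The abstract's central architectural idea is connecting a vision encoder to an LLM so that image features enter the language model as ordinary token embeddings; in the paper this bridge is a single trainable linear projection from the vision encoder's patch features into the LLM's word-embedding space. Below is a minimal PyTorch sketch of that connection; the module name, dimensions, and toy forward pass are illustrative assumptions, not the released LLaVA code.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Sketch of a LLaVA-style bridge: visual patch features are linearly
    projected into the LLM's token-embedding space so they can be
    concatenated with text-token embeddings. Dimensions are illustrative."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA (v1) trains a single linear layer mapping visual
        # features Z_v to language-embedding tokens H_v.
        self.projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim), e.g. the grid
        # features from a CLIP ViT-L/14 encoder.
        return self.projection(patch_features)  # (batch, num_patches, llm_dim)

# Toy usage: project 256 visual patches, then prepend them to text embeddings.
connector = VisionLanguageConnector()
visual_tokens = connector(torch.randn(1, 256, 1024))
text_embeddings = torch.randn(1, 32, 4096)  # stand-in for LLM word embeddings
llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)
print(llm_input.shape)  # torch.Size([1, 288, 4096])
```

During instruction tuning, the LLM then autoregressively predicts the answer tokens conditioned on this combined visual-plus-text input sequence.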

Authors (4)

Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee

Citation Format

Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. https://doi.org/10.48550/arXiv.2304.08485

Quick Access

View at Source: doi.org/10.48550/arXiv.2304.08485
Journal Information
Publication Year
2023
Language
en
Total Citations
8745
Source Database
Semantic Scholar
DOI
10.48550/arXiv.2304.08485
Access
Open Access