arXiv Open Access 2023

FoodLMM: A Versatile Food Assistant using Large Multi-modal Model

Yuehao Yin, Huiyan Qi, Bin Zhu, Jingjing Chen, Yu-Gang Jiang, Chong-Wah Ngo

Abstract

Large Multi-modal Models (LMMs) have made impressive progress in many vision-language tasks. Nevertheless, the performance of general LMMs in specific domains is still far from satisfactory. This paper proposes FoodLMM, a versatile food assistant based on LMMs with various capabilities, including food recognition, ingredient recognition, recipe generation, nutrition estimation, food segmentation and multi-round conversation. To facilitate FoodLMM to deal with tasks beyond pure text output, we introduce a series of novel task-specific tokens and heads, enabling the model to predict food nutritional values and multiple segmentation masks. We adopt a two-stage training strategy. In the first stage, we utilize multiple public food benchmarks for multi-task learning by leveraging the instruct-following paradigm. In the second stage, we construct a multi-round conversation dataset and a reasoning segmentation dataset to fine-tune the model, enabling it to conduct professional dialogues and generate segmentation masks based on complex reasoning in the food domain. Our fine-tuned FoodLMM achieves state-of-the-art results across several food benchmarks. We will make our code, models and datasets publicly available.

Authors (6)

Yuehao Yin
Huiyan Qi
Bin Zhu
Jingjing Chen
Yu-Gang Jiang
Chong-Wah Ngo

Citation Format

Yin, Y., Qi, H., Zhu, B., Chen, J., Jiang, Y.-G., & Ngo, C.-W. (2023). FoodLMM: A Versatile Food Assistant using Large Multi-modal Model. https://arxiv.org/abs/2312.14991

Journal Information
Year Published: 2023
Language: en
Source Database: arXiv
Access: Open Access ✓