
ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval

Jiapeng Wang, Chengyu Wang, Xiaodan Wang, Jun Huang, Lianwen Jin

Abstract

Large-scale pre-trained text-image models with dual-encoder architectures (such as CLIP) are typically adopted for various vision-language applications, including text-image retrieval. However, these models are still less practical on edge devices or for real-time situations, due to the substantial indexing and inference time and the large consumption of computational resources. Although knowledge distillation techniques have been widely utilized for uni-modal model compression, how to expand them to the situation when the numbers of modalities and teachers/students are doubled has been rarely studied. In this paper, we conduct comprehensive experiments on this topic and propose the fully-Connected knowledge interaction graph (Cona) technique for cross-modal pre-training distillation. Based on our findings, the resulting ConaCLIP achieves SOTA performances on the widely-used Flickr30K and MSCOCO benchmarks under the lightweight setting. An industry application of our method on an e-commercial platform further demonstrates the significant effectiveness of ConaCLIP.
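To make the abstract's central idea concrete: in dual-encoder distillation there are four encoders in play (the teacher's text and image encoders and the student's text and image encoders), and a fully-connected knowledge interaction graph treats these as nodes and attaches a distillation supervision to every edge that can train the students. The Python sketch below illustrates this structure under stated assumptions; the symmetric InfoNCE per-edge loss, the exact edge set, and all function and variable names are illustrative choices, not the paper's released implementation.

# Sketch of a fully-connected knowledge interaction graph for
# dual-encoder distillation. Assumption: each edge is supervised
# with a symmetric InfoNCE loss; the paper explores several
# possible supervision types per edge.
import itertools

import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE between two batches of L2-normalized embeddings,
    # treating matching batch indices as positive pairs.
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def cona_loss(teacher_txt, teacher_img, student_txt, student_img):
    # Nodes of the interaction graph: frozen teacher outputs (detached)
    # plus trainable student outputs.
    nodes = {
        "t_txt": teacher_txt.detach(),
        "t_img": teacher_img.detach(),
        "s_txt": student_txt,
        "s_img": student_img,
    }
    total = 0.0
    for (name_a, a), (name_b, b) in itertools.combinations(nodes.items(), 2):
        # Skip the teacher-teacher edge: both ends are frozen, so it
        # provides no gradient to the students.
        if name_a.startswith("t_") and name_b.startswith("t_"):
            continue
        total = total + info_nce(F.normalize(a, dim=-1),
                                 F.normalize(b, dim=-1))
    return total

With four nodes this yields five supervised edges: intra-modal teacher-to-student (text-text, image-image), inter-modal teacher-to-student (text-image, image-text), and the student-student edge that mirrors CLIP's own contrastive objective.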


Authors (5)

Jiapeng Wang
Chengyu Wang
Xiaodan Wang
Jun Huang
Lianwen Jin

Citation Format

Wang, J., Wang, C., Wang, X., Huang, J., & Jin, L. (2023). ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval. arXiv preprint arXiv:2305.17652. https://arxiv.org/abs/2305.17652

Journal Information
Year Published: 2023
Language: en
Database Source: arXiv
Access: Open Access ✓