Semantic Scholar Open Access 2024 3 citations

Clustering swap prediction for image-text pre-training

Sun Fayou Hea Choon Ngo Y. W. Sek Zuqiang Meng


Abstract

The pre-training strategy of multimodal models deserves close study because it has a clear impact on downstream tasks. Clustering learning has recently delivered noteworthy gains across multiple methods. However, given the open image-text pairs that are available, combining multimodal models with clustering learning is challenging. In this paper, we propose an approach that uses a clustering swap prediction strategy to learn an image-text clustering embedding space through interaction prediction between image and text features. Unlike existing models with clustering learning, our method (Clus) allows for an open number of clusters for web-scale alt-text data. Furthermore, to train the image and text encoders efficiently, we introduce a distillation learning approach and evaluate the performance of the image encoder on downstream visual tasks. In addition, Clus is pre-trained end-to-end on large-scale image-text pairs. Specifically, both text and image serve as ground truth for swap prediction, enabling effective representation learning. Extensive experiments demonstrate that Clus achieves state-of-the-art performance on multiple downstream fine-tuning and zero-shot tasks (namely Image-Text Retrieval, VQA, NLVR2, Image Captioning, Object Detection, and Semantic Segmentation).
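The abstract gives no implementation details, so the following is only a minimal sketch of what a swapped cluster prediction between image and text features could look like. It borrows the SwAV-style recipe (shared prototypes, Sinkhorn-balanced soft assignments, cross-entropy on the swapped targets), which is an assumption and not the Clus method itself. In particular, it uses a fixed number of prototypes, whereas the abstract says Clus allows an open number of clusters. All function names, shapes, and hyperparameters here are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def sinkhorn(scores, eps=0.05, n_iters=3):
    """Convert (batch, clusters) scores into soft cluster assignments.

    Alternating row/column normalization pushes the batch to spread
    across clusters; each returned row sums to 1.
    """
    q = np.exp((scores - scores.max()) / eps).T  # shape (clusters, batch)
    q /= q.sum()
    n_clusters, batch = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True)  # balance mass per cluster
        q /= n_clusters
        q /= q.sum(axis=0, keepdims=True)  # normalize per sample
        q /= batch
    return (q * batch).T  # shape (batch, clusters)


def swap_prediction_loss(img_feats, txt_feats, prototypes, temperature=0.1):
    """Assumed swapped-prediction objective for paired image/text features.

    The image branch predicts the cluster assignment computed from the
    text branch, and the text branch predicts the assignment computed
    from the image branch.
    """
    img = l2_normalize(img_feats)
    txt = l2_normalize(txt_feats)
    protos = l2_normalize(prototypes)

    img_scores = img @ protos.T  # (batch, clusters)
    txt_scores = txt @ protos.T

    # Targets are treated as constants (no gradient would flow through them).
    q_img = sinkhorn(img_scores)
    q_txt = sinkhorn(txt_scores)

    loss_img_to_txt = -(q_txt * log_softmax(img_scores / temperature)).sum(axis=1).mean()
    loss_txt_to_img = -(q_img * log_softmax(txt_scores / temperature)).sum(axis=1).mean()
    return 0.5 * (loss_img_to_txt + loss_txt_to_img)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_features = rng.normal(size=(8, 16))   # stand-in for image encoder output
    text_features = rng.normal(size=(8, 16))    # stand-in for text encoder output
    cluster_prototypes = rng.normal(size=(4, 16))
    print(swap_prediction_loss(image_features, text_features, cluster_prototypes))
```

In a real training loop the features would come from the image and text encoders and the prototypes would be learned jointly; the random arrays above only exercise the shapes.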

Topics & Keywords

Medicine

Authors (4)

Sun Fayou

Hea Choon Ngo

Y. W. Sek

Zuqiang Meng

Citation Format


Fayou, S., Ngo, H. C., Sek, Y. W., & Meng, Z. (2024). Clustering swap prediction for image-text pre-training. https://doi.org/10.1038/s41598-024-60832-x

Quick Access

View at Source: doi.org/10.1038/s41598-024-60832-x

Journal Information

Year Published: 2024
Language: en
Total Citations: 3
Source Database: Semantic Scholar
DOI: 10.1038/s41598-024-60832-x
Access: Open Access ✓