Image Description Generation Method by Panoptic Segmentation and Multi-Visual-Feature Fusion
Abstract
Owing to their powerful sequence-modeling capabilities, Transformer-based image captioning models have demonstrated remarkable performance. However, most of these models encode and decode region visual features, which cannot fully exploit the fine-grained information of the whole image and leads to visual-feature confusion. Accordingly, we introduce panoptic segmentation into the Transformer-based image captioning model, replacing region visual features with mask visual features, and propose a novel image captioning model based on multi-visual-feature fusion. Our model not only disentangles region visual features effectively but also exploits both mask and grid visual features to improve captioning performance. Quantitative and qualitative experiments on the MSCOCO dataset demonstrate that our method significantly outperforms existing Transformer-based image captioning models. In addition, our model enhances the interpretability of the caption-generation process. Specifically, it achieves CIDEr and BLEU-4 scores of 138.5 and 41, respectively.
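The abstract does not specify how mask and grid features are combined, so the following is only a minimal sketch of the general idea: average-pool a CNN feature map under each panoptic-segment mask to obtain per-segment "mask" tokens, flatten the same map into "grid" tokens, project both into a shared embedding space, and concatenate them into one token sequence for a Transformer encoder. All function names, projection matrices, and dimensions here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_pool(feature_map, masks):
    """Average-pool a (C, H, W) feature map under each binary segment
    mask, yielding one feature vector per panoptic segment."""
    c, _, _ = feature_map.shape
    feats = []
    for m in masks:
        area = m.sum()
        pooled = (feature_map * m).reshape(c, -1).sum(axis=1) / max(area, 1.0)
        feats.append(pooled)
    return np.stack(feats)  # (num_segments, C)

def fuse_features(mask_feats, grid_feats, w_mask, w_grid):
    """Project mask and grid features into a shared d-dim token space
    and concatenate them along the token axis (hypothetical fusion)."""
    tok_mask = mask_feats @ w_mask   # (num_segments, D)
    tok_grid = grid_feats @ w_grid   # (H*W, D)
    return np.concatenate([tok_mask, tok_grid], axis=0)

C, H, W, D = 8, 4, 4, 16
fmap = rng.standard_normal((C, H, W))
# Two toy panoptic segments: top half and bottom half of the image.
masks = [np.zeros((H, W)), np.zeros((H, W))]
masks[0][:2] = 1.0
masks[1][2:] = 1.0
grid = fmap.reshape(C, -1).T  # (H*W, C) grid tokens
tokens = fuse_features(mask_pool(fmap, masks), grid,
                       rng.standard_normal((C, D)),
                       rng.standard_normal((C, D)))
print(tokens.shape)  # → (18, 16): 2 mask tokens + 16 grid tokens
```

In this toy setup the fused sequence would then be fed to a standard Transformer encoder; because each mask token corresponds to one named segment, attention weights over those tokens can be traced back to image regions, which is the interpretability argument the abstract alludes to.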
Topics & Keywords
Authors (1)
LIU Mingming, LU Jinfu, LIU Hao, ZHANG Haiyan
Quick Access
- Publication Year
- 2024
- Source Database
- DOAJ
- DOI
- 10.19678/j.issn.1000-3428.0069303
- Access
- Open Access ✓