DOAJ Open Access 2024

Image Description Generation Method by Panoptic Segmentation and Multi-Visual-Feature Fusion

LIU Mingming, LU Jinfu, LIU Hao, ZHANG Haiyan

Abstract

Owing to their powerful sequence modeling capabilities, Transformer-based image captioning models have demonstrated remarkable performance. However, most of these models rely on region visual features for encoding and decoding, which cannot fully exploit the fine-grained information of the whole image and leads to visual feature confusion. To address this, we introduce panoptic segmentation into the Transformer-based image captioning model, replacing region visual features with mask visual features, and propose a novel image captioning model based on multi-visual-feature fusion. Our model not only disentangles the region visual features effectively but also uses both mask and grid visual features to improve image captioning performance. Quantitative and qualitative experiments on the MSCOCO dataset demonstrate that our method significantly outperforms existing Transformer-based image captioning models, achieving CIDEr and BLEU-4 scores of 138.5 and 41, respectively. In addition, our model enhances the interpretability of the caption generation process.
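
The abstract does not say how the mask visual features are computed or how they are fused with the grid features. As a minimal sketch, assuming masked average pooling over a grid feature map and plain concatenation of the resulting tokens (both are illustrative choices, not the authors' confirmed method, and the names `masked_average_pool` and `build_visual_tokens` are hypothetical), the general idea might look like this:

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # H x W grid of C-dimensional feature vectors
Mask = List[List[int]]     # H x W binary mask (1 = cell belongs to the segment)


def masked_average_pool(grid: Grid, mask: Mask) -> Vector:
    """Average the grid feature vectors over the cells where mask == 1.

    Assumption: one pooled vector per panoptic segment stands in for a
    "mask visual feature"; the paper may compute it differently.
    """
    channels = len(grid[0][0])
    total = [0.0] * channels
    count = 0
    for feature_row, mask_row in zip(grid, mask):
        for vec, keep in zip(feature_row, mask_row):
            if keep:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        # Empty segment: return zeros rather than dividing by zero.
        return total
    return [t / count for t in total]


def build_visual_tokens(grid: Grid, masks: List[Mask]) -> List[Vector]:
    """Return one token per segment mask followed by the flattened grid tokens.

    Assumption: "fusion" is modeled here as simple concatenation of the two
    token sets along the sequence axis.
    """
    mask_tokens = [masked_average_pool(grid, m) for m in masks]
    grid_tokens = [vec for row in grid for vec in row]
    return mask_tokens + grid_tokens


# Toy 2x2 grid with 2-channel features and two non-overlapping segments.
grid = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 2.0], [0.0, 4.0]],
]
masks = [
    [[1, 1], [0, 0]],  # segment covering the top row
    [[0, 0], [1, 1]],  # segment covering the bottom row
]
tokens = build_visual_tokens(grid, masks)
```

Because panoptic segmentation assigns each pixel to exactly one segment, the masks do not overlap, so each pooled token in this sketch draws on a disjoint set of grid cells, unlike features pooled from overlapping region boxes.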

Citation Format

Liu, M., Lu, J., Liu, H., & Zhang, H. (2024). Image Description Generation Method by Panoptic Segmentation and Multi-Visual-Feature Fusion. https://doi.org/10.19678/j.issn.1000-3428.0069303

Journal Information
Publication Year: 2024
Source Database: DOAJ
DOI: 10.19678/j.issn.1000-3428.0069303
Access: Open Access