Image Description Generation Method by Panoptic Segmentation and Multi-Visual-Feature Fusion
Abstract
Owing to their powerful sequence-modeling capabilities, Transformer-based image captioning models have demonstrated remarkable performance. However, most of these models encode and decode region visual features, which cannot fully exploit the fine-grained information of the whole image and leads to visual-feature confusion. Accordingly, we introduce panoptic segmentation into the Transformer-based image captioning model, replacing region visual features with mask visual features, and propose a novel image captioning model based on multi-visual-feature fusion. Our model not only disentangles region visual features effectively but also exploits both mask and grid visual features to improve captioning performance. Quantitative and qualitative experiments on the MSCOCO dataset demonstrate that our method significantly outperforms existing Transformer-based image captioning models. In addition, our model enhances the interpretability of the caption-generation process. Specifically, it achieves CIDEr and BLEU-4 scores of 138.5 and 41, respectively.
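The abstract does not specify how mask and grid features are combined, so the following is only a minimal sketch of the general idea: average-pool a CNN feature map under each panoptic-segment mask to obtain per-segment "mask" tokens, flatten the same map into "grid" tokens, project both into a shared embedding space, and concatenate them into one token sequence for a Transformer encoder. All function names, projection matrices, and dimensions here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_pool(feature_map, masks):
    """Average-pool a (C, H, W) feature map under each binary segment
    mask, yielding one feature vector per panoptic segment."""
    c, _, _ = feature_map.shape
    feats = []
    for m in masks:
        area = m.sum()
        pooled = (feature_map * m).reshape(c, -1).sum(axis=1) / max(area, 1.0)
        feats.append(pooled)
    return np.stack(feats)  # (num_segments, C)

def fuse_features(mask_feats, grid_feats, w_mask, w_grid):
    """Project mask and grid features into a shared d-dim token space
    and concatenate them along the token axis (hypothetical fusion)."""
    tok_mask = mask_feats @ w_mask   # (num_segments, D)
    tok_grid = grid_feats @ w_grid   # (H*W, D)
    return np.concatenate([tok_mask, tok_grid], axis=0)

C, H, W, D = 8, 4, 4, 16
fmap = rng.standard_normal((C, H, W))
# Two toy panoptic segments: top half and bottom half of the image.
masks = [np.zeros((H, W)), np.zeros((H, W))]
masks[0][:2] = 1.0
masks[1][2:] = 1.0
grid = fmap.reshape(C, -1).T  # (H*W, C) grid tokens
tokens = fuse_features(mask_pool(fmap, masks), grid,
                       rng.standard_normal((C, D)),
                       rng.standard_normal((C, D)))
print(tokens.shape)  # → (18, 16): 2 mask tokens + 16 grid tokens
```

In this toy setup the fused sequence would then be fed to a standard Transformer encoder; because each mask token corresponds to one named segment, attention weights over those tokens can be traced back to image regions, which is the interpretability argument the abstract alludes to.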
Topics & Keywords
Authors (1)
LIU Mingming, LU Jinfu, LIU Hao, ZHANG Haiyan
Quick Access
- Publication Year
- 2024
- Source Database
- DOAJ
- DOI
- 10.19678/j.issn.1000-3428.0069303
- Access
- Open Access ✓