DOAJ Open Access 2022

Research on Text Representation of Video Content Based on Multi-Modal Fusion and Multi-Layer Attention

ZHAO Hong, GUO Lan, CHEN Zhiwen, ZHENG Houze

Abstract

To address the limitations of existing video content text-representation models, namely single-modality text representation and low accuracy, a video content text-representation model that integrates frame-level image and audio information is proposed. The model's network structure includes a single-modality embedding layer based on a self-attention mechanism, which learns single-modality feature parameters. Two schemes, joint representation and cooperative representation, are adopted to fuse the high-dimensional feature vectors output by the single-modality embedding layer, so that the model can attend to different objects in the video and their interactions, thereby generating richer and more accurate textual representations of the video. The model is pretrained on large-scale datasets; the representation information carried by the video, such as video frames and audio, is extracted and fed into the encoder to produce the text representation of the video content. Experimental results on the MSR-VTT and LSMDC datasets show that the proposed model achieves BLEU-4, METEOR, ROUGE-L, and CIDEr scores of 0.386, 0.250, 0.609, and 0.463, respectively. Compared with the model released by IIT Delhi in the MSR-VTT challenge, the proposed model improves these metrics by 0.082, 0.037, 0.115, and 0.257, respectively. The proposed model effectively improves the accuracy of video content text representation.
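The abstract contrasts two fusion schemes over the modality embeddings: joint representation (modality vectors combined directly) and cooperative representation (each modality attends to the other). The paper itself provides no code, so the following is only a minimal NumPy sketch under stated assumptions: toy single-head attention with identity projections, mean pooling, and hypothetical function names (`joint_fusion`, `cooperative_fusion`) not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, d_k):
    # X: (seq_len, d_model); toy scaled dot-product self-attention
    # with identity Q/K/V projections, standing in for the
    # single-modality embedding layer
    scores = X @ X.T / np.sqrt(d_k)
    return softmax(scores) @ X

def joint_fusion(frame_emb, audio_emb):
    # joint representation (assumed form): pool each modality,
    # then concatenate into one video vector
    return np.concatenate([frame_emb.mean(axis=0), audio_emb.mean(axis=0)])

def cooperative_fusion(frame_emb, audio_emb, d_k):
    # cooperative representation (assumed form): cross-attention,
    # frames attend to audio and audio attends to frames
    f2a = softmax(frame_emb @ audio_emb.T / np.sqrt(d_k)) @ audio_emb
    a2f = softmax(audio_emb @ frame_emb.T / np.sqrt(d_k)) @ frame_emb
    return np.concatenate([f2a.mean(axis=0), a2f.mean(axis=0)])
```

Either fused vector would then be passed to the encoder-decoder that generates the text description; the actual model presumably uses learned projection matrices and multi-layer attention rather than this single identity-projection head.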

Authors

ZHAO Hong, GUO Lan, CHEN Zhiwen, ZHENG Houze

Citation Format

Zhao, H., Guo, L., Chen, Z., & Zheng, H. (2022). Research on Text Representation of Video Content Based on Multi-Modal Fusion and Multi-Layer Attention. https://doi.org/10.19678/j.issn.1000-3428.0063294

Quick Access

Journal Information
Publication Year
2022
Source Database
DOAJ
DOI
10.19678/j.issn.1000-3428.0063294
Access
Open Access ✓