arXiv Open Access 2022

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, +8 others

Abstract

Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds, so accurately recognizing ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which uses visual information to help describe ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate these visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.
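
The abstract does not spell out how the adaptive integration is computed. As a rough illustration only, one common way to blend two attention contexts is a sigmoid gate; the sketch below assumes that approach and is not the authors' architecture. The names `attend` and `gated_fusion`, the scalar gate, and the `gate_w` weights (standing in for learned parameters) are all hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, features):
    # Scaled dot-product attention of one query over a list of feature vectors.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in features])
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]

def gated_fusion(query, audio_feats, visual_feats, gate_w, gate_b=0.0):
    # Assumption: a single scalar gate decides how much of each modality to keep.
    audio_ctx = attend(query, audio_feats)
    visual_ctx = attend(query, visual_feats)
    # List concatenation of the two contexts feeds the (hypothetical) gate weights.
    z = dot(gate_w, audio_ctx + visual_ctx) + gate_b
    gate = 1.0 / (1.0 + math.exp(-z))
    fused = [gate * a + (1.0 - gate) * v for a, v in zip(audio_ctx, visual_ctx)]
    return fused, gate

# Toy inputs: one decoder query, two audio frames, two video frames.
query = [1.0, 0.0]
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[0.5, 0.5], [0.2, 0.8]]
fused, gate = gated_fusion(query, audio, visual, gate_w=[0.0, 0.0, 0.0, 0.0])
```

With all-zero gate weights the gate sits at 0.5, so the fused vector is the plain average of the audio and visual contexts; in a trained model the gate would shift toward whichever modality is more informative for the current word.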

Topics & Keywords

eess.AS cs.AI cs.MM cs.SD

Authors (13)

Xubo Liu

Qiushi Huang

Xinhao Mei

Haohe Liu

Qiuqiang Kong

Jianyuan Sun

Shengchen Li

Tom Ko

Yu Zhang

Lilian H. Tang

Mark D. Plumbley

Volkan Kılıç

Wenwu Wang

Citation Format

Liu, X., Huang, Q., Mei, X., Liu, H., Kong, Q., Sun, J. et al. (2022). Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention. https://arxiv.org/abs/2210.16428

Journal Information

Publication Year: 2022
Language: en
Database Source: arXiv
Access: Open Access ✓