arXiv Open Access 2021

Learning Audio-Visual Dereverberation

Changan Chen Wei Sun David Harwath Kristen Grauman

Abstract

Reverberation not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a human speaker reveals important cues about the room geometry, materials, and speaker location, all of which influence the precise reverberation effects. We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed monaural sound and visual scene. In support of this new task, we develop a large-scale dataset SoundSpaces-Speech that uses realistic acoustic renderings of speech in real-world 3D scans of homes offering a variety of room acoustics. Demonstrating our approach on both simulated and real imagery for speech enhancement, speech recognition, and speaker identification, we show it achieves state-of-the-art performance and substantially improves over audio-only methods.
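As background for the abstract above: reverberant speech is commonly modeled as clean ("dry") speech convolved with a room impulse response (RIR), and dereverberation tries to undo that effect. The sketch below is only a toy illustration of that forward model in plain Python. It is not VIDA and not the SoundSpaces-Speech renderer; the function names `synthetic_rir` and `convolve`, the decaying-noise RIR, and all parameter values are assumptions made for illustration.

```python
import math
import random


def synthetic_rir(sample_rate, rt60, length_s, seed=0):
    """Toy RIR (assumption): a direct-path spike followed by decaying noise.

    rt60 is the time in seconds for the tail to decay by 60 dB.
    """
    rng = random.Random(seed)
    n = int(sample_rate * length_s)
    # A 60 dB drop is an amplitude factor of 1000: exp(-k * rt60) = 1e-3.
    k = math.log(1000.0) / rt60
    rir = [0.3 * rng.gauss(0.0, 1.0) * math.exp(-k * i / sample_rate)
           for i in range(n)]
    rir[0] = 1.0  # direct path from speaker to microphone
    return rir


def convolve(signal, kernel):
    """Full linear convolution; output length is len(signal) + len(kernel) - 1."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue  # silent samples contribute nothing
        for j, h in enumerate(kernel):
            out[i + j] += s * h
    return out


# Hypothetical usage: a short 440 Hz tone stands in for dry speech.
sample_rate = 8000
dry = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(80)]
rir = synthetic_rir(sample_rate, rt60=0.3, length_s=0.05)
wet = convolve(dry, rir)  # the reverberant signal a microphone would record
```

A longer `rt60` gives a slower-decaying tail and therefore more smearing of the dry signal. In a real room the RIR depends on geometry, materials, and speaker location, which are the cues the abstract says the visual stream can reveal.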

Topics & Keywords

cs.SD cs.CV cs.LG eess.AS

Citation Format

Chen, C., Sun, W., Harwath, D., & Grauman, K. (2021). Learning Audio-Visual Dereverberation. https://arxiv.org/abs/2106.07732

Journal Information

Publication Year: 2021
Language: en
Source Database: arXiv
Access: Open Access ✓