arXiv Open Access 2024

Video-Guided Foley Sound Generation with Multimodal Controls

Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, and 2 others

Abstract

Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources, as well as flexible control over the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation. Through automated evaluations and human studies, we demonstrate that MultiFoley successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods. Please see our project page for video results: https://ificl.github.io/MultiFoley/

Authors (7)

Ziyang Chen
Prem Seetharaman
Bryan Russell
Oriol Nieto
David Bourgin
Andrew Owens
Justin Salamon

Citation Format

Chen, Z., Seetharaman, P., Russell, B., Nieto, O., Bourgin, D., Owens, A., & Salamon, J. (2024). Video-Guided Foley Sound Generation with Multimodal Controls. arXiv. https://arxiv.org/abs/2411.17698

Quick Access

View at Source
Journal Information
Year Published
2024
Language
en
Source Database
arXiv
Access
Open Access ✓