arXiv Open Access 2024

Video-Guided Foley Sound Generation with Multimodal Controls

Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, and 2 others

Abstract

Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources, as well as flexible control over the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation. Through automated evaluations and human studies, we demonstrate that MultiFoley successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods. Please see our project page for video results: https://ificl.github.io/MultiFoley/

Authors (7)

Ziyang Chen
Prem Seetharaman
Bryan Russell
Oriol Nieto
David Bourgin
Andrew Owens
Justin Salamon

Citation Format

Chen, Z., Seetharaman, P., Russell, B., Nieto, O., Bourgin, D., Owens, A., & Salamon, J. (2024). Video-Guided Foley Sound Generation with Multimodal Controls. arXiv. https://arxiv.org/abs/2411.17698

Quick Access

View at Source
Journal Information
Year Published
2024
Language
en
Source Database
arXiv
Access
Open Access ✓