arXiv Open Access 2025

MusicInfuser: Making Video Diffusion Listen and Dance

Susung Hong, Ira Kemelmacher-Shlizerman, Brian Curless, Steven M. Seitz

Abstract

We introduce MusicInfuser, an approach that aligns pre-trained text-to-video diffusion models to generate high-quality dance videos synchronized with specified music tracks. Rather than training a multimodal audio-video or audio-motion model from scratch, our method demonstrates how existing video diffusion models can be efficiently adapted to align with musical inputs. We propose a novel layer-wise adaptability criterion based on a guidance-inspired constructive influence function to select adaptable layers, significantly reducing training costs while preserving rich prior knowledge, even with limited, specialized datasets. Experiments show that MusicInfuser effectively bridges the gap between music and video, generating novel and diverse dance movements that respond dynamically to music. Furthermore, our framework generalizes well to unseen music tracks, longer video sequences, and unconventional subjects, outperforming baseline models in consistency and synchronization. All of this is achieved without requiring motion data, with training completed on a single GPU within a day.
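The abstract describes a layer-wise adaptability criterion: a guidance-inspired "constructive influence" score is used to pick which layers of the pre-trained video diffusion model to fine-tune, freezing the rest to cut training cost. The paper's exact formulation is not given here, so the sketch below is a hypothetical toy version of that idea: it scores each layer by how strongly its activations shift when the music condition is present versus absent (a classifier-free-guidance-style contrast), then selects the top-k most responsive layers for adaptation. The function names, the norm-ratio score, and the top-k rule are illustrative assumptions, not the authors' actual criterion.

```python
import numpy as np

def layer_influence_scores(acts_cond, acts_uncond):
    """Hypothetical guidance-inspired influence score per layer:
    relative shift of each layer's activation when the audio
    condition is added (larger shift = more music-responsive)."""
    return [float(np.linalg.norm(c - u) / np.linalg.norm(u))
            for c, u in zip(acts_cond, acts_uncond)]

def select_adaptable_layers(scores, k):
    """Pick the k layers most responsive to the condition; in the
    paper's setting only such layers would receive fine-tuning
    updates, the rest stay frozen to preserve the video prior."""
    return sorted(range(len(scores)), key=lambda i: scores[i],
                  reverse=True)[:k]

# Toy demo: 3 layers, layer 2 reacts strongly to the music condition.
uncond = [np.ones(4) for _ in range(3)]
cond = [u.copy() for u in uncond]
cond[0] += 0.1   # weak response
cond[2] += 1.0   # strong response
scores = layer_influence_scores(cond, uncond)
adaptable = select_adaptable_layers(scores, k=1)
```

Here `adaptable` would contain layer index 2, the layer whose features move most under the music condition and hence, under this toy criterion, the best candidate for adaptation.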


Citation Format

Hong, S., Kemelmacher-Shlizerman, I., Curless, B., & Seitz, S. M. (2025). MusicInfuser: Making Video Diffusion Listen and Dance. https://arxiv.org/abs/2503.14505

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓