arXiv Open Access 2025

MusicInfuser: Making Video Diffusion Listen and Dance

Susung Hong, Ira Kemelmacher-Shlizerman, Brian Curless, Steven M. Seitz

Abstract

We introduce MusicInfuser, an approach that aligns pre-trained text-to-video diffusion models to generate high-quality dance videos synchronized with specified music tracks. Rather than training a multimodal audio-video or audio-motion model from scratch, our method demonstrates how existing video diffusion models can be efficiently adapted to align with musical inputs. We propose a novel layer-wise adaptability criterion based on a guidance-inspired constructive influence function to select adaptable layers, significantly reducing training costs while preserving rich prior knowledge, even with limited, specialized datasets. Experiments show that MusicInfuser effectively bridges the gap between music and video, generating novel and diverse dance movements that respond dynamically to music. Furthermore, our framework generalizes well to unseen music tracks, longer video sequences, and unconventional subjects, outperforming baseline models in consistency and synchronization. All of this is achieved without requiring motion data, with training completed on a single GPU within a day.
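The abstract describes a layer-wise adaptability criterion: a guidance-inspired "constructive influence" score is used to pick which layers of the pre-trained video diffusion model to fine-tune, freezing the rest to cut training cost. The paper's exact formulation is not given here, so the sketch below is a hypothetical toy version of that idea: it scores each layer by how strongly its activations shift when the music condition is present versus absent (a classifier-free-guidance-style contrast), then selects the top-k most responsive layers for adaptation. The function names, the norm-ratio score, and the top-k rule are illustrative assumptions, not the authors' actual criterion.

```python
import numpy as np

def layer_influence_scores(acts_cond, acts_uncond):
    """Hypothetical guidance-inspired influence score per layer:
    relative shift of each layer's activation when the audio
    condition is added (larger shift = more music-responsive)."""
    return [float(np.linalg.norm(c - u) / np.linalg.norm(u))
            for c, u in zip(acts_cond, acts_uncond)]

def select_adaptable_layers(scores, k):
    """Pick the k layers most responsive to the condition; in the
    paper's setting only such layers would receive fine-tuning
    updates, the rest stay frozen to preserve the video prior."""
    return sorted(range(len(scores)), key=lambda i: scores[i],
                  reverse=True)[:k]

# Toy demo: 3 layers, layer 2 reacts strongly to the music condition.
uncond = [np.ones(4) for _ in range(3)]
cond = [u.copy() for u in uncond]
cond[0] += 0.1   # weak response
cond[2] += 1.0   # strong response
scores = layer_influence_scores(cond, uncond)
adaptable = select_adaptable_layers(scores, k=1)
```

Here `adaptable` would contain layer index 2, the layer whose features move most under the music condition and hence, under this toy criterion, the best candidate for adaptation.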


Citation Format

Hong, S., Kemelmacher-Shlizerman, I., Curless, B., & Seitz, S. M. (2025). MusicInfuser: Making Video Diffusion Listen and Dance. https://arxiv.org/abs/2503.14505

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓