arXiv Open Access 2025

Reframing Music-Driven 2D Dance Pose Generation as Multi-Channel Image Generation

Yan Zhang, Han Zou, Lincong Feng, Cong Xie, Ruiqi Yu, Zhenpeng Zhan

Abstract

Recent pose-to-video models can translate 2D pose sequences into photorealistic, identity-preserving dance videos, so the key challenge is to generate temporally coherent, rhythm-aligned 2D poses from music, especially under complex, high-variance in-the-wild distributions. We address this by reframing music-to-dance generation as a music-token-conditioned multi-channel image synthesis problem: 2D pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone, allowing us to inherit architectural and training advances from modern text-to-image models and better capture high-variance 2D pose distributions. On top of this formulation, we introduce (i) a time-shared temporal indexing scheme that explicitly synchronizes music tokens and pose latents over time and (ii) a reference-pose conditioning strategy that preserves subject-specific body proportions and on-screen scale while enabling long-horizon segment-and-stitch generation. Experiments on a large in-the-wild 2D dance corpus and the calibrated AIST++2D benchmark show consistent improvements over representative music-to-dance methods in pose- and video-space metrics and human preference, and ablations validate the contributions of the representation, temporal indexing, and reference conditioning. See supplementary videos at https://hot-dance.github.io
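To make the "pose sequence as one-hot image" idea concrete, the following is a minimal sketch and not the authors' implementation. The layout is an assumption: one channel per joint coordinate (x and y), quantized coordinate bins along one image axis, and time along the other. The function names, the bin count, and the bin-center decoding are all hypothetical choices for illustration; the abstract does not specify them.

```python
from typing import List, Tuple

# One pose = one (x, y) pair per joint, normalized to [0, 1] (assumed convention).
Pose = List[Tuple[float, float]]
Image = List[List[List[int]]]  # indexed as [channel][bin][frame]


def quantize(value: float, num_bins: int) -> int:
    """Map a normalized coordinate in [0, 1] to a bin index in [0, num_bins - 1]."""
    clamped = min(max(value, 0.0), 1.0)
    return min(int(clamped * num_bins), num_bins - 1)


def poses_to_one_hot(poses: List[Pose], num_bins: int) -> Image:
    """Encode a pose sequence as a multi-channel one-hot image.

    Channels 2*j and 2*j + 1 hold the x and y coordinate of joint j.
    In every channel, each frame column contains exactly one 1.
    """
    num_frames = len(poses)
    num_joints = len(poses[0])
    image = [
        [[0] * num_frames for _ in range(num_bins)]
        for _ in range(2 * num_joints)
    ]
    for t, pose in enumerate(poses):
        for j, (x, y) in enumerate(pose):
            image[2 * j][quantize(x, num_bins)][t] = 1
            image[2 * j + 1][quantize(y, num_bins)][t] = 1
    return image


def one_hot_to_poses(image: Image, num_bins: int) -> List[Pose]:
    """Decode by taking the arg-max bin per column and returning its center."""
    num_channels = len(image)
    num_frames = len(image[0][0])
    poses: List[Pose] = []
    for t in range(num_frames):
        pose: Pose = []
        for j in range(num_channels // 2):
            coords = []
            for c in (2 * j, 2 * j + 1):
                column = [image[c][b][t] for b in range(num_bins)]
                best_bin = column.index(max(column))
                coords.append((best_bin + 0.5) / num_bins)
            pose.append((coords[0], coords[1]))
        poses.append(pose)
    return poses


# Two frames, two joints each (toy values).
frames = [
    [(0.25, 0.75), (0.5, 0.5)],
    [(1.0, 0.0), (0.5, 0.5)],
]
image = poses_to_one_hot(frames, num_bins=8)
decoded = one_hot_to_poses(image, num_bins=8)
```

Under this assumed layout the round trip is lossy only up to half a bin width, and the resulting tensor has the shape of an ordinary multi-channel image, which is what would let an image VAE and a DiT-style backbone consume it.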

Topics & Keywords

cs.CV

Authors (6)

Yan Zhang

Han Zou

Lincong Feng

Cong Xie

Ruiqi Yu

Zhenpeng Zhan

Citation Format

Zhang, Y., Zou, H., Feng, L., Xie, C., Yu, R., & Zhan, Z. (2025). Reframing Music-Driven 2D Dance Pose Generation as Multi-Channel Image Generation. https://arxiv.org/abs/2512.11720

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access