arXiv Open Access 2025

UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation

Lei Zhao, Linfeng Feng, Dongxu Ge, Rujin Chen, Fangqiu Yi, +3 others

Abstract

With the rise of diffusion models, audio-video generation has been revolutionized. However, most existing methods rely on separate modules for each modality, with limited exploration of unified generative architectures. In addition, many are confined to a single task and small-scale datasets. To overcome these limitations, we introduce UniForm, a unified multi-task diffusion transformer that generates both audio and visual modalities in a shared latent space. By using a unified denoising network, UniForm captures the inherent correlations between sound and vision. Additionally, we propose task-specific noise schemes and task tokens, enabling the model to support multiple tasks with a single set of parameters, including video-to-audio, audio-to-video and text-to-audio-video generation. Furthermore, by leveraging large language models and a large-scale text-audio-video combined dataset, UniForm achieves greater generative diversity than prior approaches. Experiments show that UniForm achieves performance close to the state-of-the-art single-task models across three generation tasks, with generated content that is not only highly aligned with real-world data distributions but also enables more diverse and fine-grained generation.
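
The abstract names task tokens and task-specific noise schemes but does not define them. The following is a minimal sketch of one plausible reading, not the authors' implementation: the modality supplied as a condition stays clean, the modality to be generated is noised, and a task token tells a single shared denoiser which case applies. The task names, token ids, `NOISE_MASK` table, and both function names are hypothetical, and the latents are plain Python lists standing in for real latent tensors.

```python
import random

# Hypothetical task names and token ids. The abstract lists the three tasks
# but does not say how task tokens are encoded.
TASK_TOKENS = {"v2a": 0, "a2v": 1, "t2av": 2}

# Assumed noise scheme: which modality latents are noised for each task.
NOISE_MASK = {
    "v2a": {"video": False, "audio": True},   # video acts as the clean condition
    "a2v": {"video": True, "audio": False},   # audio acts as the clean condition
    "t2av": {"video": True, "audio": True},   # both modalities are generated
}


def add_noise(latent, alpha, rng):
    """Blend a latent with Gaussian noise: sqrt(alpha)*x + sqrt(1-alpha)*eps."""
    return [
        (alpha ** 0.5) * x + ((1.0 - alpha) ** 0.5) * rng.gauss(0.0, 1.0)
        for x in latent
    ]


def build_denoiser_input(task, video_latent, audio_latent, alpha, seed=0):
    """Return (task_token, joint_sequence) for one shared denoising network."""
    rng = random.Random(seed)
    mask = NOISE_MASK[task]
    video = add_noise(video_latent, alpha, rng) if mask["video"] else list(video_latent)
    audio = add_noise(audio_latent, alpha, rng) if mask["audio"] else list(audio_latent)
    # Concatenate both modalities so a single network sees one joint sequence.
    return TASK_TOKENS[task], video + audio
```

Under these assumptions, `build_denoiser_input("v2a", video, audio, 0.5)` leaves the video latent untouched and noises only the audio latent, so the same parameters can serve all three tasks by switching the token and the mask.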

Topics & Keywords

cs.MM cs.AI cs.CV cs.SD eess.AS

Authors (8)

Lei Zhao

Linfeng Feng

Dongxu Ge

Rujin Chen

Fangqiu Yi

Chi Zhang

Xiao-Lei Zhang

Xuelong Li

Citation Format

Zhao, L., Feng, L., Ge, D., Chen, R., Yi, F., Zhang, C. et al. (2025). UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation. https://arxiv.org/abs/2502.03897

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓