arXiv Open Access 2025

VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

Haidong Xu, Guangwei Xu, Zhedong Zheng, Xiatian Zhu, Wei Ji, +5 others


Abstract

This paper introduces VimoRAG, a novel video-based retrieval-augmented motion generation framework for motion large language models (LLMs). As motion LLMs face severe out-of-domain/out-of-vocabulary issues due to limited annotated data, VimoRAG leverages large-scale in-the-wild video databases to enhance 3D motion generation by retrieving relevant 2D human motion signals. While video-based motion RAG is nontrivial, we address two key bottlenecks: (1) developing an effective motion-centered video retrieval model that distinguishes human poses and actions, and (2) mitigating the issue of error propagation caused by suboptimal retrieval results. We design the Gemini Motion Video Retriever mechanism and the Motion-centric Dual-alignment DPO Trainer, enabling effective retrieval and generation processes. Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input. All the resources are available at https://walkermitty.github.io/VimoRAG/

Topics & Keywords

cs.CV, cs.AI, cs.CL

Authors (10)

Haidong Xu

Guangwei Xu

Zhedong Zheng

Xiatian Zhu

Wei Ji

Xiangtai Li

Ruijie Guo

Meishan Zhang

Min Zhang

Hao Fei

Citation (APA)

Xu, H., Xu, G., Zheng, Z., Zhu, X., Ji, W., Li, X., Guo, R., Zhang, M., Zhang, M., & Fei, H. (2025). VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models. https://arxiv.org/abs/2508.12081


Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓