Semantic Scholar, Open Access, 2024, 202 citations

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, and 7 others

Abstract

We present GR-2, a state-of-the-art generalist robot agent for versatile and generalizable robot manipulation. GR-2 is first pre-trained on a vast number of Internet videos to capture the dynamics of the world. This large-scale pre-training, involving 38 million video clips and over 50 billion tokens, equips GR-2 with the ability to generalize across a wide range of robotic tasks and environments during subsequent policy learning. Following this, GR-2 is fine-tuned for both video generation and action prediction using robot trajectories. It exhibits impressive multi-task learning capabilities, achieving an average success rate of 97.7% across more than 100 tasks. Moreover, GR-2 demonstrates exceptional generalization to new, previously unseen scenarios, including novel backgrounds, environments, objects, and tasks. Notably, GR-2 scales effectively with model size, underscoring its potential for continued growth and application. Project page: https://gr2-manipulation.github.io

Topics & Keywords

Computer Science

Authors (12)

Chi-Lam Cheang

Guangzeng Chen

Ya Jing

Tao Kong

Hang Li

Yifeng Li

Yuxiao Liu

Hongtao Wu

Jiafeng Xu

Yichu Yang

Hanbo Zhang

Minzhao Zhu

Citation Format

Cheang, C., Chen, G., Jing, Y., Kong, T., Li, H., Li, Y. et al. (2024). GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation. https://doi.org/10.48550/arXiv.2410.06158

Journal Information

Year of publication: 2024
Language: en
Total citations: 202
Source database: Semantic Scholar
DOI: 10.48550/arXiv.2410.06158
Access: Open Access ✓