Semantic Scholar · Open Access · 2024 · 202 citations

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, +7 others

Abstract

We present GR-2, a state-of-the-art generalist robot agent for versatile and generalizable robot manipulation. GR-2 is first pre-trained on a vast number of Internet videos to capture the dynamics of the world. This large-scale pre-training, involving 38 million video clips and over 50 billion tokens, equips GR-2 with the ability to generalize across a wide range of robotic tasks and environments during subsequent policy learning. Following this, GR-2 is fine-tuned for both video generation and action prediction using robot trajectories. It exhibits impressive multi-task learning capabilities, achieving an average success rate of 97.7% across more than 100 tasks. Moreover, GR-2 demonstrates exceptional generalization to new, previously unseen scenarios, including novel backgrounds, environments, objects, and tasks. Notably, GR-2 scales effectively with model size, underscoring its potential for continued growth and application. Project page: \url{https://gr2-manipulation.github.io}.

Topics & Keywords

Authors (12)

Chi-Lam Cheang

Guangzeng Chen

Ya Jing

Tao Kong

Hang Li

Yifeng Li

Yuxiao Liu

Hongtao Wu

Jiafeng Xu

Yichu Yang

Hanbo Zhang

Minzhao Zhu

Citation Format

Cheang, C., Chen, G., Jing, Y., Kong, T., Li, H., Li, Y. et al. (2024). GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation. https://doi.org/10.48550/arXiv.2410.06158

Quick Access

PDF not directly available

Check the original source →
View at source: doi.org/10.48550/arXiv.2410.06158
Journal Information
Publication Year
2024
Language
en
Total Citations
202
Source Database
Semantic Scholar
DOI
10.48550/arXiv.2410.06158
Access
Open Access ✓