arXiv Open Access 2025

Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning

Azizul Zahid, Jie Fan, Farong Wang, Ashton Dy, Sai Swaminathan, Fei Liu

Abstract

Understanding action correspondence between humans and robots is essential for evaluating alignment in decision-making, particularly in human-robot collaboration and imitation learning within unstructured environments. We propose a multimodal demonstration learning framework that explicitly models human demonstrations from RGB video with robot demonstrations in voxelized RGB-D space. Focusing on the "pick and place" task from the RH20T dataset, we utilize data from 5 users across 10 diverse scenes. Our approach combines ResNet-based visual encoding for human intention modeling and a Perceiver Transformer for voxel-based robot action prediction. After 2000 training epochs, the human model reaches 71.67% accuracy, and the robot model achieves 71.8% accuracy, demonstrating the framework's potential for aligning complex, multimodal human and robot behaviors in manipulation tasks.
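For intuition, the sketch below shows one way the two branches described in the abstract could be structured in PyTorch: a ResNet-18 encoder over RGB frames standing in for the human intention model, and a small Perceiver-style module in which learned latent queries cross-attend to voxel tokens from the RGB-D scene standing in for the robot action model. This is a minimal illustration, not the authors' implementation; the class names, feature dimensions, latent count, and number of action classes are assumptions.

```python
# Minimal sketch (not the paper's code) of the two-branch setup described in
# the abstract. All dimensions and class counts below are illustrative.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class HumanIntentionModel(nn.Module):
    """ResNet-based classifier over RGB video frames (frame-level sketch)."""

    def __init__(self, num_actions: int = 4):
        super().__init__()
        backbone = resnet18()  # pretrained weights could be loaded instead
        backbone.fc = nn.Linear(backbone.fc.in_features, num_actions)
        self.backbone = backbone

    def forward(self, rgb_frames: torch.Tensor) -> torch.Tensor:
        # rgb_frames: (batch, 3, H, W) -> (batch, num_actions) logits
        return self.backbone(rgb_frames)


class RobotVoxelPerceiver(nn.Module):
    """Perceiver-style model: latent queries cross-attend to voxel tokens."""

    def __init__(self, voxel_feat_dim: int = 10, latent_dim: int = 256,
                 num_latents: int = 64, num_actions: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.input_proj = nn.Linear(voxel_feat_dim, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads=4,
                                                batch_first=True)
        self.self_attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(latent_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(latent_dim, num_actions)

    def forward(self, voxel_tokens: torch.Tensor) -> torch.Tensor:
        # voxel_tokens: (batch, num_voxels, voxel_feat_dim) from RGB-D voxelization
        tokens = self.input_proj(voxel_tokens)
        queries = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        latents, _ = self.cross_attn(queries, tokens, tokens)
        latents = self.self_attn(latents)
        return self.head(latents.mean(dim=1))  # pooled action logits


if __name__ == "__main__":
    human_logits = HumanIntentionModel()(torch.randn(2, 3, 224, 224))
    robot_logits = RobotVoxelPerceiver()(torch.randn(2, 512, 10))
    print(human_logits.shape, robot_logits.shape)  # both torch.Size([2, 4])
```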

Authors (6)

Azizul Zahid
Jie Fan
Farong Wang
Ashton Dy
Sai Swaminathan
Fei Liu

Citation Format

Zahid, A., Fan, J., Wang, F., Dy, A., Swaminathan, S., & Liu, F. (2025). Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning. https://arxiv.org/abs/2504.11493

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓