arXiv Open Access 2025

Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning

Azizul Zahid, Jie Fan, Farong Wang, Ashton Dy, Sai Swaminathan, Fei Liu

Abstract

Understanding action correspondence between humans and robots is essential for evaluating alignment in decision-making, particularly in human-robot collaboration and imitation learning within unstructured environments. We propose a multimodal demonstration learning framework that explicitly models human demonstrations from RGB video with robot demonstrations in voxelized RGB-D space. Focusing on the "pick and place" task from the RH20T dataset, we utilize data from 5 users across 10 diverse scenes. Our approach combines ResNet-based visual encoding for human intention modeling and a Perceiver Transformer for voxel-based robot action prediction. After 2000 training epochs, the human model reaches 71.67% accuracy, and the robot model achieves 71.8% accuracy, demonstrating the framework's potential for aligning complex, multimodal human and robot behaviors in manipulation tasks.
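For intuition, the sketch below shows one way the two branches described in the abstract could be structured in PyTorch: a ResNet-18 encoder over RGB frames standing in for the human intention model, and a small Perceiver-style module in which learned latent queries cross-attend to voxel tokens from the RGB-D scene standing in for the robot action model. This is a minimal illustration, not the authors' implementation; the class names, feature dimensions, latent count, and number of action classes are assumptions.

```python
# Minimal sketch (not the paper's code) of the two-branch setup described in
# the abstract. All dimensions and class counts below are illustrative.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class HumanIntentionModel(nn.Module):
    """ResNet-based classifier over RGB video frames (frame-level sketch)."""

    def __init__(self, num_actions: int = 4):
        super().__init__()
        backbone = resnet18()  # pretrained weights could be loaded instead
        backbone.fc = nn.Linear(backbone.fc.in_features, num_actions)
        self.backbone = backbone

    def forward(self, rgb_frames: torch.Tensor) -> torch.Tensor:
        # rgb_frames: (batch, 3, H, W) -> (batch, num_actions) logits
        return self.backbone(rgb_frames)


class RobotVoxelPerceiver(nn.Module):
    """Perceiver-style model: latent queries cross-attend to voxel tokens."""

    def __init__(self, voxel_feat_dim: int = 10, latent_dim: int = 256,
                 num_latents: int = 64, num_actions: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.input_proj = nn.Linear(voxel_feat_dim, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads=4,
                                                batch_first=True)
        self.self_attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(latent_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(latent_dim, num_actions)

    def forward(self, voxel_tokens: torch.Tensor) -> torch.Tensor:
        # voxel_tokens: (batch, num_voxels, voxel_feat_dim) from RGB-D voxelization
        tokens = self.input_proj(voxel_tokens)
        queries = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        latents, _ = self.cross_attn(queries, tokens, tokens)
        latents = self.self_attn(latents)
        return self.head(latents.mean(dim=1))  # pooled action logits


if __name__ == "__main__":
    human_logits = HumanIntentionModel()(torch.randn(2, 3, 224, 224))
    robot_logits = RobotVoxelPerceiver()(torch.randn(2, 512, 10))
    print(human_logits.shape, robot_logits.shape)  # both torch.Size([2, 4])
```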

Authors (6)

Azizul Zahid
Jie Fan
Farong Wang
Ashton Dy
Sai Swaminathan
Fei Liu

Citation Format

Zahid, A., Fan, J., Wang, F., Dy, A., Swaminathan, S., & Liu, F. (2025). Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning. https://arxiv.org/abs/2504.11493

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓