arXiv Open Access 2025

VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation

Jun Zhou, Chi Xu, Kaifeng Tang, Yuting Ge, Tingrui Guo, +1 more

Abstract

Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone and often produce results that violate physical constraints, exhibiting interpenetration or missing contact. Recent efforts to incorporate physics reasoning typically depend on post-optimization or non-differentiable physics engines, which compromise visual consistency and end-to-end trainability. To overcome these limitations, we propose a novel framework that jointly integrates visual and physical cues for hand-object pose estimation. This integration is achieved through two key ideas: 1) joint visual-physical cue learning: the model is trained to extract 2D visual cues and 3D physical cues, enabling more comprehensive representation learning for hand-object interactions; 2) candidate pose aggregation: a novel refinement process aggregates multiple diffusion-generated candidate poses by leveraging both visual and physical predictions, yielding a final estimate that is visually consistent and physically plausible. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in both pose accuracy and physical plausibility.
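
The abstract does not specify how candidate poses are aggregated, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each candidate is a flattened list of 3D keypoint coordinates and that a visual error (for example, a 2D reprojection error) and a physical error (for example, a penetration penalty) are already available per candidate. The function name `aggregate_candidates`, its parameters, the additive cost, and the softmax weighting are all hypothetical choices made for illustration.

```python
import math


def aggregate_candidates(candidates, visual_errors, physical_errors, temperature=1.0):
    """Fuse candidate poses with a softmax over combined visual + physical cost.

    Hypothetical illustration only; the paper's actual aggregation may differ.
    candidates: list of equal-length lists of floats (flattened keypoints).
    visual_errors, physical_errors: one score per candidate, lower is better.
    Returns (fused_pose, weights).
    """
    if not candidates:
        raise ValueError("at least one candidate pose is required")
    if not (len(candidates) == len(visual_errors) == len(physical_errors)):
        raise ValueError("one visual and one physical error per candidate is required")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Assumption: the two cues are combined by simple addition.
    costs = [v + p for v, p in zip(visual_errors, physical_errors)]

    # Subtract the minimum cost before exponentiating for numerical stability.
    min_cost = min(costs)
    raw = [math.exp(-(c - min_cost) / temperature) for c in costs]
    total = sum(raw)
    weights = [r / total for r in raw]

    # Weighted average per coordinate. This is reasonable for keypoint
    # positions, but not for rotation parameters, which need special handling.
    dim = len(candidates[0])
    fused = [sum(w * cand[i] for w, cand in zip(weights, candidates)) for i in range(dim)]
    return fused, weights


if __name__ == "__main__":
    poses = [[0.0, 0.0, 0.0], [2.0, 4.0, 6.0]]
    fused, weights = aggregate_candidates(poses, [1.0, 1.0], [0.5, 0.5])
    print(fused, weights)
```

With equal costs, as in the usage lines above, both candidates receive the same weight and the fused pose is their plain average. Lowering `temperature` moves the result toward the single lowest-cost candidate.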

Topics & Keywords

cs.CV

Authors (6)

Jun Zhou

Chi Xu

Kaifeng Tang

Yuting Ge

Tingrui Guo

Li Cheng

Citation Format

Zhou, J., Xu, C., Tang, K., Ge, Y., Guo, T., & Cheng, L. (2025). VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation. https://arxiv.org/abs/2511.12030

Journal Information

Publication Year: 2025
Language: en
Database Source: arXiv
Access: Open Access ✓