Multi-modal 6-DoF object pose tracking: integrating spatial cues with monocular RGB imagery

Yunpeng Mei, Shuze Wang, Zhuo Li, Jian Sun, Gang Wang*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Accurate six degrees of freedom (6-DoF) pose estimation is crucial for robust visual perception in fields such as smart manufacturing. Traditional RGB-based methods, though widely used, often face difficulties in adapting to dynamic scenes, understanding contextual information, and capturing temporal variations effectively. To address these challenges, we introduce a novel multi-modal 6-DoF pose estimation framework. This framework uses RGB images as the primary input and integrates spatial cues, including keypoint heatmaps and affinity fields, through a spatially aligned approach inspired by the Trans-UNet architecture. Our multi-modal method enhances both contextual understanding and temporal consistency. Experimental results on the Objectron dataset demonstrate that our approach surpasses existing algorithms across most categories. Furthermore, real-world tests confirm the accuracy and practical applicability of our method for robotic tasks, such as precision grasping, highlighting its effectiveness for real-world applications.
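The abstract describes feeding the network RGB imagery alongside spatially aligned cues (keypoint heatmaps and affinity fields). A minimal NumPy sketch of that idea is shown below: stacking co-registered modalities channel-wise, plus a soft-argmax decoder for reading a keypoint location out of a heatmap. The function names, channel counts, and the soft-argmax decoder are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def build_multimodal_input(rgb, heatmaps, affinity_fields):
    """Stack spatially aligned cues into one input tensor (hypothetical sketch).

    rgb:             (H, W, 3)  monocular RGB image
    heatmaps:        (H, W, K)  one heatmap per object keypoint
    affinity_fields: (H, W, 2E) x/y affinity components per skeleton edge
    """
    # All modalities must share the same spatial grid to be concatenated.
    assert rgb.shape[:2] == heatmaps.shape[:2] == affinity_fields.shape[:2]
    return np.concatenate([rgb, heatmaps, affinity_fields], axis=-1)

def soft_argmax(heatmap):
    """Decode a 2D keypoint location from one heatmap via a softmax expectation."""
    h, w = heatmap.shape
    p = np.exp(heatmap - heatmap.max())  # subtract max for numerical stability
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    # Expected (x, y) under the softmax distribution over pixel locations.
    return float((p * xs).sum()), float((p * ys).sum())
```

For example, with K = 9 keypoints (8 box corners plus a centroid, as in Objectron-style annotations) and 8 skeleton edges, the fused input would have 3 + 9 + 16 = 28 channels.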

Original language: English
Pages (from-to): 1327-1340
Number of pages: 14
Journal: International Journal of Machine Learning and Cybernetics
Volume: 16
Issue number: 2
DOIs
Publication status: Published - Feb 2025

Keywords

  • 6-DoF pose estimation
  • Multi-modal learning
  • Spatial-temporal modeling
