TY - JOUR
T1 - Multi-modal 6-DoF object pose tracking
T2 - integrating spatial cues with monocular RGB imagery
AU - Mei, Yunpeng
AU - Wang, Shuze
AU - Li, Zhuo
AU - Sun, Jian
AU - Wang, Gang
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
PY - 2025/2
Y1 - 2025/2
N2 - Accurate six degrees of freedom (6-DoF) pose estimation is crucial for robust visual perception in fields such as smart manufacturing. Traditional RGB-based methods, though widely used, often struggle to adapt to dynamic scenes, understand contextual information, and capture temporal variations effectively. To address these challenges, we introduce a novel multi-modal 6-DoF pose estimation framework that uses RGB images as the primary input and integrates spatial cues, including keypoint heatmaps and affinity fields, through a spatially aligned approach inspired by the Trans-UNet architecture. Our multi-modal method enhances both contextual understanding and temporal consistency. Experimental results on the Objectron dataset show that our approach surpasses existing algorithms across most categories, and real-world tests confirm its accuracy and practical applicability for robotic tasks such as precision grasping.
AB - Accurate six degrees of freedom (6-DoF) pose estimation is crucial for robust visual perception in fields such as smart manufacturing. Traditional RGB-based methods, though widely used, often struggle to adapt to dynamic scenes, understand contextual information, and capture temporal variations effectively. To address these challenges, we introduce a novel multi-modal 6-DoF pose estimation framework that uses RGB images as the primary input and integrates spatial cues, including keypoint heatmaps and affinity fields, through a spatially aligned approach inspired by the Trans-UNet architecture. Our multi-modal method enhances both contextual understanding and temporal consistency. Experimental results on the Objectron dataset show that our approach surpasses existing algorithms across most categories, and real-world tests confirm its accuracy and practical applicability for robotic tasks such as precision grasping.
KW - 6-DoF pose estimation
KW - Multi-modal learning
KW - Spatial-temporal modeling
UR - http://www.scopus.com/inward/record.url?scp=85202714519&partnerID=8YFLogxK
U2 - 10.1007/s13042-024-02336-8
DO - 10.1007/s13042-024-02336-8
M3 - Article
AN - SCOPUS:85202714519
SN - 1868-8071
VL - 16
SP - 1327
EP - 1340
JO - International Journal of Machine Learning and Cybernetics
JF - International Journal of Machine Learning and Cybernetics
IS - 2
ER -