TY - JOUR
T1 - FreePose
T2 - Zero-Shot 6D Object Pose Estimation Using Pretrained Foundation Models
AU - Alsumeri, Abdulrahman
AU - Zhai, Di Hua
AU - Xia, Yuanqing
N1 - Publisher Copyright:
© 1991-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Accurate 6D object pose estimation is essential for robotic manipulation and augmented reality applications. Existing methods typically require extensive training for new objects, limiting their effectiveness in dynamic environments where new objects are frequently introduced. In this paper, we propose FreePose, an efficient training-free, zero-shot 6D pose estimation method that leverages pretrained visual and geometric foundation models. Our approach includes an offline onboarding stage, in which multiple viewpoint templates of a reference object are rendered, and visual and geometric features are extracted using pretrained visual and geometric models, respectively. The visual features are back-projected onto their corresponding 3D points, enabling precise alignment between appearance and geometry, and are then fused with the geometric features to form a robust unified representation. During the inference stage, target object instances are segmented from the RGB-D image using SAM2 coupled with an object-matching algorithm. Visual features of each target instance are similarly extracted, back-projected, and fused with geometric features. Robust 3D-3D correspondences are then established using nearest-neighbor search, and the final pose estimate is obtained with the TEASER registration algorithm. Extensive evaluations on the BOP5 core datasets show that our approach achieves results comparable to state-of-the-art methods. To highlight the effectiveness and potential of FreePose in real-world scenarios, we deploy it on a real UR3 robot for grasping experiments, achieving a grasp success rate of 65.0%.
AB - Accurate 6D object pose estimation is essential for robotic manipulation and augmented reality applications. Existing methods typically require extensive training for new objects, limiting their effectiveness in dynamic environments where new objects are frequently introduced. In this paper, we propose FreePose, an efficient training-free, zero-shot 6D pose estimation method that leverages pretrained visual and geometric foundation models. Our approach includes an offline onboarding stage, in which multiple viewpoint templates of a reference object are rendered, and visual and geometric features are extracted using pretrained visual and geometric models, respectively. The visual features are back-projected onto their corresponding 3D points, enabling precise alignment between appearance and geometry, and are then fused with the geometric features to form a robust unified representation. During the inference stage, target object instances are segmented from the RGB-D image using SAM2 coupled with an object-matching algorithm. Visual features of each target instance are similarly extracted, back-projected, and fused with geometric features. Robust 3D-3D correspondences are then established using nearest-neighbor search, and the final pose estimate is obtained with the TEASER registration algorithm. Extensive evaluations on the BOP5 core datasets show that our approach achieves results comparable to state-of-the-art methods. To highlight the effectiveness and potential of FreePose in real-world scenarios, we deploy it on a real UR3 robot for grasping experiments, achieving a grasp success rate of 65.0%.
KW - 6D object pose estimation
KW - unseen objects
KW - feature fusion
KW - foundation models
KW - training-free
UR - https://www.scopus.com/pages/publications/105020736588
U2 - 10.1109/TCSVT.2025.3627799
DO - 10.1109/TCSVT.2025.3627799
M3 - Article
AN - SCOPUS:105020736588
SN - 1051-8215
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
ER -