Category-Level 6-D Object Pose Estimation With Shape Deformation for Robotic Grasp Detection

Sheng Yu; Di Hua Zhai; Yuyin Guan; Yuanqing Xia

doi:10.1109/TNNLS.2023.3330011

School of Automation

Research output: Contribution to journal › Article › peer-review

1 citation (Scopus)

Abstract

Category-level 6-D object pose estimation plays a crucial role in achieving reliable robotic grasp detection. However, the disparity between synthetic and real datasets hinders the direct transfer of models trained on synthetic data to real-world scenarios, leading to ineffective results. Additionally, creating large-scale real datasets is a time-consuming and labor-intensive task. To overcome these challenges, we propose CatDeform, a novel category-level object pose estimation network trained on synthetic data but capable of delivering good performance on real datasets. In our approach, we introduce a transformer-based fusion module that enables the network to leverage multiple sources of information and enhance prediction accuracy through feature fusion. To ensure proper deformation of the prior point cloud to align with scene objects, we propose a transformer-based attention module that deforms the prior point cloud from both geometric and feature perspectives. Building upon CatDeform, we design a two-branch network for supervised learning, bridging the gap between synthetic and real datasets and achieving high-precision pose estimation in real-world scenes using predominantly synthetic data supplemented with a small amount of real data. To minimize reliance on large-scale real datasets, we train the network in a self-supervised manner by estimating object poses in real scenes based on the synthetic dataset without manual annotation. We conduct training and testing on CAMERA25 and REAL275 datasets, and our experimental results demonstrate that the proposed method outperforms state-of-the-art (SOTA) techniques in both self-supervised and supervised training paradigms. Finally, we apply CatDeform to object pose estimation and robotic grasp experiments in real-world scenarios, showcasing a higher grasp success rate.

Original language	English
Pages (from-to)	1-15
Number of pages	15
Journal	IEEE Transactions on Neural Networks and Learning Systems
DOI	https://doi.org/10.1109/TNNLS.2023.3330011
Publication status	Accepted/In press - 2023

Cite this

@article{22c1d994b2b7475facacdd08e6058b1b,

title = "Category-Level 6-D Object Pose Estimation With Shape Deformation for Robotic Grasp Detection",

abstract = "Category-level 6-D object pose estimation plays a crucial role in achieving reliable robotic grasp detection. However, the disparity between synthetic and real datasets hinders the direct transfer of models trained on synthetic data to real-world scenarios, leading to ineffective results. Additionally, creating large-scale real datasets is a time-consuming and labor-intensive task. To overcome these challenges, we propose CatDeform, a novel category-level object pose estimation network trained on synthetic data but capable of delivering good performance on real datasets. In our approach, we introduce a transformer-based fusion module that enables the network to leverage multiple sources of information and enhance prediction accuracy through feature fusion. To ensure proper deformation of the prior point cloud to align with scene objects, we propose a transformer-based attention module that deforms the prior point cloud from both geometric and feature perspectives. Building upon CatDeform, we design a two-branch network for supervised learning, bridging the gap between synthetic and real datasets and achieving high-precision pose estimation in real-world scenes using predominantly synthetic data supplemented with a small amount of real data. To minimize reliance on large-scale real datasets, we train the network in a self-supervised manner by estimating object poses in real scenes based on the synthetic dataset without manual annotation. We conduct training and testing on CAMERA25 and REAL275 datasets, and our experimental results demonstrate that the proposed method outperforms state-of-the-art (SOTA) techniques in both self-supervised and supervised training paradigms. Finally, we apply CatDeform to object pose estimation and robotic grasp experiments in real-world scenarios, showcasing a higher grasp success rate.",

keywords = "Category-level object pose estimation, Deformation, Feature extraction, Point cloud compression, Pose estimation, Shape, Solid modeling, Transformers, robotic grasp, shape deformation, transformer",

author = "Sheng Yu and Zhai, {Di Hua} and Yuyin Guan and Yuanqing Xia",

note = "Publisher Copyright: IEEE",

year = "2023",

doi = "10.1109/TNNLS.2023.3330011",

language = "English",

pages = "1--15",

journal = "IEEE Transactions on Neural Networks and Learning Systems",

issn = "2162-237X",

publisher = "IEEE Computational Intelligence Society",

}

TY - JOUR

T1 - Category-Level 6-D Object Pose Estimation With Shape Deformation for Robotic Grasp Detection

AU - Yu, Sheng

AU - Zhai, Di Hua

AU - Guan, Yuyin

AU - Xia, Yuanqing

N1 - Publisher Copyright: IEEE

PY - 2023

Y1 - 2023

N2 - Category-level 6-D object pose estimation plays a crucial role in achieving reliable robotic grasp detection. However, the disparity between synthetic and real datasets hinders the direct transfer of models trained on synthetic data to real-world scenarios, leading to ineffective results. Additionally, creating large-scale real datasets is a time-consuming and labor-intensive task. To overcome these challenges, we propose CatDeform, a novel category-level object pose estimation network trained on synthetic data but capable of delivering good performance on real datasets. In our approach, we introduce a transformer-based fusion module that enables the network to leverage multiple sources of information and enhance prediction accuracy through feature fusion. To ensure proper deformation of the prior point cloud to align with scene objects, we propose a transformer-based attention module that deforms the prior point cloud from both geometric and feature perspectives. Building upon CatDeform, we design a two-branch network for supervised learning, bridging the gap between synthetic and real datasets and achieving high-precision pose estimation in real-world scenes using predominantly synthetic data supplemented with a small amount of real data. To minimize reliance on large-scale real datasets, we train the network in a self-supervised manner by estimating object poses in real scenes based on the synthetic dataset without manual annotation. We conduct training and testing on CAMERA25 and REAL275 datasets, and our experimental results demonstrate that the proposed method outperforms state-of-the-art (SOTA) techniques in both self-supervised and supervised training paradigms. Finally, we apply CatDeform to object pose estimation and robotic grasp experiments in real-world scenarios, showcasing a higher grasp success rate.

KW - Category-level object pose estimation

KW - Deformation

KW - Feature extraction

KW - Point cloud compression

KW - Pose estimation

KW - Shape

KW - Solid modeling

KW - Transformers

KW - robotic grasp

KW - shape deformation

KW - transformer

UR - http://www.scopus.com/inward/record.url?scp=85177080921&partnerID=8YFLogxK

U2 - 10.1109/TNNLS.2023.3330011

DO - 10.1109/TNNLS.2023.3330011

M3 - Article

AN - SCOPUS:85177080921

SN - 2162-237X

SP - 1

EP - 15

JO - IEEE Transactions on Neural Networks and Learning Systems

JF - IEEE Transactions on Neural Networks and Learning Systems

ER -
