TY - JOUR
T1 - Structured Local Feature-Conditioned 6-DOF Variational Grasp Detection Network in Cluttered Scenes
AU - Liu, Hongyang
AU - Li, Hui
AU - Jiang, Changhua
AU - Xue, Shuqi
AU - Zhao, Yan
AU - Huang, Xiao
AU - Jiang, Zhihong
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - One of the most crucial abilities for robots is to grasp objects accurately in cluttered scenes. This article proposes a structured local feature-conditioned 6-DOF variational grasp detection network (LF-GraspNet) that generates accurate grasp configurations in cluttered scenes end to end. First, we propose a network that uses a 3-D convolutional neural network with a conditional variational autoencoder (CVAE) as its backbone; the exploration capability of the CVAE enhances the network's generalizability in grasp detection. Second, we jointly encode the truncated signed distance function (TSDF) of the scene and successful grasp configurations into a global feature that serves as the prior of the CVAE's latent space, while the structured local feature of the TSDF volume serves as the condition of the CVAE, allowing the network to fuse features of different modalities and scales. Simulation and real-world grasping experiments demonstrate that LF-GraspNet, trained on a grasp dataset containing only a limited number of primitive objects, achieves higher success and declutter rates on unseen objects in cluttered scenes than baseline methods. In particular, in real-world experiments, LF-GraspNet grasps objects stably in cluttered scenes from either single-view or multiview depth images, demonstrating strong grasp performance and generalization from simple primitive objects to complex, unseen objects.
AB - One of the most crucial abilities for robots is to grasp objects accurately in cluttered scenes. This article proposes a structured local feature-conditioned 6-DOF variational grasp detection network (LF-GraspNet) that generates accurate grasp configurations in cluttered scenes end to end. First, we propose a network that uses a 3-D convolutional neural network with a conditional variational autoencoder (CVAE) as its backbone; the exploration capability of the CVAE enhances the network's generalizability in grasp detection. Second, we jointly encode the truncated signed distance function (TSDF) of the scene and successful grasp configurations into a global feature that serves as the prior of the CVAE's latent space, while the structured local feature of the TSDF volume serves as the condition of the CVAE, allowing the network to fuse features of different modalities and scales. Simulation and real-world grasping experiments demonstrate that LF-GraspNet, trained on a grasp dataset containing only a limited number of primitive objects, achieves higher success and declutter rates on unseen objects in cluttered scenes than baseline methods. In particular, in real-world experiments, LF-GraspNet grasps objects stably in cluttered scenes from either single-view or multiview depth images, demonstrating strong grasp performance and generalization from simple primitive objects to complex, unseen objects.
KW - conditional variational autoencoder (CVAE)
KW - convolutional neural network (CNN)
KW - grasp detection
KW - robot grasping
UR - http://www.scopus.com/inward/record.url?scp=85211984447&partnerID=8YFLogxK
U2 - 10.1109/TMECH.2024.3500577
DO - 10.1109/TMECH.2024.3500577
M3 - Article
AN - SCOPUS:85211984447
SN - 1083-4435
JO - IEEE/ASME Transactions on Mechatronics
JF - IEEE/ASME Transactions on Mechatronics
ER -