Structured Local Feature-Conditioned 6-DOF Variational Grasp Detection Network in Cluttered Scenes

Hongyang Liu, Hui Li, Changhua Jiang, Shuqi Xue, Yan Zhao, Xiao Huang*, Zhihong Jiang*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Grasping objects accurately in cluttered scenes is one of the most crucial abilities for robots. This article proposes a structured local feature-conditioned 6-DOF variational grasp detection network (LF-GraspNet) that generates accurate grasp configurations in cluttered scenes end to end. First, we propose a network that uses a 3-D convolutional neural network with a conditional variational autoencoder (CVAE) as its backbone; the exploration ability of the CVAE enhances the network's generalizability in grasp detection. Second, we jointly encode the truncated signed distance function (TSDF) of the scene and successful grasp configurations into a global feature that serves as the prior of the CVAE's latent space. The structured local feature of the TSDF volume is used as the condition of the CVAE, allowing the network to effectively fuse features of different modalities and scales. Simulation and real-world grasp experiments demonstrate that LF-GraspNet, trained on a grasp dataset with a limited number of primitive objects, achieves higher success rates and declutter rates on unseen objects in cluttered scenes than baseline methods. Specifically, in real-world experiments, LF-GraspNet grasps objects stably in cluttered scenes with single-view and multiview depth images as input, demonstrating strong grasp performance and generalization from simple primitive objects to complex, unseen objects.
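The abstract's core mechanism, a CVAE whose latent prior comes from a global scene-plus-grasp feature and whose decoder is conditioned on a local TSDF feature, can be illustrated with a minimal sketch. This is not the paper's implementation: all dimensions, the linear "layers", and the 7-number grasp output (3-D position plus quaternion) are illustrative assumptions; only the reparameterization step and the concatenation of the latent sample with the conditioning feature reflect the standard CVAE pattern the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar, rng):
    # Sample z = mu + sigma * eps (reparameterization trick), which keeps
    # the sampling step differentiable in a real training framework.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Hypothetical dimensions, not from the paper: a 128-d global feature
# (jointly encoding the scene TSDF and a grasp), a 32-d latent space,
# and a 64-d structured local TSDF feature used as the CVAE condition.
D_GLOBAL, D_LATENT, D_LOCAL = 128, 32, 64

# Stand-in linear maps in place of the paper's learned 3-D CNN encoder/decoder.
W_mu = rng.standard_normal((D_GLOBAL, D_LATENT)) * 0.01
W_lv = rng.standard_normal((D_GLOBAL, D_LATENT)) * 0.01
# Decoder maps [z ; local_feature] -> 7 numbers (position + quaternion).
W_dec = rng.standard_normal((D_LATENT + D_LOCAL, 7)) * 0.01

global_feat = rng.standard_normal(D_GLOBAL)  # stand-in encoder output
local_feat = rng.standard_normal(D_LOCAL)    # stand-in local TSDF feature

mu, logvar = global_feat @ W_mu, global_feat @ W_lv
z = reparameterize(mu, logvar, rng)
grasp = np.concatenate([z, local_feat]) @ W_dec
print(grasp.shape)  # (7,)
```

At test time, such a model can draw multiple latent samples per scene and decode each one against the same local feature, yielding a diverse set of candidate grasps for the same observation.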

Original language: English
Journal: IEEE/ASME Transactions on Mechatronics
Publication status: Accepted/In press - 2024

Keywords

  • Conditional variational autoencoder (CVAE)
  • convolutional neural network (CNN)
  • grasp detection
  • robot grasping
