
Fast and Efficient 6-DoF Grasp Estimation With Segment Anything Model in Cluttered Scenes

  • Beijing Institute of Technology
  • Zhongyuan University of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Grasping objects in unstructured, cluttered environments remains a significant challenge. Although various 6-DoF grasping methods have been developed to tackle this problem, rapidly grasping objects from arbitrary viewpoints is still difficult. In this article, we introduce FS-Grasp, a zero-shot 6-DoF grasp pose estimation method for unstructured cluttered scenes. First, we leverage the zero-shot capabilities of the segment anything model (SAM) to segment objects in cluttered scenes, thereby obtaining point clouds of unknown objects. Next, we design a zero-shot 6-DoF grasp pose prediction algorithm that operates on these object point clouds, enabling the detection of grasp poses for unknown objects in cluttered environments. Within FS-Grasp, we introduce a multiscale, multiangle graspable-region search algorithm that integrates transformers to search comprehensively for graspable poses. We conduct grasping tests across various datasets, and our experimental results demonstrate that FS-Grasp can be applied effectively to most zero-shot grasping tasks. Furthermore, we apply FS-Grasp in diverse human–robot interaction scenarios, establishing an autonomous robot grasping framework based on large visual language models (VLMs) that successfully grasps and places multiple unknown objects, demonstrating considerable practical value.
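The abstract's first step, turning a segmentation mask into an object point cloud, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a depth image aligned with the mask and a standard pinhole camera model, and the function names (`mask_to_point_cloud`, `centroid`) and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical names chosen for illustration. The mask would come from a segmentation model such as SAM; here it is simply a grid of booleans.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def mask_to_point_cloud(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project masked depth pixels into camera-frame 3D points.

    Assumes a pinhole camera: pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            # Skip pixels outside the object mask and invalid depth readings.
            if not keep or z <= 0.0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def centroid(points: Sequence[Point]) -> Point:
    """Mean of the points; a crude stand-in for an object's location."""
    if not points:
        raise ValueError("empty point cloud")
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )
```

A grasp pose predictor would then take the resulting per-object point cloud as input; the centroid is shown only as a simple sanity check on the back-projection, not as part of the grasp search described in the abstract.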

Original language: English
Journal: IEEE/ASME Transactions on Mechatronics
DOIs
Publication status: Accepted/In press - 2026
Externally published: Yes

Keywords

  • 6-DoF grasp
  • human–robot interaction
  • segment anything model (SAM)
  • visual language model (VLM)
  • zero-shot
