TY - JOUR
T1 - Fast and Efficient 6-DoF Grasp Estimation With Segment Anything Model in Cluttered Scenes
AU - Zhai, Di Hua
AU - Yu, Sheng
AU - Xia, Yuanqing
N1 - Publisher Copyright: © 1996-2012 IEEE.
PY - 2026
Y1 - 2026
AB - Grasping objects in unstructured and cluttered environments is a significant challenge. Although various 6-DoF grasping methods have been developed to tackle this issue, rapidly grasping objects from arbitrary viewpoints remains difficult. In this article, we introduce a zero-shot 6-DoF grasp pose estimation method for unstructured cluttered scenes, named FS-Grasp. First, we leverage the zero-shot capabilities of the segment anything model to perform object segmentation in cluttered scenes, thereby obtaining point clouds of unknown objects. Next, we design a zero-shot 6-DoF grasp pose prediction algorithm based on these object point clouds, enabling the detection of grasp poses for unknown objects in cluttered environments. In FS-Grasp, we introduce a multiscale, multiangle graspable region search algorithm that integrates transformers to conduct a comprehensive search for graspable poses. We conduct grasping tests across various datasets, and our experimental results demonstrate that the proposed FS-Grasp can be effectively applied to most zero-shot grasping tasks. Furthermore, we apply FS-Grasp in diverse human–robot interaction scenarios, establishing an autonomous robot grasping framework based on large visual language models, which successfully grasps and places multiple unknown objects, demonstrating considerable practical value.
KW - 6-DoF grasp
KW - human–robot interaction
KW - segment anything model (SAM)
KW - visual language model (VLM)
KW - zero-shot
UR - https://www.scopus.com/pages/publications/105039209791
U2 - 10.1109/TMECH.2026.3687717
DO - 10.1109/TMECH.2026.3687717
M3 - Article
AN - SCOPUS:105039209791
SN - 1083-4435
JO - IEEE/ASME Transactions on Mechatronics
JF - IEEE/ASME Transactions on Mechatronics
ER -