RoG-SAM: A Language-Driven Framework for Instance-Level Robotic Grasping Detection

Yunpeng Mei, Jian Sun, Zhihong Peng, Fang Deng, Gang Wang*, Jie Chen

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Robotic grasping is a crucial topic in robotics and computer vision, with broad applications in industrial production and intelligent manufacturing. Although some methods have begun to address instance-level grasping, most remain limited to predefined instances and categories and lack the flexibility for open-vocabulary grasp prediction based on user-specified instructions. To address this, we propose RoG-SAM, a language-driven, instance-level grasp detection framework built on the Segment Anything Model (SAM). RoG-SAM uses open-vocabulary prompts for object localization and grasp pose prediction, adapting SAM through transfer learning with encoder adapters and multi-head decoders to extend its segmentation capabilities to grasp pose estimation. Experimental results show that RoG-SAM achieves competitive performance on single-object datasets (Cornell and Jacquard) and cluttered datasets (GraspNet-1Billion and OCID), with instance-level accuracies of 91.2% and 90.1%, respectively, while using only 28.3% of SAM's trainable parameters. The effectiveness of RoG-SAM was also validated in real-world environments. A demonstration video is available at https://www.youtube.com/playlist?list=PL7et4nGJAImLGytsJbglGbXl1hacA2dy.
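To make the adapter-plus-multi-head-decoder pattern described in the abstract more concrete, the following is a minimal PyTorch sketch of that general idea, not the authors' implementation: a residual bottleneck adapter of the kind typically inserted into a frozen encoder, and a decoder that fuses image features with a text-prompt embedding via cross-attention and predicts a segmentation mask alongside dense planar grasp maps (quality, angle, width). All class names, feature dimensions, and the specific head layout are assumptions for illustration; RoG-SAM's actual architecture may differ.

```python
# Illustrative sketch only (not the RoG-SAM code). A frozen SAM-style encoder is
# assumed to produce feature maps; adapters and the multi-head decoder below are
# the only trainable parts, mirroring the transfer-learning idea in the abstract.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Residual bottleneck adapter, hypothetically inserted into frozen encoder blocks."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        # x: (B, tokens, dim); residual connection keeps the frozen features intact
        return x + self.up(self.act(self.down(x)))


class MultiHeadGraspDecoder(nn.Module):
    """Fuses image features with a text-prompt embedding and predicts a mask
    plus dense grasp maps (quality, sin/cos of angle, gripper width)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)
        self.quality_head = nn.Conv2d(dim, 1, kernel_size=1)
        self.angle_head = nn.Conv2d(dim, 2, kernel_size=1)   # sin(2θ), cos(2θ)
        self.width_head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, feats, text_emb):
        # feats: (B, C, H, W) image features; text_emb: (B, 1, C) prompt embedding
        B, C, H, W = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)             # (B, H*W, C)
        fused, _ = self.fuse(tokens, text_emb, text_emb)      # cross-attention on the prompt
        fused = (tokens + fused).transpose(1, 2).reshape(B, C, H, W)
        return {
            "mask": self.mask_head(fused),
            "quality": self.quality_head(fused),
            "angle": self.angle_head(fused),
            "width": self.width_head(fused),
        }


if __name__ == "__main__":
    feats = torch.randn(2, 256, 64, 64)   # stand-in for frozen SAM encoder features
    text = torch.randn(2, 1, 256)         # stand-in for an open-vocabulary text embedding
    adapter = Adapter(256)
    tokens = adapter(feats.flatten(2).transpose(1, 2))        # adapter on encoder tokens
    feats = tokens.transpose(1, 2).reshape(2, 256, 64, 64)
    out = MultiHeadGraspDecoder()(feats, text)
    print({k: tuple(v.shape) for k, v in out.items()})
```

In such a setup, only the adapter and decoder parameters would be optimized, which is how a method of this kind can train a small fraction of the backbone's parameters while reusing its segmentation features for grasp prediction.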

Original language: English
Journal: IEEE Transactions on Multimedia
Publication status: Accepted/In press - 2025

Keywords

  • fine-tuning
  • grasp detection
  • language-guided detection
  • robotic vision
  • segment anything model
