TY - JOUR
T1 - RoG-SAM
T2 - A Language-Driven Framework for Instance-Level Robotic Grasping Detection
AU - Mei, Yunpeng
AU - Sun, Jian
AU - Peng, Zhihong
AU - Deng, Fang
AU - Wang, Gang
AU - Chen, Jie
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Robotic grasping is a crucial topic in robotics and computer vision, with broad applications in industrial production and intelligent manufacturing. Although some methods have begun addressing instance-level grasping, most remain limited to predefined instances and categories, lacking flexibility for open-vocabulary grasp prediction based on user-specified instructions. To address this, we propose RoG-SAM, a language-driven, instance-level grasp detection framework built on the Segment Anything Model (SAM). RoG-SAM utilizes open-vocabulary prompts for object localization and grasp pose prediction, adapting SAM through transfer learning with encoder adapters and multi-head decoders to extend its segmentation capabilities to grasp pose estimation. Experimental results show that RoG-SAM achieves competitive performance on single-object datasets (Cornell and Jacquard) and cluttered datasets (GraspNet-1Billion and OCID), with instance-level accuracies of 91.2% and 90.1%, respectively, while using only 28.3% of SAM's trainable parameters. The effectiveness of RoG-SAM was also validated in real-world environments. A demonstration video is available at https://www.youtube.com/playlist?list=PL7et4nGJAImLGytsJbglGbXl1hacA2dy.
AB - Robotic grasping is a crucial topic in robotics and computer vision, with broad applications in industrial production and intelligent manufacturing. Although some methods have begun addressing instance-level grasping, most remain limited to predefined instances and categories, lacking flexibility for open-vocabulary grasp prediction based on user-specified instructions. To address this, we propose RoG-SAM, a language-driven, instance-level grasp detection framework built on the Segment Anything Model (SAM). RoG-SAM utilizes open-vocabulary prompts for object localization and grasp pose prediction, adapting SAM through transfer learning with encoder adapters and multi-head decoders to extend its segmentation capabilities to grasp pose estimation. Experimental results show that RoG-SAM achieves competitive performance on single-object datasets (Cornell and Jacquard) and cluttered datasets (GraspNet-1Billion and OCID), with instance-level accuracies of 91.2% and 90.1%, respectively, while using only 28.3% of SAM's trainable parameters. The effectiveness of RoG-SAM was also validated in real-world environments. A demonstration video is available at https://www.youtube.com/playlist?list=PL7et4nGJAImLGytsJbglGbXl1hacA2dy.
KW - fine-tuning
KW - grasp detection
KW - language-guided detection
KW - Robotic vision
KW - segment anything model
UR - http://www.scopus.com/inward/record.url?scp=105002133756&partnerID=8YFLogxK
U2 - 10.1109/TMM.2025.3557685
DO - 10.1109/TMM.2025.3557685
M3 - Article
AN - SCOPUS:105002133756
SN - 1520-9210
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -