MSTCNet: Multiscale Transformer-CNN Network for Robotic Grasp Detection

Research output: Contribution to journal › Article › peer-review

Abstract

Existing robotic grasp detection methods often produce inaccurate predictions in complex scenes involving multiple objects and textured backgrounds. Most of these methods enhance CNN-based architectures to improve grasp accuracy; however, their limited receptive fields and ineffective multiscale feature integration hinder the handling of objects of varying sizes, and performance degrades when local salient features are disturbed. To address these limitations, we propose a Swin Transformer-based grasp detection framework. It features a hierarchical Swin Transformer encoder that models both global contextual dependencies and local features through shifted-window and window attention mechanisms. In addition, we introduce a multiscale transformer pyramid pooling module that dynamically processes and fuses multiscale features, enabling the network to adjust grasp predictions to object scale. Finally, a lightweight CNN decoder is designed to optimize multiscale feature fusion while maintaining spatial precision. Our method achieves accuracies of 98.9% and 96.7% on the Cornell and Jacquard datasets, respectively, and runs in real time at 27 fps. We conduct comparative experiments, ablation experiments, and real-world grasping experiments on a UR3 robot, achieving a 96.0% success rate.
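The window and shifted-window attention mentioned in the abstract follow the standard Swin Transformer scheme: attention is computed within non-overlapping windows, and alternate layers cyclically shift the feature map by half a window so border tokens mix with new neighbours. A minimal sketch of the partitioning geometry (illustrative only; the function names are our own, not from the paper, and real implementations operate on feature tensors rather than index grids):

```python
# Sketch of Swin-style window partitioning with a cyclic shift.
# We use a plain 2D grid of token indices to show which tokens
# end up attending together; a real model would shift feature maps
# (e.g. with torch.roll) and apply attention inside each window.

def cyclic_shift(grid, shift):
    """Roll a 2D grid by `shift` rows and columns (wrap-around)."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
            for r in range(h)]

def window_partition(grid, win):
    """Split an H x W grid into non-overlapping win x win windows."""
    h, w = len(grid), len(grid[0])
    windows = []
    for r0 in range(0, h, win):
        for c0 in range(0, w, win):
            windows.append([[grid[r0 + r][c0 + c] for c in range(win)]
                            for r in range(win)])
    return windows

# 4x4 grid of token indices 0..15, window size 2.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]

# Regular window attention: tokens grouped in fixed 2x2 blocks.
plain = window_partition(grid, 2)          # first window: 0,1,4,5

# Shifted-window attention: shift by win // 2 first, so tokens that
# sat on window borders now share a window with new neighbours.
shifted = window_partition(cyclic_shift(grid, 1), 2)
```

Alternating the two partitions across successive layers is what lets the hierarchical encoder propagate information globally while each attention operation stays local and cheap.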

Original language: English
Journal: IEEE/ASME Transactions on Mechatronics
DOIs
Publication status: Accepted/In press - 2025
Externally published: Yes

Keywords

  • attention mechanism
  • grasp detection
  • multiscale feature
  • robotic grasping
  • transformer
