Abstract
Existing robotic grasp detection methods often struggle with inaccurate predictions in complex scenarios involving multiple objects and textured backgrounds. Most prior work enhances CNN-based architectures to improve grasp accuracy; however, limited receptive fields and ineffective multiscale feature integration hinder handling objects of varying sizes, and performance degrades when local salient features are disturbed. To address these limitations, we propose a Swin Transformer-based grasp detection framework. It features a hierarchical Swin Transformer encoder that models both global contextual dependencies and local features through window and shifted-window attention mechanisms. In addition, we introduce a multiscale transformer pyramid pooling module that dynamically fuses features across scales, enabling the network to adjust grasp predictions based on object size. Finally, a lightweight CNN decoder is designed to optimize multiscale feature fusion while maintaining spatial precision. Our method achieves accuracies of 98.9% and 96.7% on the Cornell and Jacquard datasets, respectively, and runs in real time at 27 frames per second. We conduct comparative and ablation experiments, as well as real-world grasping experiments on a UR3 robot, where the system achieves a 96.0% success rate.
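The abstract describes a three-stage pipeline: a hierarchical Swin Transformer encoder, a multiscale pyramid pooling stage, and a lightweight CNN decoder. The PyTorch sketch below illustrates how such a pipeline can fit together; it is not the authors' implementation. All concrete choices are assumptions for illustration: the window size (8), the two-block encoder, the PSPNet-style pooling used as a stand-in for the paper's multiscale transformer pyramid pooling module, and the four GG-CNN-style output maps (grasp quality, cos 2θ, sin 2θ, gripper width). The shifted-window block is simplified and omits the attention mask and relative position bias of the full Swin design.

```python
# Minimal sketch of a Swin-style grasp detector (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def window_partition(x, ws):
    """Split a (B, H, W, C) map into non-overlapping (ws x ws) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


def window_reverse(win, ws, H, W):
    """Inverse of window_partition, back to (B, H, W, C)."""
    B = win.shape[0] // ((H // ws) * (W // ws))
    x = win.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class SwinBlock(nn.Module):
    """One (shifted-)window attention block: W-MSA or SW-MSA plus an MLP.
    Simplified: no shift mask, no relative position bias."""
    def __init__(self, dim, heads, ws=8, shift=0):
        super().__init__()
        self.ws, self.shift = ws, shift
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                           # x: (B, H, W, C)
        B, H, W, C = x.shape
        s = self.shift
        h = torch.roll(x, (-s, -s), dims=(1, 2)) if s else x
        win = window_partition(self.norm1(h), self.ws)
        attn, _ = self.attn(win, win, win)          # self-attention per window
        h = window_reverse(attn, self.ws, H, W)
        h = torch.roll(h, (s, s), dims=(1, 2)) if s else h
        x = x + h
        return x + self.mlp(self.norm2(x))


class PyramidPool(nn.Module):
    """Fuse context pooled at several scales (PSPNet-style stand-in for
    the paper's multiscale transformer pyramid pooling module)."""
    def __init__(self, dim, scales=(1, 2, 4, 8)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(s),
                          nn.Conv2d(dim, dim // len(scales), 1))
            for s in scales])
        self.project = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x):                           # x: (B, C, H, W)
        feats = [x] + [F.interpolate(st(x), x.shape[2:], mode='bilinear',
                                     align_corners=False)
                       for st in self.stages]
        return self.project(torch.cat(feats, 1))


class GraspNet(nn.Module):
    """Patch embed -> Swin blocks -> pyramid pooling -> light CNN decoder,
    emitting per-pixel grasp quality, angle (cos/sin), and width maps."""
    def __init__(self, dim=96, heads=3, ws=8):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, 4, stride=4)    # 4x4 patch embedding
        self.blocks = nn.ModuleList([
            SwinBlock(dim, heads, ws, shift=0),        # window attention
            SwinBlock(dim, heads, ws, shift=ws // 2)]) # shifted-window attention
        self.ppm = PyramidPool(dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, 4, stride=2, padding=1),
            nn.ReLU())
        self.heads = nn.Conv2d(dim // 4, 4, 1)  # quality, cos2θ, sin2θ, width

    def forward(self, img):                     # img: (B, 3, 224, 224)
        x = self.embed(img).permute(0, 2, 3, 1)        # to (B, H, W, C)
        for blk in self.blocks:
            x = blk(x)
        x = self.ppm(x.permute(0, 3, 1, 2))            # back to (B, C, H, W)
        return self.heads(self.decoder(x))             # (B, 4, 224, 224)
```

A forward pass on a dummy input, `GraspNet()(torch.randn(1, 3, 224, 224))`, returns a `(1, 4, 224, 224)` tensor; the highest-quality pixel and its angle/width channels would then define the predicted grasp rectangle, following the common dense-prediction convention for the Cornell and Jacquard benchmarks.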
| Original language | English |
|---|---|
| Journal | IEEE/ASME Transactions on Mechatronics |
| DOIs | |
| Publication status | Accepted/In press - 2025 |
| Externally published | Yes |
Keywords
- Attention mechanism
- grasp detection
- multiscale feature
- robotic grasping
- transformer