Abstract
The dominant multi-camera 3D detection paradigm is based on explicit 3D feature construction, which requires complicated indexing of local image-view features via 3D-to-2D projection. Other methods, such as PETR, implicitly introduce geometric positional encoding and perform global attention to build the relationship between image tokens and 3D objects. The 3D-to-2D perspective inconsistency and global attention lead to a weak correlation between foreground tokens and queries, resulting in slow convergence. We propose Focal-PETR, which uses instance-guided supervision and a spatial alignment module to adaptively focus object queries on discriminative foreground regions. Focal-PETR additionally introduces a down-sampling strategy to reduce the cost of global attention. Our model achieves leading performance on the large-scale nuScenes benchmark and a superior speed of 30 FPS on a single RTX3090 GPU. Extensive experiments show that our method outperforms PETR while consuming 3x fewer training hours. The code is made publicly available.
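The abstract does not spell out how the down-sampling strategy works, so the following is only a minimal sketch of the general idea: if fewer image tokens reach the global attention, the number of query-token pairs shrinks proportionally. It assumes, hypothetically, that tokens are ranked by a per-token foreground score and that only the top fraction is kept. The names `downsample_tokens`, `attention_pairs`, and `keep_ratio`, and the score values, are illustrative and are not taken from the paper or its code.

```python
def downsample_tokens(scores, keep_ratio):
    """Return indices of the highest-scoring tokens, in their original order.

    Assumption: each image token has a scalar foreground score.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, int(len(scores) * keep_ratio))
    # Rank token indices by score (highest first), then keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def attention_pairs(num_queries, num_tokens):
    """Number of query-token pairs scored by one global cross-attention layer."""
    return num_queries * num_tokens


# Hypothetical foreground scores for 8 image tokens (illustrative values only).
scores = [0.05, 0.90, 0.10, 0.75, 0.02, 0.60, 0.08, 0.30]
kept = downsample_tokens(scores, keep_ratio=0.5)

full_cost = attention_pairs(num_queries=4, num_tokens=len(scores))
reduced_cost = attention_pairs(num_queries=4, num_tokens=len(kept))
print(kept, full_cost, reduced_cost)
```

With these toy numbers, keeping half of the tokens halves the number of query-token pairs. The actual ratio and selection rule used by Focal-PETR may differ.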
| Original language | English |
|---|---|
| Pages (from-to) | 1481-1489 |
| Number of pages | 9 |
| Journal | IEEE Transactions on Intelligent Vehicles |
| Volume | 9 |
| Issue number | 1 |
| DOIs | |
| Publication status | Published - 1 Jan 2024 |
Keywords
- 3D Object Detection
- Autonomous Driving
- Detection Transformer
Title
Focal-PETR: Embracing Foreground for Efficient Multi-Camera 3D Object Detection