Abstract
The dominant multi-camera 3D detection paradigm is based on explicit 3D feature construction, which requires complicated indexing of local image-view features via 3D-to-2D projection. Other methods (e.g., PETR) implicitly introduce geometric positional encoding and perform global attention to build the relationship between image tokens and 3D objects. In these methods, the 3D-to-2D perspective inconsistency and global attention lead to a weak correlation between foreground tokens and queries, resulting in slow convergence. We propose Focal-PETR, which uses instance-guided supervision and a spatial alignment module to adaptively focus object queries on discriminative foreground regions. Focal-PETR additionally introduces a down-sampling strategy to reduce the cost of global attention. Our model achieves leading performance on the large-scale nuScenes benchmark and a superior speed of 30 FPS on a single RTX 3090 GPU. Extensive experiments show that our method outperforms PETR while consuming 3x fewer training hours. The code is publicly available.
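
To illustrate why down-sampling image tokens lowers the cost of global attention, the following minimal sketch counts the query-token pairs that a single global cross-attention layer has to score. It is only a back-of-the-envelope illustration, not the authors' implementation: the function names, the 900 queries, the 20x50 feature map, and the stride of 2 are all hypothetical values chosen for the example. Only the six-camera setup corresponds to the nuScenes camera rig.

```python
import math


def num_image_tokens(num_cams: int, feat_h: int, feat_w: int, stride: int = 1) -> int:
    """Image tokens left after down-sampling each camera's feature map by `stride`."""
    return num_cams * math.ceil(feat_h / stride) * math.ceil(feat_w / stride)


def attention_pairs(num_queries: int, num_tokens: int) -> int:
    """Query-token pairs scored by one global cross-attention layer."""
    return num_queries * num_tokens


# Hypothetical sizes for illustration only; they are not taken from the paper.
NUM_QUERIES = 900
NUM_CAMS = 6            # nuScenes provides six surround-view cameras
FEAT_H, FEAT_W = 20, 50

full = attention_pairs(NUM_QUERIES, num_image_tokens(NUM_CAMS, FEAT_H, FEAT_W))
reduced = attention_pairs(
    NUM_QUERIES, num_image_tokens(NUM_CAMS, FEAT_H, FEAT_W, stride=2)
)

print(f"full resolution: {full} pairs")
print(f"stride 2:        {reduced} pairs")
print(f"reduction:       {full / reduced:.1f}x")
```

Under these assumed sizes, halving each spatial dimension of the feature map cuts the number of scored pairs by a factor of four, because the pair count grows linearly with the number of image tokens.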
Original language | English |
---|---|
Pages (from-to) | 1481-1489 |
Number of pages | 9 |
Journal | IEEE Transactions on Intelligent Vehicles |
Volume | 9 |
Issue number | 1 |
DOIs | |
Publication status | Published - 1 Jan 2024 |
Keywords
- 3D Object Detection
- Autonomous Driving
- Detection Transformer