Focal-PETR: Embracing Foreground for Efficient Multi-Camera 3D Object Detection

Shihao Wang; Xiaohui Jiang; Ying Li

doi:10.1109/TIV.2023.3332608

Shihao Wang, Xiaohui Jiang, Ying Li^*

^*Corresponding author for this work

School of Mechanical Engineering

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

The dominant multi-camera 3D detection paradigm is based on explicit 3D feature construction, which requires complicated indexing of local image-view features via 3D-to-2D projection. Other methods implicitly introduce geometric positional encoding and perform global attention (e.g., PETR) to build the relationship between image tokens and 3D objects. The 3D-to-2D perspective inconsistency and global attention lead to a weak correlation between foreground tokens and queries, resulting in slow convergence. We propose Focal-PETR with instance-guided supervision and spatial alignment module to adaptively focus object queries on discriminative foreground regions. Focal-PETR additionally introduces a down-sampling strategy to reduce the consumption of global attention. Our model achieves leading performance on the large-scale nuScenes benchmark and a superior speed of 30 FPS on a single RTX3090 GPU. Extensive experiments show that our method outperforms PETR while consuming 3x fewer training hours. The code is made publicly available.

Original language	English
Pages (from-to)	1481-1489
Number of pages	9
Journal	IEEE Transactions on Intelligent Vehicles
Volume	9
Issue number	1
DOIs	https://doi.org/10.1109/TIV.2023.3332608
Publication status	Published - 1 Jan 2024

Keywords

3D Object Detection
Autonomous Driving
Detection Transformer

Cite this

@article{d701e5e71dfa4f929e74114f6b2b8d53,

title = "Focal-PETR: Embracing Foreground for Efficient Multi-Camera 3D Object Detection",

abstract = "The dominant multi-camera 3D detection paradigm is based on explicit 3D feature construction, which requires complicated indexing of local image-view features via 3D-to-2D projection. Other methods implicitly introduce geometric positional encoding and perform global attention (e.g., PETR) to build the relationship between image tokens and 3D objects. The 3D-to-2D perspective inconsistency and global attention lead to a weak correlation between foreground tokens and queries, resulting in slow convergence. We propose Focal-PETR with instance-guided supervision and spatial alignment module to adaptively focus object queries on discriminative foreground regions. Focal-PETR additionally introduces a down-sampling strategy to reduce the consumption of global attention. Our model achieves leading performance on the large-scale nuScenes benchmark and a superior speed of 30 FPS on a single RTX3090 GPU. Extensive experiments show that our method outperforms PETR while consuming 3x fewer training hours. The code is made publicly available.",

keywords = "3D Object Detection, Autonomous Driving, Detection Transformer",

author = "Shihao Wang and Xiaohui Jiang and Ying Li",

note = "Publisher Copyright: {\textcopyright} 2016 IEEE.",

year = "2024",

month = jan,

day = "1",

doi = "10.1109/TIV.2023.3332608",

language = "English",

volume = "9",

pages = "1481--1489",

journal = "IEEE Transactions on Intelligent Vehicles",

issn = "2379-8858",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "1",

}

TY - JOUR

T1 - Focal-PETR

T2 - Embracing Foreground for Efficient Multi-Camera 3D Object Detection

AU - Wang, Shihao

AU - Jiang, Xiaohui

AU - Li, Ying

PY - 2024/1/1

Y1 - 2024/1/1

N2 - The dominant multi-camera 3D detection paradigm is based on explicit 3D feature construction, which requires complicated indexing of local image-view features via 3D-to-2D projection. Other methods implicitly introduce geometric positional encoding and perform global attention (e.g., PETR) to build the relationship between image tokens and 3D objects. The 3D-to-2D perspective inconsistency and global attention lead to a weak correlation between foreground tokens and queries, resulting in slow convergence. We propose Focal-PETR with instance-guided supervision and spatial alignment module to adaptively focus object queries on discriminative foreground regions. Focal-PETR additionally introduces a down-sampling strategy to reduce the consumption of global attention. Our model achieves leading performance on the large-scale nuScenes benchmark and a superior speed of 30 FPS on a single RTX3090 GPU. Extensive experiments show that our method outperforms PETR while consuming 3x fewer training hours. The code is made publicly available.

KW - 3D Object Detection

KW - Autonomous Driving

KW - Detection Transformer

UR - http://www.scopus.com/inward/record.url?scp=85177072670&partnerID=8YFLogxK

U2 - 10.1109/TIV.2023.3332608

DO - 10.1109/TIV.2023.3332608

M3 - Article

AN - SCOPUS:85177072670

SN - 2379-8858

VL - 9

SP - 1481

EP - 1489

JO - IEEE Transactions on Intelligent Vehicles

JF - IEEE Transactions on Intelligent Vehicles

IS - 1

ER -
