Abstract
Multimodal fusion technology significantly enhances the safety and perception capabilities of intelligent vehicles. Recently, replacing Cartesian-coordinate voxels with polar voxels in 3D perception tasks has markedly improved spatial occupancy and adaptability. However, the resulting uneven voxel distribution introduces new challenges: feature distortion and reduced real-time performance. This paper proposes a multimodal fusion network based on polar graphs to address these issues. Raw data from LiDAR, cameras, and millimeter-wave (MMW) radar are first preprocessed, and point-graph and voxel-graph structures are constructed in polar coordinates. Graph Attention Networks (GATs) then extract and aggregate features at multiple levels, forming a polar Bird's Eye View (BEV) feature map. At the BEV level, multimodal features are fused and multi-scale features are aggregated with a multi-scale GAT, and a polar-based CenterHead completes the 3D perception task. Extensive experiments on the nuScenes dataset and real-vehicle test data demonstrate that the model's detection precision (70.5% mAP) and inference speed (12.6 Hz) surpass those of comparative models, establishing a new state of the art (SOTA). The model also exhibits high perception accuracy, robustness, and generalizability across various real-vehicle scenarios.
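To make the polar-voxel idea concrete, the sketch below shows one common way to bin a LiDAR point cloud into a cylindrical (rho, phi, z) grid instead of a Cartesian one. This is a minimal illustration, not the authors' implementation; the function name, grid resolution, and range values are assumed placeholders rather than settings from the paper.

```python
# Minimal sketch of polar (cylindrical) voxelization of a LiDAR point cloud.
# All ranges and grid sizes are illustrative assumptions, not the paper's values.
import numpy as np

def polar_voxel_indices(points,
                        rho_range=(0.0, 50.0),   # radial extent in metres (assumed)
                        z_range=(-5.0, 3.0),     # height extent in metres (assumed)
                        grid=(480, 360, 32)):    # (rho, phi, z) bin counts (assumed)
    """Map LiDAR points (N, 3) in Cartesian x, y, z to polar voxel indices."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)   # radial distance from the ego vehicle
    phi = np.arctan2(y, x)           # azimuth angle in [-pi, pi)

    # Normalise each coordinate into [0, 1) over its range, then bin it.
    rho_n = (rho - rho_range[0]) / (rho_range[1] - rho_range[0])
    phi_n = (phi + np.pi) / (2 * np.pi)
    z_n = (z - z_range[0]) / (z_range[1] - z_range[0])
    norm = np.stack([rho_n, phi_n, z_n], axis=1)

    # Keep only points that fall inside the polar grid.
    mask = np.all((norm >= 0.0) & (norm < 1.0), axis=1)
    idx = (norm[mask] * np.array(grid)).astype(np.int64)
    return idx, mask

if __name__ == "__main__":
    pts = np.random.uniform(-40, 40, size=(1000, 3)).astype(np.float32)
    voxel_idx, kept = polar_voxel_indices(pts)
    print(voxel_idx.shape, int(kept.sum()))
```

Because azimuthal bins subtend a fixed angle, each voxel covers a larger area at long range, which is the uneven distribution that the paper's graph-based feature aggregation is designed to compensate for.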
Original language | English |
---|---|
Pages (from-to) | 1-12 |
Number of pages | 12 |
Journal | IEEE Transactions on Intelligent Vehicles |
Publication status | Accepted/In press - 2024 |
Externally published | Yes |
Keywords
- 3D Perception
- Cameras
- Feature extraction
- Graph Attention Networks
- Intelligent Vehicles
- Laser radar
- Multimodal Fusion
- Point cloud compression
- Radar
- Real-time systems
- Three-dimensional displays