PLPFusion: Plane-Line-Pixel Fully Sparse Fusion for Robust Multi-Modal 3D Object Detection

Research output: Contribution to journal › Article › peer-review

Abstract

Fully sparse fusion strikes an excellent balance between efficiency and accuracy in multi-modal 3D object detection. However, most existing methods focus on foreground objects while overlooking background context, which compromises detection robustness, especially for occluded or small objects, and leads to suboptimal performance. To address this limitation, we propose PLPFusion, a novel fully sparse fusion framework that introduces a hierarchical Plane-Line-Pixel representation to progressively model object-context relationships. PLPFusion comprises three key modules: the Plane Enhancement Module (PEM), the Line Alignment Module (LAM), and the Pixel-Level Aggregation Module (PLAM). First, PEM exploits geometric cues from LiDAR feature planes to generate spatially aware object queries. Second, LAM refines these queries with geometric priors to make them semantically aware. Finally, PLAM leverages the semantically aware queries to aggregate pixel-level context and enhance discriminative completeness. On the nuScenes benchmark, PLPFusion achieves 71.9% mAP and 74.0% NDS, outperforming the baseline FUTR3D by +2.5% mAP and +1.9% NDS. On the KITTI benchmark, it achieves 72.68% BEV mAP and 67.39% 3D mAP. These results confirm its robustness and effectiveness across diverse multi-modal 3D scenarios.
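The abstract names the three modules but not their internals. Purely as a rough illustration of how such a Plane-Line-Pixel query pipeline could be wired together, here is a minimal PyTorch sketch. Everything beyond the module names PEM/LAM/PLAM — the mean-pooled geometric cue, the cross-attention refinement, all tensor shapes, class names, and parameters — is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the Plane-Line-Pixel pipeline from the abstract.
# Module internals are NOT from the paper; layer choices and shapes are
# illustrative assumptions only.
import torch
import torch.nn as nn


class PlaneEnhancementModule(nn.Module):
    """PEM (assumed): derives spatially aware object queries from
    LiDAR feature planes via a pooled geometric cue."""

    def __init__(self, num_queries: int, dim: int):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, dim)
        self.plane_proj = nn.Linear(dim, dim)

    def forward(self, lidar_planes: torch.Tensor) -> torch.Tensor:
        # lidar_planes: (B, N_plane, C) flattened LiDAR feature planes
        geo = self.plane_proj(lidar_planes.mean(dim=1, keepdim=True))
        # Broadcast the geometric cue onto the learned queries: (B, Q, C)
        return self.query_embed.weight.unsqueeze(0) + geo


class LineAlignmentModule(nn.Module):
    """LAM (assumed): refines queries with geometric priors via
    cross-attention over line-level features."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries, line_feats):
        refined, _ = self.attn(queries, line_feats, line_feats)
        return queries + refined


class PixelLevelAggregationModule(nn.Module):
    """PLAM (assumed): aggregates pixel-level image context using the
    semantically aware queries."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries, pixel_feats):
        ctx, _ = self.attn(queries, pixel_feats, pixel_feats)
        return queries + ctx


if __name__ == "__main__":
    B, C, Q = 2, 256, 100
    pem = PlaneEnhancementModule(Q, C)
    lam = LineAlignmentModule(C)
    plam = PixelLevelAggregationModule(C)

    lidar_planes = torch.randn(B, 500, C)   # flattened LiDAR planes
    line_feats = torch.randn(B, 200, C)     # line-level features
    pixel_feats = torch.randn(B, 1000, C)   # image pixel features

    q = pem(lidar_planes)       # spatially aware queries
    q = lam(q, line_feats)      # + geometric priors
    q = plam(q, pixel_feats)    # + pixel-level context
    print(q.shape)              # torch.Size([2, 100, 256])
```

The sketch only mirrors the coarse-to-fine ordering the abstract describes (plane → line → pixel); the actual PLPFusion modules may differ substantially.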

Original language: English
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Publication status: Accepted/In press - 2025

Keywords

  • Fully sparse fusion
  • Hierarchical representation modeling
  • LiDAR-Camera fusion
  • Multi-modal 3D object detection
