TY - JOUR
T1 - PLPFusion
T2 - Plane-Line-Pixel Fully Sparse Fusion for Robust Multi-Modal 3D Object Detection
AU - Hou, Jingfu
AU - Song, Hong
AU - Li, Jinfu
AU - Lin, Yucong
AU - Huang, Tianyu
AU - He, Jugang
AU - He, Xiuwei
AU - Yang, Jian
N1 - Publisher Copyright:
© 1991-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Fully sparse fusion strikes an excellent balance between efficiency and accuracy in multi-modal 3D object detection. However, most existing methods focus on foreground objects while overlooking background context. This oversight compromises detection robustness, especially for occluded or small-sized objects, leading to suboptimal detection performance. To address this limitation, we propose a novel fully sparse fusion framework (PLPFusion), which introduces a hierarchical Plane-Line-Pixel representation to progressively model object-context relationships. PLPFusion comprises three key modules: the Plane Enhancement Module (PEM), the Line Alignment Module (LAM), and the Pixel-Level Aggregation Module (PLAM). First, PEM utilizes geometric cues from LiDAR feature planes to generate spatially-aware object queries. Second, LAM further refines these queries with geometric priors for semantic awareness. Finally, PLAM aggregates pixel-level context to enhance discriminative completeness by leveraging the semantically-aware object queries. On the nuScenes benchmark, PLPFusion achieves 71.9% mAP and 74.0% NDS, outperforming the baseline method FUTR3D by +2.5% mAP and +1.9% NDS, respectively. On the KITTI benchmark, it achieves 72.68% BEV mAP and 67.39% 3D mAP. These results confirm its robustness and effectiveness in diverse multi-modal 3D scenarios.
AB - Fully sparse fusion strikes an excellent balance between efficiency and accuracy in multi-modal 3D object detection. However, most existing methods focus on foreground objects while overlooking background context. This oversight compromises detection robustness, especially for occluded or small-sized objects, leading to suboptimal detection performance. To address this limitation, we propose a novel fully sparse fusion framework (PLPFusion), which introduces a hierarchical Plane-Line-Pixel representation to progressively model object-context relationships. PLPFusion comprises three key modules: the Plane Enhancement Module (PEM), the Line Alignment Module (LAM), and the Pixel-Level Aggregation Module (PLAM). First, PEM utilizes geometric cues from LiDAR feature planes to generate spatially-aware object queries. Second, LAM further refines these queries with geometric priors for semantic awareness. Finally, PLAM aggregates pixel-level context to enhance discriminative completeness by leveraging the semantically-aware object queries. On the nuScenes benchmark, PLPFusion achieves 71.9% mAP and 74.0% NDS, outperforming the baseline method FUTR3D by +2.5% mAP and +1.9% NDS, respectively. On the KITTI benchmark, it achieves 72.68% BEV mAP and 67.39% 3D mAP. These results confirm its robustness and effectiveness in diverse multi-modal 3D scenarios.
KW - Fully sparse fusion
KW - Hierarchical representation modeling
KW - LiDAR-Camera fusion
KW - Multi-modal 3D object detection
UR - https://www.scopus.com/pages/publications/105025417854
U2 - 10.1109/TCSVT.2025.3644414
DO - 10.1109/TCSVT.2025.3644414
M3 - Article
AN - SCOPUS:105025417854
SN - 1051-8215
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
ER -