Three-Dimensional Object Detection Network Based on Multi-Layer and Multi-Modal Fusion

Wenming Zhu; Jia Zhou; Zizhe Wang; Xuehua Zhou; Feng Zhou; Jingwen Sun; Mingrui Song; Zhiguo Zhou

doi:10.3390/electronics13173512

Wenming Zhu, Jia Zhou, Zizhe Wang, Xuehua Zhou^*, Feng Zhou, Jingwen Sun, Mingrui Song, Zhiguo Zhou

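^*Corresponding author for this work

School of Integrated Circuits and Electronics

Research output: Contribution to journal › Article › peer-review

Abstract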

Cameras and LiDAR are important sensors in autonomous driving systems and provide complementary information. However, most LiDAR-only methods outperform fusion methods on the main benchmark datasets. Current studies attribute this to view misalignment and the difficulty of matching heterogeneous features. In particular, single-stage fusion methods struggle to fully fuse image and point cloud features. In this work, we propose a 3D object detection network based on the multi-layer and multi-modal fusion (3DMMF) method. 3DMMF works by painting and encoding the point cloud in the frustum proposed by the 2D object detection network. The painted point cloud is then fed to the LiDAR-only object detection network, which has expanded channels and a self-attention mechanism module. Finally, the camera-LiDAR object candidates fusion for 3D object detection (CLOCs) method is used to match the geometric direction features and category semantic features of the 2D and 3D detection results. Experiments on the public KITTI dataset show that this fusion method significantly improves on the LiDAR-only baseline, with an average mAP improvement of 6.3%.
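The painting step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the encoding, so everything here is an assumption made for illustration: points are taken to be already in the camera frame, `P` is a hypothetical 3x4 projection matrix, each 2D detection is a `(box, class_index, score)` tuple, and a point is "painted" by appending one score channel per class. The function names `project` and `paint_points` are likewise hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]          # (x, y, z), assumed to be in the camera frame
Box2D = Tuple[float, float, float, float]   # (u_min, v_min, u_max, v_max) in pixels
Detection = Tuple[Box2D, int, float]        # (box, class index, confidence score)


def project(point: Point, P: Sequence[Sequence[float]]) -> Tuple[float, float, float]:
    """Project a 3D point with a 3x4 matrix P and return (u, v, depth)."""
    x, y, z = point
    h = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in P]
    if h[2] <= 0:
        # Point is behind the camera: no valid pixel coordinates.
        return (float("nan"), float("nan"), h[2])
    return (h[0] / h[2], h[1] / h[2], h[2])


def paint_points(
    points: Sequence[Point],
    P: Sequence[Sequence[float]],
    detections: Sequence[Detection],
    num_classes: int,
) -> List[List[float]]:
    """Append per-class scores to every point that falls inside a 2D box frustum."""
    painted = []
    for p in points:
        u, v, depth = project(p, P)
        scores = [0.0] * num_classes
        if depth > 0:
            for (u0, v0, u1, v1), cls, score in detections:
                # A point lies in the frustum of a box if its projection is inside the box.
                if u0 <= u <= u1 and v0 <= v <= v1:
                    scores[cls] = max(scores[cls], score)
        painted.append(list(p) + scores)
    return painted


# Toy camera: focal length 100 px, principal point (50, 50).
P = [[100.0, 0.0, 50.0, 0.0],
     [0.0, 100.0, 50.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
points = [(0.0, 0.0, 10.0), (5.0, 0.0, 10.0), (0.0, 0.0, -1.0)]
detections = [((40.0, 40.0, 60.0, 60.0), 1, 0.9)]
painted = paint_points(points, P, detections, num_classes=2)
```

In this toy example only the first point projects inside the box, so only it receives a non-zero score in the channel for class 1. The extra channels are the reason a downstream LiDAR detector would need a wider input layer, which is consistent with the "expanded channels" mentioned in the abstract.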

Original language	English
Article number	3512
Journal	Electronics (Switzerland)
Volume	13
Issue number	17
DOI	https://doi.org/10.3390/electronics13173512
Publication status	Published - Sep 2024

Cite this

@article{df614c652bd8415697b160eb1c75467a,

title = "Three-Dimensional Object Detection Network Based on Multi-Layer and Multi-Modal Fusion",

abstract = "Cameras and LiDAR are important sensors in autonomous driving systems and provide complementary information. However, most LiDAR-only methods outperform fusion methods on the main benchmark datasets. Current studies attribute this to view misalignment and the difficulty of matching heterogeneous features. In particular, single-stage fusion methods struggle to fully fuse image and point cloud features. In this work, we propose a 3D object detection network based on the multi-layer and multi-modal fusion (3DMMF) method. 3DMMF works by painting and encoding the point cloud in the frustum proposed by the 2D object detection network. The painted point cloud is then fed to the LiDAR-only object detection network, which has expanded channels and a self-attention mechanism module. Finally, the camera-LiDAR object candidates fusion for 3D object detection (CLOCs) method is used to match the geometric direction features and category semantic features of the 2D and 3D detection results. Experiments on the public KITTI dataset show that this fusion method significantly improves on the LiDAR-only baseline, with an average mAP improvement of 6.3\%.",

keywords = "3D object detection, auto-driving, multi-sensor fusion, self-attention mechanism",

author = "Wenming Zhu and Jia Zhou and Zizhe Wang and Xuehua Zhou and Feng Zhou and Jingwen Sun and Mingrui Song and Zhiguo Zhou",

note = "Publisher Copyright: {\textcopyright} 2024 by the authors.",

year = "2024",

month = sep,

doi = "10.3390/electronics13173512",

language = "English",

volume = "13",

journal = "Electronics (Switzerland)",

issn = "2079-9292",

publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",

number = "17",

}

TY - JOUR

T1 - Three-Dimensional Object Detection Network Based on Multi-Layer and Multi-Modal Fusion

AU - Zhu, Wenming

AU - Zhou, Jia

AU - Wang, Zizhe

AU - Zhou, Xuehua

AU - Zhou, Feng

AU - Sun, Jingwen

AU - Song, Mingrui

AU - Zhou, Zhiguo

PY - 2024/9

Y1 - 2024/9

AB - Cameras and LiDAR are important sensors in autonomous driving systems and provide complementary information. However, most LiDAR-only methods outperform fusion methods on the main benchmark datasets. Current studies attribute this to view misalignment and the difficulty of matching heterogeneous features. In particular, single-stage fusion methods struggle to fully fuse image and point cloud features. In this work, we propose a 3D object detection network based on the multi-layer and multi-modal fusion (3DMMF) method. 3DMMF works by painting and encoding the point cloud in the frustum proposed by the 2D object detection network. The painted point cloud is then fed to the LiDAR-only object detection network, which has expanded channels and a self-attention mechanism module. Finally, the camera-LiDAR object candidates fusion for 3D object detection (CLOCs) method is used to match the geometric direction features and category semantic features of the 2D and 3D detection results. Experiments on the public KITTI dataset show that this fusion method significantly improves on the LiDAR-only baseline, with an average mAP improvement of 6.3%.

KW - 3D object detection

KW - auto-driving

KW - multi-sensor fusion

KW - self-attention mechanism

UR - http://www.scopus.com/inward/record.url?scp=85203659210&partnerID=8YFLogxK

U2 - 10.3390/electronics13173512

DO - 10.3390/electronics13173512

M3 - Article

AN - SCOPUS:85203659210

SN - 2079-9292

VL - 13

JO - Electronics (Switzerland)

JF - Electronics (Switzerland)

IS - 17

M1 - 3512

ER -
