Abstract
Camera and lidar are the key sources of information for autonomous vehicles (AVs). However, in current 3D object detection tasks, most pure point cloud networks outperform networks that fuse image and laser point cloud data. Existing studies attribute this to the viewpoint misalignment between image and lidar information and the difficulty of matching heterogeneous features, and a single-stage fusion algorithm struggles to fully fuse the features of the two modalities. For this reason, a novel 3D object detection method based on multilayer multimodal fusion (3DMMF) is presented. First, in the early-fusion stage, point clouds are locally encoded by Frustum-RGB-PointPainting (FRP), which is constructed from the 2D detection boxes. The encoded point clouds are then fed into a PointPillars detection network extended with a self-attention context-aware channel. In the late-fusion stage, the 2D and 3D candidate boxes are encoded as two sets of sparse tensors before non-maximum suppression, and the final 3D detection result is obtained with the camera-lidar object candidates fusion (CLOCs) network. Experiments on the KITTI dataset show that this fusion method achieves a significant performance improvement over the pure point cloud baseline, with an average mAP gain of 6.24%.
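The early-fusion step described above follows the PointPainting idea restricted to detection frustums: lidar points that project inside a 2D detection box are "painted" with that detection's class scores before being passed to the 3D network. The sketch below illustrates this mechanism only; the function names, score layout, and overlap-handling rule are assumptions for illustration, not the authors' released code.

```python
# Illustrative sketch of frustum-restricted point painting (FRP-style early
# fusion). Points projecting inside a 2D detection box receive that box's
# class scores as extra channels. Helper names and conventions here are
# hypothetical, chosen for this sketch.
import numpy as np

def project_to_image(points_xyz: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Project Nx3 lidar points into the image plane with a 3x4 camera matrix P."""
    homo = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])  # Nx4
    uvw = homo @ P.T                                                   # Nx3
    return uvw[:, :2] / uvw[:, 2:3]                                    # Nx2 pixel coords

def paint_points(points_xyz, P, boxes_2d, class_scores, num_classes):
    """Append per-point class scores for points inside any 2D box frustum.

    boxes_2d:     Mx4 array of (x1, y1, x2, y2) detection boxes.
    class_scores: M x num_classes softmax scores from the 2D detector.
    Returns an N x (3 + num_classes) "painted" point cloud.
    """
    uv = project_to_image(points_xyz, P)
    painted = np.zeros((points_xyz.shape[0], num_classes))
    for (x1, y1, x2, y2), scores in zip(boxes_2d, class_scores):
        inside = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & \
                 (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
        # Where frustums overlap, keep the higher-confidence painting.
        painted[inside] = np.maximum(painted[inside], scores)
    return np.hstack([points_xyz, painted])
```

The painted channels then travel with each point into the pillar encoder, so the downstream 3D network sees image semantics without any feature-map alignment step.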
| Translated title of the contribution | 3D Object Detection Based on Multilayer Multimodal Fusion |
| --- | --- |
| Original language | Traditional Chinese |
| Pages (from-to) | 696-708 |
| Number of pages | 13 |
| Journal | Tien Tzu Hsueh Pao/Acta Electronica Sinica |
| Volume | 52 |
| Issue number | 3 |
| DOI | |
| Publication status | Published - Mar 2024 |
Keywords
- 3D target detection
- autonomous driving
- multi-sensor fusion
- point cloud coding
- self-attention mechanism