TY - JOUR
T1 - Scene captioning with deep fusion of images and point clouds
AU - Yu, Qiang
AU - Zhang, Chunxia
AU - Weng, Lubin
AU - Xiang, Shiming
AU - Pan, Chunhong
N1 - Publisher Copyright:
© 2022 Elsevier B.V.
PY - 2022/6
Y1 - 2022/6
N2 - Recently, the fusion of images and point clouds has received considerable attention in various fields, for example, autonomous driving, where its advantage over single-modal vision has been verified. However, it has not been extensively exploited in the scene captioning task. In this paper, a novel scene captioning framework with deep fusion of images and point clouds, based on region correlation and attention, is proposed to improve the performance of captioning models. In our model, a symmetrical processing pipeline is designed for point clouds and images. First, 3D and 2D region features are generated through region proposal generation, proposal fusion, and region pooling modules, respectively. Then, a feature fusion module is designed to integrate features according to the region correlation rule and the attention mechanism, which increases the interpretability of the fusion process and yields a sequence of fused visual features. Finally, the fused features are transformed into captions by an attention-based caption generation module. Comprehensive experiments indicate that the performance of our model reaches the state of the art.
AB - Recently, the fusion of images and point clouds has received considerable attention in various fields, for example, autonomous driving, where its advantage over single-modal vision has been verified. However, it has not been extensively exploited in the scene captioning task. In this paper, a novel scene captioning framework with deep fusion of images and point clouds, based on region correlation and attention, is proposed to improve the performance of captioning models. In our model, a symmetrical processing pipeline is designed for point clouds and images. First, 3D and 2D region features are generated through region proposal generation, proposal fusion, and region pooling modules, respectively. Then, a feature fusion module is designed to integrate features according to the region correlation rule and the attention mechanism, which increases the interpretability of the fusion process and yields a sequence of fused visual features. Finally, the fused features are transformed into captions by an attention-based caption generation module. Comprehensive experiments indicate that the performance of our model reaches the state of the art.
KW - Deep fusion
KW - Point cloud
KW - Scene captioning
UR - http://www.scopus.com/inward/record.url?scp=85128695406&partnerID=8YFLogxK
U2 - 10.1016/j.patrec.2022.04.017
DO - 10.1016/j.patrec.2022.04.017
M3 - Article
AN - SCOPUS:85128695406
SN - 0167-8655
VL - 158
SP - 9
EP - 15
JO - Pattern Recognition Letters
JF - Pattern Recognition Letters
ER -