Scene captioning with deep fusion of images and point clouds

Qiang Yu*, Chunxia Zhang, Lubin Weng, Shiming Xiang, Chunhong Pan

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Recently, the fusion of images and point clouds has received appreciable attention in various fields, for example, autonomous driving, where its advantage over single-modal vision has been verified. However, it has not been extensively exploited in the scene captioning task. In this paper, a novel scene captioning framework with deep fusion of images and point clouds, based on region correlation and attention, is proposed to improve the performance of captioning models. In our model, a symmetrical processing pipeline is designed for point clouds and images. First, 3D and 2D region features are generated respectively through region proposal generation, proposal fusion, and region pooling modules. Then, a feature fusion module is designed to integrate features according to the region correlation rule and the attention mechanism, which increases the interpretability of the fusion process and results in a sequence of fused visual features. Finally, the fused features are transformed into captions by an attention-based caption generation module. Comprehensive experiments indicate that the performance of our model reaches the state of the art.
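The abstract describes fusing paired 2D and 3D region features with an attention mechanism before caption generation. The sketch below illustrates the general idea of attention-weighted fusion over two modalities; the weighting scheme (scoring each modality by its mean activation) and all function names are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def attention_fuse(region_feats_2d, region_feats_3d):
    """Fuse correlated 2D/3D region feature pairs with a softmax
    attention over the two modalities (hypothetical sketch)."""
    # Stack modalities: (num_regions, 2, feat_dim)
    stacked = np.stack([region_feats_2d, region_feats_3d], axis=1)
    # Hypothetical scoring: rate each modality per region by mean activation
    scores = stacked.mean(axis=2)                        # (num_regions, 2)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    # Attention-weighted sum over the two modalities
    fused = (weights[..., None] * stacked).sum(axis=1)   # (num_regions, feat_dim)
    return fused, weights

rng = np.random.default_rng(0)
f2d = rng.standard_normal((4, 8))   # 4 correlated region pairs, 8-dim features
f3d = rng.standard_normal((4, 8))
fused, w = attention_fuse(f2d, f3d)
print(fused.shape)                  # each region keeps its feature dimension
```

In a trained model the scores would come from learned projections rather than raw activations; the point here is only that the fused feature is a convex combination of the two modality features per correlated region.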

Original language: English
Pages (from-to): 9-15
Number of pages: 7
Journal: Pattern Recognition Letters
Volume: 158
DOI
Publication status: Published - Jun 2022
