Scene captioning with deep fusion of images and point clouds

Qiang Yu*, Chunxia Zhang, Lubin Weng, Shiming Xiang, Chunhong Pan

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Recently, the fusion of images and point clouds has received appreciable attention in various fields, for example autonomous driving, where its advantage over single-modal vision has been verified. However, it has not been extensively exploited in the scene captioning task. In this paper, a novel scene captioning framework with deep fusion of images and point clouds, based on region correlation and attention, is proposed to improve the performance of captioning models. In our model, a symmetrical processing pipeline is designed for point clouds and images. First, 3D and 2D region features are generated through region proposal generation, proposal fusion, and region pooling modules, respectively. Then, a feature fusion module integrates the features according to the region correlation rule and the attention mechanism, which increases the interpretability of the fusion process and yields a sequence of fused visual features. Finally, the fused features are transformed into captions by an attention-based caption generation module. Comprehensive experiments indicate that the performance of our model reaches the state of the art.
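To make the fusion step concrete, the following is a minimal NumPy sketch of the kind of per-region, attention-weighted fusion of correlated 2D and 3D features that the abstract describes. All names, dimensions, and the scoring function are illustrative assumptions, not the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_region_features(f2d, f3d, w_att):
    """Per-region attention over the two modalities, then a weighted sum.

    f2d, f3d : (N, D) region features, row-aligned by region correlation
               (each 3D proposal is paired with its corresponding 2D proposal).
    w_att    : (2*D, 2) projection producing one score per modality
               (a stand-in for a learned attention scorer).
    Returns  : (N, D) fused region features.
    """
    scores = np.concatenate([f2d, f3d], axis=1) @ w_att  # (N, 2)
    alpha = softmax(scores, axis=1)                      # modality weights per region
    return alpha[:, :1] * f2d + alpha[:, 1:] * f3d

# Toy usage: 5 correlated regions with 16-d features per modality.
N, D = 5, 16
f2d, f3d = rng.normal(size=(N, D)), rng.normal(size=(N, D))
w_att = rng.normal(size=(2 * D, 2))
fused = fuse_region_features(f2d, f3d, w_att)
print(fused.shape)  # (5, 16)

In the paper's full pipeline, the rows of f2d and f3d would come from the region pooling modules, and the attention scorer would be trained jointly with the caption generator; the random weights here merely stand in for it.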

Original language: English
Pages (from-to): 9-15
Number of pages: 7
Journal: Pattern Recognition Letters
Volume: 158
DOIs
Publication status: Published - Jun 2022

Keywords

  • Deep fusion
  • Point cloud
  • Scene captioning
