Scene captioning with deep fusion of images and point clouds

Qiang Yu*, Chunxia Zhang, Lubin Weng, Shiming Xiang, Chunhong Pan

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Recently, the fusion of images and point clouds has received appreciable attention in various fields, for example autonomous driving, where its advantage over single-modal vision has been verified. However, it has not been extensively exploited in the scene captioning task. In this paper, a novel scene captioning framework with deep fusion of images and point clouds, based on region correlation and attention, is proposed to improve the performance of captioning models. In our model, a symmetrical processing pipeline is designed for point clouds and images. First, 3D and 2D region features are generated through region proposal generation, proposal fusion, and region pooling modules, respectively. Then, a feature fusion module integrates the features according to the region correlation rule and the attention mechanism, which increases the interpretability of the fusion process and yields a sequence of fused visual features. Finally, the fused features are transformed into captions by an attention-based caption generation module. Comprehensive experiments indicate that the performance of our model reaches the state of the art.
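To make the fusion step concrete, the following is a minimal NumPy sketch of the kind of per-region, attention-weighted fusion of correlated 2D and 3D features that the abstract describes. All names, dimensions, and the scoring function are illustrative assumptions, not the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_region_features(f2d, f3d, w_att):
    """Per-region attention over the two modalities, then a weighted sum.

    f2d, f3d : (N, D) region features, row-aligned by region correlation
               (each 3D proposal is paired with its corresponding 2D proposal).
    w_att    : (2*D, 2) projection producing one score per modality
               (a stand-in for a learned attention scorer).
    Returns  : (N, D) fused region features.
    """
    scores = np.concatenate([f2d, f3d], axis=1) @ w_att  # (N, 2)
    alpha = softmax(scores, axis=1)                      # modality weights per region
    return alpha[:, :1] * f2d + alpha[:, 1:] * f3d

# Toy usage: 5 correlated regions with 16-d features per modality.
N, D = 5, 16
f2d, f3d = rng.normal(size=(N, D)), rng.normal(size=(N, D))
w_att = rng.normal(size=(2 * D, 2))
fused = fuse_region_features(f2d, f3d, w_att)
print(fused.shape)  # (5, 16)

In the paper's full pipeline, the rows of f2d and f3d would come from the region pooling modules, and the attention scorer would be trained jointly with the caption generator; the random weights here merely stand in for it.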

Original language: English
Pages (from-to): 9-15
Number of pages: 7
Journal: Pattern Recognition Letters
Volume: 158
DOIs
Publication status: Published - Jun 2022

Keywords

  • Deep fusion
  • Point cloud
  • Scene captioning
