STAT: Spatial-Temporal Attention Mechanism for Video Captioning

Chenggang Yan, Yunbin Tu, Xingzheng Wang*, Yongbing Zhang, Xinhong Hao, Yongdong Zhang, Qionghai Dai

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

280 Citations (Scopus)

Abstract

Video captioning refers to automatically generating natural language sentences that summarize the content of a video. Inspired by the visual attention mechanism of human beings, temporal attention mechanisms have been widely used in video description to selectively focus on important frames. However, most existing methods based on temporal attention suffer from recognition errors and missing details, because temporal attention alone cannot capture the significant regions within frames. To address these problems, we propose a novel spatial-temporal attention mechanism (STAT) within an encoder-decoder neural network for video captioning. The proposed STAT takes into account both the spatial and temporal structures of a video, enabling the decoder to automatically select the significant regions in the most relevant temporal segments for word prediction. We evaluate STAT on two well-known benchmarks: MSVD and MSR-VTT-10K. Experimental results show that the proposed STAT achieves state-of-the-art performance on several popular evaluation metrics: BLEU-4, METEOR, and CIDEr.
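The following is a minimal sketch of the two-stage idea described in the abstract: spatial attention first weights regions within each frame, then temporal attention weights the attended frames, both conditioned on the decoder hidden state. The class name, dimensions, and the additive (Bahdanau-style) scoring functions are illustrative assumptions for a PyTorch setting, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialTemporalAttention(nn.Module):
    """Illustrative spatial-temporal attention (not the authors' code):
    at each decoding step, attend over regions within every frame, then
    over frames, conditioned on the current decoder hidden state."""

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        # spatial attention parameters (over regions)
        self.w_s_feat = nn.Linear(feat_dim, attn_dim)
        self.w_s_hid = nn.Linear(hidden_dim, attn_dim)
        self.v_s = nn.Linear(attn_dim, 1)
        # temporal attention parameters (over frames)
        self.w_t_feat = nn.Linear(feat_dim, attn_dim)
        self.w_t_hid = nn.Linear(hidden_dim, attn_dim)
        self.v_t = nn.Linear(attn_dim, 1)

    def forward(self, region_feats, hidden):
        # region_feats: (B, T, R, D) region features for T frames, R regions each
        # hidden:       (B, H) current decoder hidden state
        B, T, R, D = region_feats.shape

        # spatial attention: weight the R regions inside every frame
        h_s = self.w_s_hid(hidden).view(B, 1, 1, -1)
        e_s = self.v_s(torch.tanh(self.w_s_feat(region_feats) + h_s))  # (B, T, R, 1)
        alpha_s = F.softmax(e_s, dim=2)
        frame_feats = (alpha_s * region_feats).sum(dim=2)              # (B, T, D)

        # temporal attention: weight the T attended frame vectors
        h_t = self.w_t_hid(hidden).view(B, 1, -1)
        e_t = self.v_t(torch.tanh(self.w_t_feat(frame_feats) + h_t))   # (B, T, 1)
        alpha_t = F.softmax(e_t, dim=1)
        context = (alpha_t * frame_feats).sum(dim=1)                   # (B, D)
        return context, alpha_s.squeeze(-1), alpha_t.squeeze(-1)


if __name__ == "__main__":
    # hypothetical feature sizes, chosen only for the demo
    attn = SpatialTemporalAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
    regions = torch.randn(2, 26, 36, 2048)   # 2 videos, 26 frames, 36 regions
    h = torch.randn(2, 512)
    ctx, a_s, a_t = attn(regions, h)
    print(ctx.shape, a_s.shape, a_t.shape)
```

In a full captioning model, the returned context vector would be concatenated with the previous word embedding and fed to the decoder RNN at each step; the softmax over regions and frames is what lets the decoder pick out salient objects in the most relevant temporal segments.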

Original language: English
Article number: 8744407
Pages (from-to): 229-241
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Volume: 22
Issue number: 1
DOI
Publication status: Published - Jan 2020
