TY - JOUR
T1 - STAT
T2 - Spatial-Temporal Attention Mechanism for Video Captioning
AU - Yan, Chenggang
AU - Tu, Yunbin
AU - Wang, Xingzheng
AU - Zhang, Yongbing
AU - Hao, Xinhong
AU - Zhang, Yongdong
AU - Dai, Qionghai
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2020/1
Y1 - 2020/1
N2 - Video captioning refers to automatically generating natural language sentences that summarize the video content. Inspired by the visual attention mechanism of human beings, the temporal attention mechanism has been widely used in video description to selectively focus on important frames. However, most existing methods based on the temporal attention mechanism suffer from recognition errors and missing details, because the temporal attention mechanism cannot further capture significant regions within frames. To address these problems, we propose a novel spatial-temporal attention mechanism (STAT) within an encoder-decoder neural network for video captioning. The proposed STAT takes into account both the spatial and temporal structures of a video, enabling the decoder to automatically select the significant regions in the most relevant temporal segments for word prediction. We evaluate STAT on two well-known benchmarks: MSVD and MSR-VTT-10K. Experimental results show that the proposed STAT achieves state-of-the-art performance on several popular evaluation metrics: BLEU-4, METEOR, and CIDEr.
AB - Video captioning refers to automatically generating natural language sentences that summarize the video content. Inspired by the visual attention mechanism of human beings, the temporal attention mechanism has been widely used in video description to selectively focus on important frames. However, most existing methods based on the temporal attention mechanism suffer from recognition errors and missing details, because the temporal attention mechanism cannot further capture significant regions within frames. To address these problems, we propose a novel spatial-temporal attention mechanism (STAT) within an encoder-decoder neural network for video captioning. The proposed STAT takes into account both the spatial and temporal structures of a video, enabling the decoder to automatically select the significant regions in the most relevant temporal segments for word prediction. We evaluate STAT on two well-known benchmarks: MSVD and MSR-VTT-10K. Experimental results show that the proposed STAT achieves state-of-the-art performance on several popular evaluation metrics: BLEU-4, METEOR, and CIDEr.
KW - Video captioning
KW - encoder-decoder neural networks
KW - spatial-temporal attention mechanism
UR - http://www.scopus.com/inward/record.url?scp=85077788801&partnerID=8YFLogxK
U2 - 10.1109/TMM.2019.2924576
DO - 10.1109/TMM.2019.2924576
M3 - Article
AN - SCOPUS:85077788801
SN - 1520-9210
VL - 22
SP - 229
EP - 241
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
IS - 1
M1 - 8744407
ER -