TY - JOUR
T1 - STAT
T2 - Spatial-Temporal Attention Mechanism for Video Captioning
AU - Yan, Chenggang
AU - Tu, Yunbin
AU - Wang, Xingzheng
AU - Zhang, Yongbing
AU - Hao, Xinhong
AU - Zhang, Yongdong
AU - Dai, Qionghai
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2020/1
Y1 - 2020/1
N2 - Video captioning refers to automatically generating natural language sentences that summarize the video content. Inspired by the visual attention mechanism of human beings, the temporal attention mechanism has been widely used in video description to selectively focus on important frames. However, most existing methods based on the temporal attention mechanism suffer from recognition errors and missing details, because the temporal attention mechanism cannot further capture significant regions within frames. To address these problems, we propose a novel spatial-temporal attention mechanism (STAT) within an encoder-decoder neural network for video captioning. The proposed STAT takes into account both the spatial and temporal structures of a video, enabling the decoder to automatically select the significant regions in the most relevant temporal segments for word prediction. We evaluate STAT on two well-known benchmarks: MSVD and MSR-VTT-10K. Experimental results show that the proposed STAT achieves state-of-the-art performance on several popular evaluation metrics: BLEU-4, METEOR, and CIDEr.
AB - Video captioning refers to automatically generating natural language sentences that summarize the video content. Inspired by the visual attention mechanism of human beings, the temporal attention mechanism has been widely used in video description to selectively focus on important frames. However, most existing methods based on the temporal attention mechanism suffer from recognition errors and missing details, because the temporal attention mechanism cannot further capture significant regions within frames. To address these problems, we propose a novel spatial-temporal attention mechanism (STAT) within an encoder-decoder neural network for video captioning. The proposed STAT takes into account both the spatial and temporal structures of a video, enabling the decoder to automatically select the significant regions in the most relevant temporal segments for word prediction. We evaluate STAT on two well-known benchmarks: MSVD and MSR-VTT-10K. Experimental results show that the proposed STAT achieves state-of-the-art performance on several popular evaluation metrics: BLEU-4, METEOR, and CIDEr.
KW - Video captioning
KW - encoder-decoder neural networks
KW - spatial-temporal attention mechanism
UR - http://www.scopus.com/inward/record.url?scp=85077788801&partnerID=8YFLogxK
U2 - 10.1109/TMM.2019.2924576
DO - 10.1109/TMM.2019.2924576
M3 - Article
AN - SCOPUS:85077788801
SN - 1520-9210
VL - 22
SP - 229
EP - 241
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
IS - 1
M1 - 8744407
ER -