TY - GEN
T1 - REVnet
T2 - 2019 IEEE International Conference on Multimedia and Expo, ICME 2019
AU - Li, Huidong
AU - Song, Dandan
AU - Liao, Lejian
AU - Peng, Cuimei
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/7
Y1 - 2019/7
N2 - Recently, the task of automatically generating a textual description of a video has attracted increasing interest. The attention-based encoder-decoder framework has been extensively applied in this domain. However, compared with other captioning tasks, such as image captioning, video captioning is more challenging because semantic information across frames is hard to extract. In this paper, we propose a reviewing network (REVnet) that reconstructs the previous hidden state and is combined with the conventional encoder-decoder framework. REVnet brings backward flow into the caption generation process, which encourages the hidden state to embed more information and makes the semantics of the generated sentence more coherent. Furthermore, REVnet can regularize the attention mechanism within the framework, encouraging the model to better utilize the semantic information extracted from multiple different frames. Our experimental results on benchmark datasets demonstrate that the proposed REVnet achieves a significant improvement over the baseline method. Furthermore, we use a reinforcement learning method to fine-tune the model and obtain better results than state-of-the-art methods.
AB - Recently, the task of automatically generating a textual description of a video has attracted increasing interest. The attention-based encoder-decoder framework has been extensively applied in this domain. However, compared with other captioning tasks, such as image captioning, video captioning is more challenging because semantic information across frames is hard to extract. In this paper, we propose a reviewing network (REVnet) that reconstructs the previous hidden state and is combined with the conventional encoder-decoder framework. REVnet brings backward flow into the caption generation process, which encourages the hidden state to embed more information and makes the semantics of the generated sentence more coherent. Furthermore, REVnet can regularize the attention mechanism within the framework, encouraging the model to better utilize the semantic information extracted from multiple different frames. Our experimental results on benchmark datasets demonstrate that the proposed REVnet achieves a significant improvement over the baseline method. Furthermore, we use a reinforcement learning method to fine-tune the model and obtain better results than state-of-the-art methods.
KW - Attention mechanism
KW - Backward flow
KW - Reinforcement learning
KW - Video caption
UR - http://www.scopus.com/inward/record.url?scp=85071000752&partnerID=8YFLogxK
U2 - 10.1109/ICME.2019.00228
DO - 10.1109/ICME.2019.00228
M3 - Conference contribution
AN - SCOPUS:85071000752
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
SP - 1312
EP - 1317
BT - Proceedings - 2019 IEEE International Conference on Multimedia and Expo, ICME 2019
PB - IEEE Computer Society
Y2 - 8 July 2019 through 12 July 2019
ER -