REVnet: Bring reviewing into video captioning for a better description

Huidong Li, Dandan Song, Lejian Liao, Cuimei Peng

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

7 Citations (Scopus)

Abstract

Recently, the task of automatically generating a textual description of a video has attracted increasing interest. The attention-based encoder-decoder framework has been extensively applied in this domain. However, compared with other captioning tasks, such as image captioning, video captioning is more challenging because semantic information across frames is hard to extract. In this paper, we propose a reviewing network (REVnet) that reconstructs the previous hidden state and is combined with the conventional encoder-decoder framework. REVnet brings backward flow into the caption generation process, which encourages the hidden state to embed more information and makes the semantics of the generated sentence more coherent. Furthermore, REVnet can regularize the attention mechanism within the framework, which encourages the model to better utilize the semantic information extracted from multiple frames. Experimental results on benchmark datasets demonstrate that REVnet achieves a significant improvement over the baseline method. Furthermore, we use a reinforcement learning method to fine-tune the model and obtain better results than state-of-the-art methods.
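
The idea of reconstructing the previous hidden state can be illustrated with a small sketch. The abstract does not give the reviewer architecture or the loss formula, so everything below is an assumption made for illustration: the function names `linear` and `reconstruction_loss`, the single tanh layer standing in for the reviewing network, and the mean squared error used as the reconstruction penalty.

```python
import math
import random

def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def reconstruction_loss(hidden_states, weights):
    """Mean squared error between each h_{t-1} and its estimate from h_t.

    hidden_states: list of T decoder hidden states, each a list of floats.
    weights: hypothetical reviewer parameters (a square matrix). A single
             tanh layer is an assumption, not the paper's architecture.
    """
    total, count = 0.0, 0
    for t in range(1, len(hidden_states)):
        # Backward flow: estimate the previous state from the current one.
        estimate = [math.tanh(v) for v in linear(weights, hidden_states[t])]
        target = hidden_states[t - 1]
        total += sum((e - g) ** 2 for e, g in zip(estimate, target))
        count += len(target)
    return total / count if count else 0.0

# Toy data standing in for decoder hidden states (not real model output).
random.seed(0)
dim, steps = 4, 5
states = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(steps)]
reviewer = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
loss = reconstruction_loss(states, reviewer)
print(f"reconstruction loss: {loss:.4f}")
```

In a setup like this, the reconstruction term would presumably be added to the usual captioning loss with some weighting factor, so that gradients from the reviewer also shape the decoder's hidden states. That combination is likewise an assumption here, not a detail stated in the abstract.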

Original language: English
Title of host publication: Proceedings - 2019 IEEE International Conference on Multimedia and Expo, ICME 2019
Publisher: IEEE Computer Society
Pages: 1312-1317
Number of pages: 6
ISBN (Electronic): 9781538695524
DOIs
Publication status: Published - Jul 2019
Event: 2019 IEEE International Conference on Multimedia and Expo, ICME 2019 - Shanghai, China
Duration: 8 Jul 2019 to 12 Jul 2019

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
Volume: 2019-July
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Conference

Conference: 2019 IEEE International Conference on Multimedia and Expo, ICME 2019
Country/Territory: China
City: Shanghai
Period: 8/07/19 to 12/07/19

Keywords

  • Attention mechanism
  • Backward flow
  • Reinforcement learning
  • Video caption
