Multi-modal Dependency Tree for Video Captioning

  • Wentian Zhao
  • Xinxiao Wu*
  • Jiebo Luo

  *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Generating fluent and relevant language to describe visual content is critical for the video captioning task. Many existing methods generate captions using sequence models that predict words in a left-to-right order. In this paper, we investigate a graph-structured model by explicitly modeling the hierarchical structure in the sentences to further improve the fluency and relevance of the generated captions. To this end, we propose a novel video captioning method that generates a sentence by first constructing a multi-modal dependency tree and then traversing the constructed tree, where the syntactic structure and semantic relationship in the sentence are represented by the tree topology. To take full advantage of the information from both vision and language, both the visual and textual representation features are encoded into each tree node. Different from existing dependency parsing methods that generate uni-modal dependency trees for language understanding, our method constructs multi-modal dependency trees for language generation of videos. We also propose a tree-structured reinforcement learning algorithm to effectively optimize the captioning model, where a novel reward is designed by evaluating the semantic consistency between the generated sub-trees and the ground-truth tree. Extensive experiments on several video captioning datasets demonstrate the effectiveness of the proposed method.
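
The "construct a tree, then traverse it" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` class, the `linearize` function, the left/right child split, and the example sentence are all hypothetical, and the visual and textual features that the abstract says are encoded in each node are omitted. The sketch only assumes that each head word keeps its dependents in surface order on either side, so that an in-order traversal recovers the sentence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical dependency-tree node holding only a word.

    The abstract describes nodes that also carry visual and textual
    features; those are left out of this illustration.
    """
    word: str
    left: List["Node"] = field(default_factory=list)   # dependents before the head
    right: List["Node"] = field(default_factory=list)  # dependents after the head


def linearize(node: Node) -> List[str]:
    """In-order traversal: left dependents, then the head, then right dependents."""
    words: List[str] = []
    for child in node.left:
        words.extend(linearize(child))
    words.append(node.word)
    for child in node.right:
        words.extend(linearize(child))
    return words


# Hand-built tree for an invented example caption (an assumption, not data
# from the paper). The root is the main verb.
tree = Node(
    "playing",
    left=[Node("man", left=[Node("a")]), Node("is")],
    right=[Node("guitar", left=[Node("a")])],
)

caption = " ".join(linearize(tree))
print(caption)
```

In this toy version the tree is built by hand; in a captioning model the tree would be predicted from the video, and the traversal order actually used by the authors may differ from the in-order walk assumed here.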

Original language: English
Title of host publication: Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
Editors: Marc'Aurelio Ranzato, Alina Beygelzimer, Yann Dauphin, Percy S. Liang, Jenn Wortman Vaughan
Publisher: Neural Information Processing Systems Foundation
Pages: 6634-6645
Number of pages: 12
ISBN (electronic): 9781713845393
Publication status: Published - 2021
Event: 35th Conference on Neural Information Processing Systems, NeurIPS 2021 - Virtual, Online
Duration: 6 Dec 2021 to 14 Dec 2021

Publication series

Name: Advances in Neural Information Processing Systems
8
ISSN (print): 1049-5258

Conference

Conference: 35th Conference on Neural Information Processing Systems, NeurIPS 2021
Virtual, Online
Period: 6 Dec 2021 to 14 Dec 2021
