Multi-modal Dependency Tree for Video Captioning

Wentian Zhao, Xinxiao Wu^*, Jiebo Luo

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

22 Citations (Scopus)

Abstract

Generating fluent and relevant language to describe visual content is critical for the video captioning task. Many existing methods generate captions using sequence models that predict words in a left-to-right order. In this paper, we investigate a graph-structured model that explicitly models the hierarchical structure of sentences to further improve the fluency and relevance of the generated captions. To this end, we propose a novel video captioning method that generates a sentence by first constructing a multi-modal dependency tree and then traversing the constructed tree, where the syntactic structure and semantic relationships in the sentence are represented by the tree topology. To take full advantage of the information from both vision and language, both visual and textual representation features are encoded into each tree node. Unlike existing dependency parsing methods, which generate uni-modal dependency trees for language understanding, our method constructs multi-modal dependency trees for generating language from videos. We also propose a tree-structured reinforcement learning algorithm to effectively optimize the captioning model, where a novel reward is designed by evaluating the semantic consistency between the generated sub-trees and the ground-truth tree. Extensive experiments on several video captioning datasets demonstrate the effectiveness of the proposed method.
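
The "construct a tree, then traverse it" idea from the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that each node keeps ordered lists of left and right dependents and that an in-order traversal recovers the word order. The `Node` class, the `linearize` function, and the example tree are hypothetical, and the per-node visual and textual features mentioned in the abstract are omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One dependency-tree node (hypothetical structure).

    The abstract says each node encodes visual and textual features;
    this sketch keeps only the word to stay small.
    """
    word: str
    left: List["Node"] = field(default_factory=list)   # dependents before the head word
    right: List["Node"] = field(default_factory=list)  # dependents after the head word


def linearize(node: Node) -> List[str]:
    """In-order traversal: left dependents, then the head, then right dependents."""
    words: List[str] = []
    for child in node.left:
        words.extend(linearize(child))
    words.append(node.word)
    for child in node.right:
        words.extend(linearize(child))
    return words


# Hypothetical tree for "a man plays a guitar", rooted at the verb.
tree = Node(
    "plays",
    left=[Node("man", left=[Node("a")])],
    right=[Node("guitar", left=[Node("a")])],
)
caption = " ".join(linearize(tree))
print(caption)  # a man plays a guitar
```

In this sketch the tree topology alone fixes the word order, which is why a traversal is enough to turn the finished tree into a caption.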

Original language	English
Title of host publication	Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
Editors	Marc'Aurelio Ranzato, Alina Beygelzimer, Yann Dauphin, Percy S. Liang, Jenn Wortman Vaughan
Publisher	Neural information processing systems foundation
Pages	6634-6645
Number of pages	12
ISBN (Electronic)	9781713845393
Publication status	Published - 2021
Event	35th Conference on Neural Information Processing Systems, NeurIPS 2021 - Virtual, Online
Duration	6 Dec 2021 → 14 Dec 2021

Publication series

Name	Advances in Neural Information Processing Systems
Volume	8
ISSN (Print)	1049-5258

Conference

Conference	35th Conference on Neural Information Processing Systems, NeurIPS 2021
City	Virtual, Online
Period	6/12/21 → 14/12/21

Cite this

Zhao, W., Wu, X., & Luo, J. (2021). Multi-modal Dependency Tree for Video Captioning. In MA. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, & J. Wortman Vaughan (Eds.), Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021 (pp. 6634-6645). (Advances in Neural Information Processing Systems; Vol. 8). Neural information processing systems foundation.

@inproceedings{dc797f865a3f4388a16aea8ba703ad27,

title = "Multi-modal Dependency Tree for Video Captioning",

author = "Wentian Zhao and Xinxiao Wu and Jiebo Luo",

note = "Publisher Copyright: {\textcopyright} 2021 Neural information processing systems foundation. All rights reserved.; 35th Conference on Neural Information Processing Systems, NeurIPS 2021 ; Conference date: 06-12-2021 Through 14-12-2021",

year = "2021",

language = "English",

series = "Advances in Neural Information Processing Systems",

publisher = "Neural information processing systems foundation",

pages = "6634--6645",

editor = "Marc'Aurelio Ranzato and Alina Beygelzimer and Yann Dauphin and Liang, {Percy S.} and {Wortman Vaughan}, Jenn",

booktitle = "Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021",

}

TY - GEN

T1 - Multi-modal Dependency Tree for Video Captioning

AU - Zhao, Wentian

AU - Wu, Xinxiao

AU - Luo, Jiebo

PY - 2021

Y1 - 2021

UR - http://www.scopus.com/inward/record.url?scp=85131028684&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85131028684

T3 - Advances in Neural Information Processing Systems

SP - 6634

EP - 6645

BT - Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021

A2 - Ranzato, Marc'Aurelio

A2 - Beygelzimer, Alina

A2 - Dauphin, Yann

A2 - Liang, Percy S.

A2 - Wortman Vaughan, Jenn

PB - Neural information processing systems foundation

T2 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021

Y2 - 6 December 2021 through 14 December 2021

ER -
