Joint syntax representation learning and visual cue translation for video captioning

Jingyi Hou, Xinxiao Wu*, Wentian Zhao, Jiebo Luo, Yunde Jia

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

83 Citations (Scopus)

Abstract

Video captioning is a challenging task that involves not only visual perception but also syntax representation learning. Recent progress in video captioning has been achieved through visual perception, but syntax representation learning is still under-explored. We propose a novel video captioning approach that takes into account both visual perception and syntax representation learning to generate accurate descriptions of videos. Specifically, we use sentence templates composed of Part-of-Speech (POS) tags to represent the syntax structure of captions, and accordingly, syntax representation learning is performed by directly inferring POS tags from videos. The visual perception is implemented by a mixture model which translates visual cues into lexical words that are conditional on the learned syntactic structure of sentences. Thus, a video captioning task consists of two sub-tasks: Video POS tagging and visual cue translation, which are jointly modeled and trained in an end-to-end fashion. Evaluations on three public benchmark datasets demonstrate that our proposed method achieves substantially better performance than the state-of-the-art methods, which validates the superiority of joint modeling of syntax representation learning and visual perception for video captioning.

Original language: English
Title of host publication: Proceedings - 2019 International Conference on Computer Vision, ICCV 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 8917-8926
Number of pages: 10
ISBN (electronic): 9781728148038
Publication status: Published - Oct 2019
Event: 17th IEEE/CVF International Conference on Computer Vision, ICCV 2019 - Seoul, South Korea
Duration: 27 Oct 2019 to 2 Nov 2019

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
Volume: 2019-October
ISSN (print): 1550-5499

Conference

Conference: 17th IEEE/CVF International Conference on Computer Vision, ICCV 2019
Country/Territory: South Korea
City: Seoul
Period: 27/10/19 to 2/11/19
