Joint syntax representation learning and visual cue translation for video captioning

Jingyi Hou; Xinxiao Wu; Wentian Zhao; Jiebo Luo; Yunde Jia

doi:10.1109/ICCV.2019.00901

Jingyi Hou, Xinxiao Wu^*, Wentian Zhao, Jiebo Luo, Yunde Jia

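^*Corresponding author for this work

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-review

83 Citations (Scopus)

Abstract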

Video captioning is a challenging task that involves not only visual perception but also syntax representation learning. Recent progress in video captioning has been achieved through visual perception, but syntax representation learning is still under-explored. We propose a novel video captioning approach that takes into account both visual perception and syntax representation learning to generate accurate descriptions of videos. Specifically, we use sentence templates composed of Part-of-Speech (POS) tags to represent the syntax structure of captions, and accordingly, syntax representation learning is performed by directly inferring POS tags from videos. The visual perception is implemented by a mixture model which translates visual cues into lexical words that are conditional on the learned syntactic structure of sentences. Thus, a video captioning task consists of two sub-tasks: Video POS tagging and visual cue translation, which are jointly modeled and trained in an end-to-end fashion. Evaluations on three public benchmark datasets demonstrate that our proposed method achieves substantially better performance than the state-of-the-art methods, which validates the superiority of joint modeling of syntax representation learning and visual perception for video captioning.

Original language	English
Title of host publication	Proceedings - 2019 International Conference on Computer Vision, ICCV 2019
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	8917-8926
Number of pages	10
ISBN (Electronic)	9781728148038
DOI	https://doi.org/10.1109/ICCV.2019.00901
Publication status	Published - Oct 2019
Event	17th IEEE/CVF International Conference on Computer Vision, ICCV 2019 - Seoul, South Korea. Duration: 27 Oct 2019 → 2 Nov 2019

Publication series

Name	Proceedings of the IEEE International Conference on Computer Vision
Volume	2019-October
ISSN (Print)	1550-5499

Conference

Conference	17th IEEE/CVF International Conference on Computer Vision, ICCV 2019
Country/Territory	South Korea
City	Seoul
Period	27/10/19 → 2/11/19

Cite this

Hou, J., Wu, X., Zhao, W., Luo, J., & Jia, Y. (2019). Joint syntax representation learning and visual cue translation for video captioning. In Proceedings - 2019 International Conference on Computer Vision, ICCV 2019 (pp. 8917-8926). Article 9010931 (Proceedings of the IEEE International Conference on Computer Vision; Vol. 2019-October). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICCV.2019.00901

@inproceedings{989b09f0f6e748b0a40d7007a4aebe7a,

title = "Joint syntax representation learning and visual cue translation for video captioning",

abstract = "Video captioning is a challenging task that involves not only visual perception but also syntax representation learning. Recent progress in video captioning has been achieved through visual perception, but syntax representation learning is still under-explored. We propose a novel video captioning approach that takes into account both visual perception and syntax representation learning to generate accurate descriptions of videos. Specifically, we use sentence templates composed of Part-of-Speech (POS) tags to represent the syntax structure of captions, and accordingly, syntax representation learning is performed by directly inferring POS tags from videos. The visual perception is implemented by a mixture model which translates visual cues into lexical words that are conditional on the learned syntactic structure of sentences. Thus, a video captioning task consists of two sub-tasks: Video POS tagging and visual cue translation, which are jointly modeled and trained in an end-to-end fashion. Evaluations on three public benchmark datasets demonstrate that our proposed method achieves substantially better performance than the state-of-the-art methods, which validates the superiority of joint modeling of syntax representation learning and visual perception for video captioning.",

author = "Jingyi Hou and Xinxiao Wu and Wentian Zhao and Jiebo Luo and Yunde Jia",

note = "Publisher Copyright: {\textcopyright} 2019 IEEE.; 17th IEEE/CVF International Conference on Computer Vision, ICCV 2019 ; Conference date: 27-10-2019 Through 02-11-2019",

year = "2019",

month = oct,

doi = "10.1109/ICCV.2019.00901",

language = "English",

series = "Proceedings of the IEEE International Conference on Computer Vision",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "8917--8926",

booktitle = "Proceedings - 2019 International Conference on Computer Vision, ICCV 2019",

address = "United States",

}

Hou, J, Wu, X, Zhao, W, Luo, J & Jia, Y 2019, Joint syntax representation learning and visual cue translation for video captioning. in Proceedings - 2019 International Conference on Computer Vision, ICCV 2019, 9010931, Proceedings of the IEEE International Conference on Computer Vision, vol. 2019-October, Institute of Electrical and Electronics Engineers Inc., pp. 8917-8926, 17th IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, South Korea, 27/10/19. https://doi.org/10.1109/ICCV.2019.00901

Joint syntax representation learning and visual cue translation for video captioning. / Hou, Jingyi; Wu, Xinxiao; Zhao, Wentian et al.
Proceedings - 2019 International Conference on Computer Vision, ICCV 2019. Institute of Electrical and Electronics Engineers Inc., 2019. p. 8917-8926 9010931 (Proceedings of the IEEE International Conference on Computer Vision; Vol. 2019-October).

TY - GEN

T1 - Joint syntax representation learning and visual cue translation for video captioning

AU - Hou, Jingyi

AU - Wu, Xinxiao

AU - Zhao, Wentian

AU - Luo, Jiebo

AU - Jia, Yunde

PY - 2019/10

Y1 - 2019/10

N2 - Video captioning is a challenging task that involves not only visual perception but also syntax representation learning. Recent progress in video captioning has been achieved through visual perception, but syntax representation learning is still under-explored. We propose a novel video captioning approach that takes into account both visual perception and syntax representation learning to generate accurate descriptions of videos. Specifically, we use sentence templates composed of Part-of-Speech (POS) tags to represent the syntax structure of captions, and accordingly, syntax representation learning is performed by directly inferring POS tags from videos. The visual perception is implemented by a mixture model which translates visual cues into lexical words that are conditional on the learned syntactic structure of sentences. Thus, a video captioning task consists of two sub-tasks: Video POS tagging and visual cue translation, which are jointly modeled and trained in an end-to-end fashion. Evaluations on three public benchmark datasets demonstrate that our proposed method achieves substantially better performance than the state-of-the-art methods, which validates the superiority of joint modeling of syntax representation learning and visual perception for video captioning.

AB - Video captioning is a challenging task that involves not only visual perception but also syntax representation learning. Recent progress in video captioning has been achieved through visual perception, but syntax representation learning is still under-explored. We propose a novel video captioning approach that takes into account both visual perception and syntax representation learning to generate accurate descriptions of videos. Specifically, we use sentence templates composed of Part-of-Speech (POS) tags to represent the syntax structure of captions, and accordingly, syntax representation learning is performed by directly inferring POS tags from videos. The visual perception is implemented by a mixture model which translates visual cues into lexical words that are conditional on the learned syntactic structure of sentences. Thus, a video captioning task consists of two sub-tasks: Video POS tagging and visual cue translation, which are jointly modeled and trained in an end-to-end fashion. Evaluations on three public benchmark datasets demonstrate that our proposed method achieves substantially better performance than the state-of-the-art methods, which validates the superiority of joint modeling of syntax representation learning and visual perception for video captioning.

UR - http://www.scopus.com/inward/record.url?scp=85081904825&partnerID=8YFLogxK

U2 - 10.1109/ICCV.2019.00901

DO - 10.1109/ICCV.2019.00901

M3 - Conference contribution

AN - SCOPUS:85081904825

T3 - Proceedings of the IEEE International Conference on Computer Vision

SP - 8917

EP - 8926

BT - Proceedings - 2019 International Conference on Computer Vision, ICCV 2019

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 17th IEEE/CVF International Conference on Computer Vision, ICCV 2019

Y2 - 27 October 2019 through 2 November 2019

ER -

Hou J, Wu X, Zhao W, Luo J, Jia Y. Joint syntax representation learning and visual cue translation for video captioning. In Proceedings - 2019 International Conference on Computer Vision, ICCV 2019. Institute of Electrical and Electronics Engineers Inc. 2019. p. 8917-8926. 9010931. (Proceedings of the IEEE International Conference on Computer Vision). doi: 10.1109/ICCV.2019.00901
