Joint syntax representation learning and visual cue translation for video captioning

Jingyi Hou; Xinxiao Wu; Wentian Zhao; Jiebo Luo; Yunde Jia

doi:10.1109/ICCV.2019.00901

Jingyi Hou, Xinxiao Wu^*, Wentian Zhao, Jiebo Luo, Yunde Jia

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

83 Citations (Scopus)

Abstract

Video captioning is a challenging task that involves not only visual perception but also syntax representation learning. Recent progress in video captioning has been achieved through visual perception, but syntax representation learning is still under-explored. We propose a novel video captioning approach that takes into account both visual perception and syntax representation learning to generate accurate descriptions of videos. Specifically, we use sentence templates composed of Part-of-Speech (POS) tags to represent the syntax structure of captions, and accordingly, syntax representation learning is performed by directly inferring POS tags from videos. The visual perception is implemented by a mixture model which translates visual cues into lexical words that are conditional on the learned syntactic structure of sentences. Thus, a video captioning task consists of two sub-tasks: Video POS tagging and visual cue translation, which are jointly modeled and trained in an end-to-end fashion. Evaluations on three public benchmark datasets demonstrate that our proposed method achieves substantially better performance than the state-of-the-art methods, which validates the superiority of joint modeling of syntax representation learning and visual perception for video captioning.

Original language	English
Title of host publication	Proceedings - 2019 International Conference on Computer Vision, ICCV 2019
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	8917-8926
Number of pages	10
ISBN (Electronic)	9781728148038
DOIs	https://doi.org/10.1109/ICCV.2019.00901
Publication status	Published - Oct 2019
Event	17th IEEE/CVF International Conference on Computer Vision, ICCV 2019 - Seoul, Korea, Republic of; Duration: 27 Oct 2019 → 2 Nov 2019

Publication series

Name	Proceedings of the IEEE International Conference on Computer Vision
Volume	2019-October
ISSN (Print)	1550-5499

Conference

Conference	17th IEEE/CVF International Conference on Computer Vision, ICCV 2019
Country/Territory	Korea, Republic of
City	Seoul
Period	27/10/19 → 2/11/19

Access to Document

10.1109/ICCV.2019.00901

Cite this

Hou, J., Wu, X., Zhao, W., Luo, J., & Jia, Y. (2019). Joint syntax representation learning and visual cue translation for video captioning. In Proceedings - 2019 International Conference on Computer Vision, ICCV 2019 (pp. 8917-8926). Article 9010931 (Proceedings of the IEEE International Conference on Computer Vision; Vol. 2019-October). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICCV.2019.00901

@inproceedings{989b09f0f6e748b0a40d7007a4aebe7a,

title = "Joint syntax representation learning and visual cue translation for video captioning",

abstract = "Video captioning is a challenging task that involves not only visual perception but also syntax representation learning. Recent progress in video captioning has been achieved through visual perception, but syntax representation learning is still under-explored. We propose a novel video captioning approach that takes into account both visual perception and syntax representation learning to generate accurate descriptions of videos. Specifically, we use sentence templates composed of Part-of-Speech (POS) tags to represent the syntax structure of captions, and accordingly, syntax representation learning is performed by directly inferring POS tags from videos. The visual perception is implemented by a mixture model which translates visual cues into lexical words that are conditional on the learned syntactic structure of sentences. Thus, a video captioning task consists of two sub-tasks: Video POS tagging and visual cue translation, which are jointly modeled and trained in an end-to-end fashion. Evaluations on three public benchmark datasets demonstrate that our proposed method achieves substantially better performance than the state-of-the-art methods, which validates the superiority of joint modeling of syntax representation learning and visual perception for video captioning.",

author = "Jingyi Hou and Xinxiao Wu and Wentian Zhao and Jiebo Luo and Yunde Jia",

note = "Publisher Copyright: {\textcopyright} 2019 IEEE.; 17th IEEE/CVF International Conference on Computer Vision, ICCV 2019 ; Conference date: 27-10-2019 Through 02-11-2019",

year = "2019",

month = oct,

doi = "10.1109/ICCV.2019.00901",

language = "English",

series = "Proceedings of the IEEE International Conference on Computer Vision",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "8917--8926",

booktitle = "Proceedings - 2019 International Conference on Computer Vision, ICCV 2019",

address = "United States",

}

Hou, J, Wu, X, Zhao, W, Luo, J & Jia, Y 2019, Joint syntax representation learning and visual cue translation for video captioning. in Proceedings - 2019 International Conference on Computer Vision, ICCV 2019, 9010931, Proceedings of the IEEE International Conference on Computer Vision, vol. 2019-October, Institute of Electrical and Electronics Engineers Inc., pp. 8917-8926, 17th IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea, Republic of, 27/10/19. https://doi.org/10.1109/ICCV.2019.00901

Joint syntax representation learning and visual cue translation for video captioning. / Hou, Jingyi; Wu, Xinxiao; Zhao, Wentian et al.
Proceedings - 2019 International Conference on Computer Vision, ICCV 2019. Institute of Electrical and Electronics Engineers Inc., 2019. p. 8917-8926 9010931 (Proceedings of the IEEE International Conference on Computer Vision; Vol. 2019-October).

TY - GEN

T1 - Joint syntax representation learning and visual cue translation for video captioning

AU - Hou, Jingyi

AU - Wu, Xinxiao

AU - Zhao, Wentian

AU - Luo, Jiebo

AU - Jia, Yunde

PY - 2019/10

Y1 - 2019/10

N2 - Video captioning is a challenging task that involves not only visual perception but also syntax representation learning. Recent progress in video captioning has been achieved through visual perception, but syntax representation learning is still under-explored. We propose a novel video captioning approach that takes into account both visual perception and syntax representation learning to generate accurate descriptions of videos. Specifically, we use sentence templates composed of Part-of-Speech (POS) tags to represent the syntax structure of captions, and accordingly, syntax representation learning is performed by directly inferring POS tags from videos. The visual perception is implemented by a mixture model which translates visual cues into lexical words that are conditional on the learned syntactic structure of sentences. Thus, a video captioning task consists of two sub-tasks: Video POS tagging and visual cue translation, which are jointly modeled and trained in an end-to-end fashion. Evaluations on three public benchmark datasets demonstrate that our proposed method achieves substantially better performance than the state-of-the-art methods, which validates the superiority of joint modeling of syntax representation learning and visual perception for video captioning.

AB - Video captioning is a challenging task that involves not only visual perception but also syntax representation learning. Recent progress in video captioning has been achieved through visual perception, but syntax representation learning is still under-explored. We propose a novel video captioning approach that takes into account both visual perception and syntax representation learning to generate accurate descriptions of videos. Specifically, we use sentence templates composed of Part-of-Speech (POS) tags to represent the syntax structure of captions, and accordingly, syntax representation learning is performed by directly inferring POS tags from videos. The visual perception is implemented by a mixture model which translates visual cues into lexical words that are conditional on the learned syntactic structure of sentences. Thus, a video captioning task consists of two sub-tasks: Video POS tagging and visual cue translation, which are jointly modeled and trained in an end-to-end fashion. Evaluations on three public benchmark datasets demonstrate that our proposed method achieves substantially better performance than the state-of-the-art methods, which validates the superiority of joint modeling of syntax representation learning and visual perception for video captioning.

UR - http://www.scopus.com/inward/record.url?scp=85081904825&partnerID=8YFLogxK

U2 - 10.1109/ICCV.2019.00901

DO - 10.1109/ICCV.2019.00901

M3 - Conference contribution

AN - SCOPUS:85081904825

T3 - Proceedings of the IEEE International Conference on Computer Vision

SP - 8917

EP - 8926

BT - Proceedings - 2019 International Conference on Computer Vision, ICCV 2019

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 17th IEEE/CVF International Conference on Computer Vision, ICCV 2019

Y2 - 27 October 2019 through 2 November 2019

ER -

Hou J, Wu X, Zhao W, Luo J, Jia Y. Joint syntax representation learning and visual cue translation for video captioning. In Proceedings - 2019 International Conference on Computer Vision, ICCV 2019. Institute of Electrical and Electronics Engineers Inc. 2019. p. 8917-8926. 9010931. (Proceedings of the IEEE International Conference on Computer Vision). doi: 10.1109/ICCV.2019.00901
