Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning

Botian Shi; Lei Ji; Zhendong Niu; Nan Duan; Ming Zhou; Xilin Chen

doi:10.1145/3394171.3413498

Botian Shi, Lei Ji^*, Zhendong Niu, Nan Duan, Ming Zhou, Xilin Chen

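^*Corresponding author for this work

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

18 Citations (Scopus)

Abstract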

Video captioning is a fundamental task for visual understanding. Previous works employ end-to-end networks that learn from low-level vision features to generate descriptive captions, but they struggle to recognize fine-grained objects and lack an understanding of crucial semantic concepts. According to DPC [19], these concepts are generally present in the narrative transcripts of instructional videos, and incorporating the transcript alongside the video can improve captioning performance. However, DPC directly concatenates the transcript embedding with the video features, which fails to fuse language and vision features effectively and leads to temporal misalignment between transcript and video. This motivates us to 1) learn the semantic concepts explicitly and 2) design a temporal alignment mechanism that better aligns the video and transcript for the captioning task. In this paper, we start with an encoder-decoder backbone built on transformer models. First, we design a semantic concept prediction module as a multi-task objective to train the encoder in a supervised way. Then, we develop an attention-based cross-modality temporal alignment method that combines the sequential video frames and transcript sentences. Finally, we adopt a copy mechanism that enables the decoder (generation) module to copy important concepts directly from the source transcript. Extensive experimental results demonstrate the effectiveness of our model, which achieves state-of-the-art results on the YouCookII dataset.

Original language	English
Title of host publication	MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia
Publisher	Association for Computing Machinery, Inc
Pages	4355-4363
Number of pages	9
ISBN (Electronic)	9781450379885
DOI	https://doi.org/10.1145/3394171.3413498
Publication status	Published - 12 Oct 2020
Event	28th ACM International Conference on Multimedia, MM 2020 - Virtual, Online, United States; Duration: 12 Oct 2020 → 16 Oct 2020

Publication series

Name	MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia

Conference

Conference	28th ACM International Conference on Multimedia, MM 2020
Country/Territory	United States
City	Virtual, Online
Period	12/10/20 → 16/10/20

Access to Document

10.1145/3394171.3413498

Cite this

Shi, B., Ji, L., Niu, Z., Duan, N., Zhou, M., & Chen, X. (2020). Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning. In MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia (pp. 4355-4363). (MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia). Association for Computing Machinery, Inc. https://doi.org/10.1145/3394171.3413498

@inproceedings{533862f80f874f2194b2dbb7f3d539cc,

title = "Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning",

abstract = "Video captioning is a fundamental task for visual understanding. Previous works employ end-to-end networks to learn from the low-level vision feature and generate descriptive captions, which are hard to recognize fine-grained objects and lacks the understanding of crucial semantic concepts. According to DPC [19], these concepts generally present in the narrative transcripts of the instructional videos. The incorporation of transcript and video can improve the captioning performance. However, DPC directly concatenates the embedding of transcript with video features, which is incapable of fusing language and vision features effectively and leads to the temporal mis-alignment between transcript and video. This motivates us to 1) learn the semantic concepts explicitly and 2) design a temporal alignment mechanism to better align the video and transcript for the captioning task. In this paper, we start with an encoder-decoder backbone using transformer models. Firstly, we design a semantic concept prediction module as a multi-task to train the encoder in a supervised way. Then, we develop an attention based cross-modality temporal alignment method that combines the sequential video frames and transcript sentences. Finally, we adopt a copy mechanism to enable the decoder(generation) module to copy important concepts from source transcript directly. The extensive experimental results demonstrate the effectiveness of our model, which achieves state-of-the-art results on YouCookII dataset.",

keywords = "semantic concept, video captioning, video summarization",

author = "Botian Shi and Lei Ji and Zhendong Niu and Nan Duan and Ming Zhou and Xilin Chen",

note = "Publisher Copyright: {\textcopyright} 2020 ACM.; 28th ACM International Conference on Multimedia, MM 2020 ; Conference date: 12-10-2020 Through 16-10-2020",

year = "2020",

month = oct,

day = "12",

doi = "10.1145/3394171.3413498",

language = "English",

series = "MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia",

publisher = "Association for Computing Machinery, Inc",

pages = "4355--4363",

booktitle = "MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia",

}

Shi, B, Ji, L, Niu, Z, Duan, N, Zhou, M & Chen, X 2020, Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning. in MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia. MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia, Association for Computing Machinery, Inc, pp. 4355-4363, 28th ACM International Conference on Multimedia, MM 2020, Virtual, Online, United States, 12/10/20. https://doi.org/10.1145/3394171.3413498

Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning. / Shi, Botian; Ji, Lei; Niu, Zhendong et al.
MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia. Association for Computing Machinery, Inc, 2020. p. 4355-4363 (MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia).

TY - GEN

T1 - Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning

AU - Shi, Botian

AU - Ji, Lei

AU - Niu, Zhendong

AU - Duan, Nan

AU - Zhou, Ming

AU - Chen, Xilin

PY - 2020/10/12

Y1 - 2020/10/12

N2 - Video captioning is a fundamental task for visual understanding. Previous works employ end-to-end networks to learn from the low-level vision feature and generate descriptive captions, which are hard to recognize fine-grained objects and lacks the understanding of crucial semantic concepts. According to DPC [19], these concepts generally present in the narrative transcripts of the instructional videos. The incorporation of transcript and video can improve the captioning performance. However, DPC directly concatenates the embedding of transcript with video features, which is incapable of fusing language and vision features effectively and leads to the temporal mis-alignment between transcript and video. This motivates us to 1) learn the semantic concepts explicitly and 2) design a temporal alignment mechanism to better align the video and transcript for the captioning task. In this paper, we start with an encoder-decoder backbone using transformer models. Firstly, we design a semantic concept prediction module as a multi-task to train the encoder in a supervised way. Then, we develop an attention based cross-modality temporal alignment method that combines the sequential video frames and transcript sentences. Finally, we adopt a copy mechanism to enable the decoder(generation) module to copy important concepts from source transcript directly. The extensive experimental results demonstrate the effectiveness of our model, which achieves state-of-the-art results on YouCookII dataset.

AB - Video captioning is a fundamental task for visual understanding. Previous works employ end-to-end networks to learn from the low-level vision feature and generate descriptive captions, which are hard to recognize fine-grained objects and lacks the understanding of crucial semantic concepts. According to DPC [19], these concepts generally present in the narrative transcripts of the instructional videos. The incorporation of transcript and video can improve the captioning performance. However, DPC directly concatenates the embedding of transcript with video features, which is incapable of fusing language and vision features effectively and leads to the temporal mis-alignment between transcript and video. This motivates us to 1) learn the semantic concepts explicitly and 2) design a temporal alignment mechanism to better align the video and transcript for the captioning task. In this paper, we start with an encoder-decoder backbone using transformer models. Firstly, we design a semantic concept prediction module as a multi-task to train the encoder in a supervised way. Then, we develop an attention based cross-modality temporal alignment method that combines the sequential video frames and transcript sentences. Finally, we adopt a copy mechanism to enable the decoder(generation) module to copy important concepts from source transcript directly. The extensive experimental results demonstrate the effectiveness of our model, which achieves state-of-the-art results on YouCookII dataset.

KW - semantic concept

KW - video captioning

KW - video summarization

UR - http://www.scopus.com/inward/record.url?scp=85096362942&partnerID=8YFLogxK

U2 - 10.1145/3394171.3413498

DO - 10.1145/3394171.3413498

M3 - Conference contribution

AN - SCOPUS:85096362942

T3 - MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia

SP - 4355

EP - 4363

BT - MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia

PB - Association for Computing Machinery, Inc

T2 - 28th ACM International Conference on Multimedia, MM 2020

Y2 - 12 October 2020 through 16 October 2020

ER -

Shi B, Ji L, Niu Z, Duan N, Zhou M, Chen X. Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning. In MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia. Association for Computing Machinery, Inc. 2020. p. 4355-4363. (MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia). doi: 10.1145/3394171.3413498
