Dense procedure captioning in narrated instructional videos

Botian Shi, Lei Ji, Yaobo Liang, Nan Duan, Peng Chen, Zhendong Niu*, Ming Zhou

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

34 Citations (Scopus)

Abstract

Understanding narrated instructional videos is important for both research and real-world web applications. Motivated by video dense captioning, we propose a model to generate procedure captions from narrated instructional videos, which are a sequence of stepwise clips with description. Previous works on video dense captioning learn video segments and generate captions without considering transcripts. We argue that transcripts in narrated instructional videos can enhance video representation by providing fine-grained complementary and semantic textual information. In this paper, we introduce a framework to (1) extract procedures by a cross-modality module, which fuses video content with the entire transcript; and (2) generate captions by encoding video frames as well as a snippet of transcripts within each extracted procedure. Experiments show that our model can achieve state-of-the-art performance in procedure extraction and captioning, and the ablation studies demonstrate that both the video frames and the transcripts are important for the task.

Original language: English
Title of host publication: ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 6382-6391
Number of pages: 10
ISBN (Electronic): 9781950737482
Publication status: Published - 2020
Event: 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019 - Florence, Italy
Duration: 28 Jul 2019 to 2 Aug 2019

Publication series

Name: ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference

Conference

Conference: 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019
Country/Territory: Italy
City: Florence
Period: 28/07/19 to 2/08/19
