Dense procedure captioning in narrated instructional videos

Botian Shi, Lei Ji, Yaobo Liang, Nan Duan, Peng Chen, Zhendong Niu^*, Ming Zhou

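^*Corresponding author for this work

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

39 Citations (Scopus)

Abstract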

Understanding narrated instructional videos is important for both research and real-world web applications. Motivated by video dense captioning, we propose a model to generate procedure captions from narrated instructional videos which are a sequence of stepwise clips with description. Previous works on video dense captioning learn video segments and generate captions without considering transcripts. We argue that transcripts in narrated instructional videos can enhance video representation by providing fine-grained complementary and semantic textual information. In this paper, we introduce a framework to (1) extract procedures by a cross-modality module, which fuses video content with the entire transcript; and (2) generate captions by encoding video frames as well as a snippet of transcripts within each extracted procedure. Experiments show that our model can achieve state-of-the-art performance in procedure extraction and captioning, and the ablation studies demonstrate that both the video frames and the transcripts are important for the task.

Original language	English
Title of host publication	ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
Publisher	Association for Computational Linguistics (ACL)
Pages	6382-6391
Number of pages	10
ISBN (Electronic)	9781950737482
Publication status	Published - 2020
Event	57th Annual Meeting of the Association for Computational Linguistics, ACL 2019 - Florence, Italy. Duration: 28 Jul 2019 → 2 Aug 2019

Publication series

Name	ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference

Conference

Conference	57th Annual Meeting of the Association for Computational Linguistics, ACL 2019
Country/Territory	Italy
City	Florence
Period	28/07/19 → 2/08/19

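Other files and links

Link to publication in Scopus: http://www.scopus.com/inward/record.url?scp=85084092597&partnerID=8YFLogxK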

Cite this

Shi, B., Ji, L., Liang, Y., Duan, N., Chen, P., Niu, Z., & Zhou, M. (2020). Dense procedure captioning in narrated instructional videos. In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (pp. 6382-6391). (ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference). Association for Computational Linguistics (ACL).

@inproceedings{36c33f389dae47d5846f34b667195f1c,
  title = "Dense procedure captioning in narrated instructional videos",
  abstract = "Understanding narrated instructional videos is important for both research and real-world web applications. Motivated by video dense captioning, we propose a model to generate procedure captions from narrated instructional videos which are a sequence of stepwise clips with description. Previous works on video dense captioning learn video segments and generate captions without considering transcripts. We argue that transcripts in narrated instructional videos can enhance video representation by providing fine-grained complementary and semantic textual information. In this paper, we introduce a framework to (1) extract procedures by a cross-modality module, which fuses video content with the entire transcript; and (2) generate captions by encoding video frames as well as a snippet of transcripts within each extracted procedure. Experiments show that our model can achieve state-of-the-art performance in procedure extraction and captioning, and the ablation studies demonstrate that both the video frames and the transcripts are important for the task.",
  author = "Botian Shi and Lei Ji and Yaobo Liang and Nan Duan and Peng Chen and Zhendong Niu and Ming Zhou",
  note = "Publisher Copyright: {\textcopyright} 2019 Association for Computational Linguistics; 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019 ; Conference date: 28-07-2019 Through 02-08-2019",
  year = "2020",
  language = "English",
  series = "ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference",
  publisher = "Association for Computational Linguistics (ACL)",
  pages = "6382--6391",
  booktitle = "ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference",
  address = "United States",
}

TY  - GEN
T1  - Dense procedure captioning in narrated instructional videos
AU  - Shi, Botian
AU  - Ji, Lei
AU  - Liang, Yaobo
AU  - Duan, Nan
AU  - Chen, Peng
AU  - Niu, Zhendong
AU  - Zhou, Ming
PY  - 2020
Y1  - 2020
AB  - Understanding narrated instructional videos is important for both research and real-world web applications. Motivated by video dense captioning, we propose a model to generate procedure captions from narrated instructional videos which are a sequence of stepwise clips with description. Previous works on video dense captioning learn video segments and generate captions without considering transcripts. We argue that transcripts in narrated instructional videos can enhance video representation by providing fine-grained complementary and semantic textual information. In this paper, we introduce a framework to (1) extract procedures by a cross-modality module, which fuses video content with the entire transcript; and (2) generate captions by encoding video frames as well as a snippet of transcripts within each extracted procedure. Experiments show that our model can achieve state-of-the-art performance in procedure extraction and captioning, and the ablation studies demonstrate that both the video frames and the transcripts are important for the task.
UR  - http://www.scopus.com/inward/record.url?scp=85084092597&partnerID=8YFLogxK
M3  - Conference contribution
AN  - SCOPUS:85084092597
T3  - ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
SP  - 6382
EP  - 6391
BT  - ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
PB  - Association for Computational Linguistics (ACL)
T2  - 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019
Y2  - 28 July 2019 through 2 August 2019
ER  -
