Dense procedure captioning in narrated instructional videos

Botian Shi, Lei Ji, Yaobo Liang, Nan Duan, Peng Chen, Zhendong Niu^*, Ming Zhou

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

39 Citations (Scopus)

Abstract

Understanding narrated instructional videos is important for both research and real-world web applications. Motivated by video dense captioning, we propose a model to generate procedure captions from narrated instructional videos which are a sequence of stepwise clips with description. Previous works on video dense captioning learn video segments and generate captions without considering transcripts. We argue that transcripts in narrated instructional videos can enhance video representation by providing fine-grained complementary and semantic textual information. In this paper, we introduce a framework to (1) extract procedures by a cross-modality module, which fuses video content with the entire transcript; and (2) generate captions by encoding video frames as well as a snippet of transcripts within each extracted procedure. Experiments show that our model can achieve state-of-the-art performance in procedure extraction and captioning, and the ablation studies demonstrate that both the video frames and the transcripts are important for the task.

Original language	English
Title of host publication	ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
Publisher	Association for Computational Linguistics (ACL)
Pages	6382-6391
Number of pages	10
ISBN (Electronic)	9781950737482
Publication status	Published - 2020
Event	57th Annual Meeting of the Association for Computational Linguistics, ACL 2019 - Florence, Italy (28 Jul 2019 → 2 Aug 2019)

Publication series

Name	ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference

Conference

Conference	57th Annual Meeting of the Association for Computational Linguistics, ACL 2019
Country/Territory	Italy
City	Florence
Period	28/07/19 → 2/08/19

Cite this

Shi, B., Ji, L., Liang, Y., Duan, N., Chen, P., Niu, Z., & Zhou, M. (2020). Dense procedure captioning in narrated instructional videos. In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (pp. 6382-6391). (ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference). Association for Computational Linguistics (ACL).

Shi, Botian ; Ji, Lei ; Liang, Yaobo et al. / Dense procedure captioning in narrated instructional videos. ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference. Association for Computational Linguistics (ACL), 2020. pp. 6382-6391 (ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference).

@inproceedings{36c33f389dae47d5846f34b667195f1c,
title = "Dense procedure captioning in narrated instructional videos",
abstract = "Understanding narrated instructional videos is important for both research and real-world web applications. Motivated by video dense captioning, we propose a model to generate procedure captions from narrated instructional videos which are a sequence of stepwise clips with description. Previous works on video dense captioning learn video segments and generate captions without considering transcripts. We argue that transcripts in narrated instructional videos can enhance video representation by providing fine-grained complementary and semantic textual information. In this paper, we introduce a framework to (1) extract procedures by a cross-modality module, which fuses video content with the entire transcript; and (2) generate captions by encoding video frames as well as a snippet of transcripts within each extracted procedure. Experiments show that our model can achieve state-of-the-art performance in procedure extraction and captioning, and the ablation studies demonstrate that both the video frames and the transcripts are important for the task.",
author = "Botian Shi and Lei Ji and Yaobo Liang and Nan Duan and Peng Chen and Zhendong Niu and Ming Zhou",
note = "Publisher Copyright: {\textcopyright} 2019 Association for Computational Linguistics; 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019 ; Conference date: 28-07-2019 Through 02-08-2019",
year = "2020",
language = "English",
series = "ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference",
publisher = "Association for Computational Linguistics (ACL)",
pages = "6382--6391",
booktitle = "ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference",
address = "United States",
}

Shi, B, Ji, L, Liang, Y, Duan, N, Chen, P, Niu, Z & Zhou, M 2020, Dense procedure captioning in narrated instructional videos. in ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference. ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Association for Computational Linguistics (ACL), pp. 6382-6391, 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, Florence, Italy, 28/07/19.

Dense procedure captioning in narrated instructional videos. / Shi, Botian; Ji, Lei; Liang, Yaobo et al.
ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference. Association for Computational Linguistics (ACL), 2020. p. 6382-6391 (ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference).

TY - GEN
T1 - Dense procedure captioning in narrated instructional videos
AU - Shi, Botian
AU - Ji, Lei
AU - Liang, Yaobo
AU - Duan, Nan
AU - Chen, Peng
AU - Niu, Zhendong
AU - Zhou, Ming
PY - 2020
Y1 - 2020
AB - Understanding narrated instructional videos is important for both research and real-world web applications. Motivated by video dense captioning, we propose a model to generate procedure captions from narrated instructional videos which are a sequence of stepwise clips with description. Previous works on video dense captioning learn video segments and generate captions without considering transcripts. We argue that transcripts in narrated instructional videos can enhance video representation by providing fine-grained complementary and semantic textual information. In this paper, we introduce a framework to (1) extract procedures by a cross-modality module, which fuses video content with the entire transcript; and (2) generate captions by encoding video frames as well as a snippet of transcripts within each extracted procedure. Experiments show that our model can achieve state-of-the-art performance in procedure extraction and captioning, and the ablation studies demonstrate that both the video frames and the transcripts are important for the task.
UR - http://www.scopus.com/inward/record.url?scp=85084092597&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85084092597
T3 - ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
SP - 6382
EP - 6391
BT - ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
T2 - 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019
Y2 - 28 July 2019 through 2 August 2019
ER -

Shi B, Ji L, Liang Y, Duan N, Chen P, Niu Z et al. Dense procedure captioning in narrated instructional videos. In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference. Association for Computational Linguistics (ACL). 2020. p. 6382-6391. (ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference).