TY - GEN
T1 - Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning
AU - Shi, Botian
AU - Ji, Lei
AU - Niu, Zhendong
AU - Duan, Nan
AU - Zhou, Ming
AU - Chen, Xilin
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/10/12
Y1 - 2020/10/12
N2 - Video captioning is a fundamental task for visual understanding. Previous works employ end-to-end networks to learn from low-level visual features and generate descriptive captions, but such approaches struggle to recognize fine-grained objects and lack an understanding of crucial semantic concepts. According to DPC [19], these concepts are generally present in the narrative transcripts of instructional videos, and incorporating the transcript together with the video can improve captioning performance. However, DPC directly concatenates the transcript embedding with video features, which fails to fuse language and vision features effectively and leads to temporal misalignment between transcript and video. This motivates us to 1) learn semantic concepts explicitly and 2) design a temporal alignment mechanism to better align the video and transcript for the captioning task. In this paper, we start with an encoder-decoder backbone built on transformer models. First, we design a semantic concept prediction module as a multi-task objective to train the encoder in a supervised way. Then, we develop an attention-based cross-modality temporal alignment method that combines the sequential video frames and transcript sentences. Finally, we adopt a copy mechanism that enables the decoder (generation) module to copy important concepts directly from the source transcript. Extensive experimental results demonstrate the effectiveness of our model, which achieves state-of-the-art results on the YouCookII dataset.
AB - Video captioning is a fundamental task for visual understanding. Previous works employ end-to-end networks to learn from low-level visual features and generate descriptive captions, but such approaches struggle to recognize fine-grained objects and lack an understanding of crucial semantic concepts. According to DPC [19], these concepts are generally present in the narrative transcripts of instructional videos, and incorporating the transcript together with the video can improve captioning performance. However, DPC directly concatenates the transcript embedding with video features, which fails to fuse language and vision features effectively and leads to temporal misalignment between transcript and video. This motivates us to 1) learn semantic concepts explicitly and 2) design a temporal alignment mechanism to better align the video and transcript for the captioning task. In this paper, we start with an encoder-decoder backbone built on transformer models. First, we design a semantic concept prediction module as a multi-task objective to train the encoder in a supervised way. Then, we develop an attention-based cross-modality temporal alignment method that combines the sequential video frames and transcript sentences. Finally, we adopt a copy mechanism that enables the decoder (generation) module to copy important concepts directly from the source transcript. Extensive experimental results demonstrate the effectiveness of our model, which achieves state-of-the-art results on the YouCookII dataset.
KW - semantic concept
KW - video captioning
KW - video summarization
UR - http://www.scopus.com/inward/record.url?scp=85096362942&partnerID=8YFLogxK
U2 - 10.1145/3394171.3413498
DO - 10.1145/3394171.3413498
M3 - Conference contribution
AN - SCOPUS:85096362942
T3 - MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia
SP - 4355
EP - 4363
BT - MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 28th ACM International Conference on Multimedia, MM 2020
Y2 - 12 October 2020 through 16 October 2020
ER -