Abstract
Internet videos contain abundant meaningful information. The task of video captioning is to extract and understand the contents of a video and summarize them into a comprehensive description of one or more sentences. Research on video captioning involves challenges from both the video understanding and natural language generation areas. Among the technical obstacles confronting video captioning, one of the most critical issues undermining caption quality is that the model tends to generate fictional content, which is usually called the “hallucination” problem. In this paper, we present scene-graph guidance and interaction (SGI) to solve this problem. The SGI framework is composed of a faithful scene graph generation module and a multi-modal interactive network module. The scene graph generation module extracts a faithful scene graph from the video, which is then regarded as factual guidance for the text generator. The network module attends to and interacts the video features with the scene graph input, and generates a video caption covering the faithful video contents. On this basis, we further extend SGI to realize user intention-based controllable video captioning using elaborate scene graphs. We performed experiments on the Charades and ActivityNet Captions datasets; the SGI model achieved state-of-the-art performance on automatic metrics, demonstrating the high quality and strong controllability of its video captions.
Original language | English
---|---
Pages (from-to) | 797-809
Number of pages | 13
Journal | Multimedia Systems
Volume | 29
Issue number | 2
DOI |
Publication status | Published - Apr 2023