Abstract
Internet videos contain abundant meaningful information. The task of video captioning is to extract and understand the contents of a video and summarize them into a comprehensive description of one or more sentences. Research on video captioning involves challenges from both the video understanding and natural language generation areas. Among the technical obstacles confronting video captioning, one of the most critical issues undermining caption quality is that the model tends to generate fictional content, which is usually called the “hallucination” problem. In this paper, we present scene-graph guidance and interaction (SGI) to solve this problem. The SGI framework is composed of a faithful scene graph generation module and a multi-modal interactive network module. The scene graph generation module extracts a faithful scene graph from the video, which is then regarded as factual guidance for the text generator. The network module attends to and interacts the video features with the scene graph input, and generates a video caption covering the faithful video contents. On this basis, we further extend SGI to realize user intention-based controllable video captioning using elaborate scene graphs. We performed experiments on the Charades and ActivityNet Captions datasets; the SGI model achieved state-of-the-art performance on automatic metrics, demonstrating the high quality and strong controllability of its video captions.
Original language | English
---|---
Pages (from-to) | 797-809
Number of pages | 13
Journal | Multimedia Systems
Volume | 29
Issue number | 2
DOI |
Publication status | Published - Apr 2023