TY - JOUR
T1 - Video summarization fusing semantic information
AU - Hua, Rui
AU - Wu, Xinxiao
AU - Zhao, Wentian
N1 - Publisher Copyright:
© 2021, Editorial Board of JBUAA. All rights reserved.
PY - 2021/3
Y1 - 2021/3
N2 - Video summarization aims to generate a short, compact summary that represents the original video. However, existing methods focus on the representativeness and diversity of the selected frames and pay less attention to semantic information. To fully exploit the semantic information of video content, we propose a novel video summarization model that learns a visual-semantic embedding space so that the video features contain rich semantic information. The model can simultaneously generate a video summary and a text summary describing the original video. It consists of three modules: a frame-level score weighting module that combines convolutional layers and fully connected layers; a visual-semantic embedding module that embeds the video and text in a common embedding space and draws them close to each other so that the two kinds of features promote each other; and a video caption generation module that generates a semantically informative video summary by minimizing the distance between the generated description of the video summary and the manually annotated text of the original video. At test time, a short text summary is obtained as a by-product alongside the video summary, helping people understand the video content more intuitively. Experiments on the SumMe and TVSum datasets show that, by fusing semantic information, the proposed model outperforms existing state-of-the-art methods, improving F-score by 0.5% and 1.6%, respectively.
KW - Long Short-Term Memory (LSTM) model
KW - Video captioning
KW - Video key frame
KW - Video summarization
KW - Visual-semantic embedding space
UR - http://www.scopus.com/inward/record.url?scp=85104306618&partnerID=8YFLogxK
U2 - 10.13700/j.bh.1001-5965.2020.0447
DO - 10.13700/j.bh.1001-5965.2020.0447
M3 - Article
AN - SCOPUS:85104306618
SN - 1001-5965
VL - 47
SP - 650
EP - 657
JO - Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics
JF - Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics
IS - 3
ER -