融合语义信息的视频摘要生成

Rui Hua; Xinxiao Wu; Wentian Zhao

doi:10.13700/j.bh.1001-5965.2020.0447

Translated title of the contribution: Video summarization by learning semantic information

Rui Hua, Xinxiao Wu^*, Wentian Zhao

^*Corresponding author for this work

School of Computer Science and Technology

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Video summarization aims to generate a short, compact summary that represents the original video. However, existing methods focus mainly on representativeness and diversity and pay less attention to semantic information. To fully exploit the semantic information in video content, we propose a novel video summarization model that learns a visual-semantic embedding space so that the video features carry rich semantic information. The model can simultaneously generate a video summary and a text summary that describes the original video. It consists of three modules: a frame-level score weighting module that combines convolutional and fully connected layers; a visual-semantic embedding module that embeds the video and text in a common embedding space and draws them close to each other so that the two features reinforce one another; and a video caption generation module that produces a video summary with semantic information by minimizing the distance between the generated description of the video summary and the manually annotated text of the original video. At test time, the model yields a short text summary as a by-product of the video summary, which helps people understand the video content more intuitively. Experiments on the SumMe and TVSum datasets show that, by fusing semantic information, the proposed model outperforms existing advanced methods, improving the F-score by 0.5% and 1.6%, respectively.
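The abstract reports results as F-scores but does not spell out the evaluation protocol. As a minimal sketch, assuming the common overlap-based definition in which a predicted summary and a reference summary are both binary per-frame selections, the metric could be computed as follows. The function name `f_score` and the toy sequences are illustrative assumptions, not taken from the paper.

```python
def f_score(predicted, reference):
    """Overlap-based F-score between two binary frame-selection sequences.

    Both arguments are equal-length sequences of 0/1 flags, where 1 means
    the frame is included in the summary. This is an assumed, simplified
    protocol, not necessarily the one used by the authors.
    """
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")

    # Frames selected by both the predicted and the reference summary.
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    n_pred = sum(1 for p in predicted if p)
    n_ref = sum(1 for r in reference if r)

    # No shared frames (or an empty summary) gives an F-score of zero.
    if overlap == 0:
        return 0.0

    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = [1, 1, 0, 0, 1, 0]  # hypothetical model output
    reference = [1, 0, 0, 1, 1, 0]  # hypothetical human annotation
    print(f_score(predicted, reference))
```

Benchmarks with several annotators per video typically aggregate this per-annotator score (for example by averaging or taking the maximum); which aggregation applies here is not stated in the abstract.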

Translated title of the contribution	Video summarization by learning semantic information
Original language	Chinese (Traditional)
Pages (from-to)	650-657
Number of pages	8
Journal	Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics
Volume	47
Issue number	3
DOIs	https://doi.org/10.13700/j.bh.1001-5965.2020.0447
Publication status	Published - Mar 2021

Cite this

@article{0ed44b2bf59e40988306845e5528e870,

title = "融合语义信息的视频摘要生成",

abstract = "Video summarization aims to generate a short, compact summary that represents the original video. However, existing methods focus mainly on representativeness and diversity and pay less attention to semantic information. To fully exploit the semantic information in video content, we propose a novel video summarization model that learns a visual-semantic embedding space so that the video features carry rich semantic information. The model can simultaneously generate a video summary and a text summary that describes the original video. It consists of three modules: a frame-level score weighting module that combines convolutional and fully connected layers; a visual-semantic embedding module that embeds the video and text in a common embedding space and draws them close to each other so that the two features reinforce one another; and a video caption generation module that produces a video summary with semantic information by minimizing the distance between the generated description of the video summary and the manually annotated text of the original video. At test time, the model yields a short text summary as a by-product of the video summary, which helps people understand the video content more intuitively. Experiments on the SumMe and TVSum datasets show that, by fusing semantic information, the proposed model outperforms existing advanced methods, improving the F-score by 0.5\% and 1.6\%, respectively.",

keywords = "Long Short-Term Memory (LSTM) model, Video captioning, Video key frame, Video summarization, Visual-semantic embedding space",

author = "Rui Hua and Xinxiao Wu and Wentian Zhao",

year = "2021",

month = mar,

doi = "10.13700/j.bh.1001-5965.2020.0447",

language = "Chinese (Traditional)",

volume = "47",

pages = "650--657",

journal = "Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics",

issn = "1001-5965",

publisher = "Beijing University of Aeronautics and Astronautics (BUAA)",

number = "3",

}

TY - JOUR

T1 - 融合语义信息的视频摘要生成

AU - Hua, Rui

AU - Wu, Xinxiao

AU - Zhao, Wentian

PY - 2021/3

Y1 - 2021/3

N2 - Video summarization aims to generate a short, compact summary that represents the original video. However, existing methods focus mainly on representativeness and diversity and pay less attention to semantic information. To fully exploit the semantic information in video content, we propose a novel video summarization model that learns a visual-semantic embedding space so that the video features carry rich semantic information. The model can simultaneously generate a video summary and a text summary that describes the original video. It consists of three modules: a frame-level score weighting module that combines convolutional and fully connected layers; a visual-semantic embedding module that embeds the video and text in a common embedding space and draws them close to each other so that the two features reinforce one another; and a video caption generation module that produces a video summary with semantic information by minimizing the distance between the generated description of the video summary and the manually annotated text of the original video. At test time, the model yields a short text summary as a by-product of the video summary, which helps people understand the video content more intuitively. Experiments on the SumMe and TVSum datasets show that, by fusing semantic information, the proposed model outperforms existing advanced methods, improving the F-score by 0.5% and 1.6%, respectively.

AB - Video summarization aims to generate a short, compact summary that represents the original video. However, existing methods focus mainly on representativeness and diversity and pay less attention to semantic information. To fully exploit the semantic information in video content, we propose a novel video summarization model that learns a visual-semantic embedding space so that the video features carry rich semantic information. The model can simultaneously generate a video summary and a text summary that describes the original video. It consists of three modules: a frame-level score weighting module that combines convolutional and fully connected layers; a visual-semantic embedding module that embeds the video and text in a common embedding space and draws them close to each other so that the two features reinforce one another; and a video caption generation module that produces a video summary with semantic information by minimizing the distance between the generated description of the video summary and the manually annotated text of the original video. At test time, the model yields a short text summary as a by-product of the video summary, which helps people understand the video content more intuitively. Experiments on the SumMe and TVSum datasets show that, by fusing semantic information, the proposed model outperforms existing advanced methods, improving the F-score by 0.5% and 1.6%, respectively.

KW - Long Short-Term Memory (LSTM) model

KW - Video captioning

KW - Video key frame

KW - Video summarization

KW - Visual-semantic embedding space

UR - http://www.scopus.com/inward/record.url?scp=85104306618&partnerID=8YFLogxK

U2 - 10.13700/j.bh.1001-5965.2020.0447

DO - 10.13700/j.bh.1001-5965.2020.0447

M3 - Article

AN - SCOPUS:85104306618

SN - 1001-5965

VL - 47

SP - 650

EP - 657

JO - Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics

JF - Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics

IS - 3

ER -
