Topic-aware video summarization using multimodal transformer

Yubo Zhu; Wentian Zhao; Rui Hua; Xinxiao Wu

doi:10.1016/j.patcog.2023.109578

Corresponding author: Xinxiao Wu

Research output: Contribution to journal › Article › peer-review

13 Citations (Scopus)

Abstract

Video summarization aims to generate a short, compact summary that represents the original video. Existing methods mainly focus on extracting a general, objective synopsis that precisely summarizes the video content. However, in real scenarios, a video usually contains rich content covering multiple topics, and viewers may be interested in different aspects of the visual content, even within the same video. In this paper, we propose a novel topic-aware video summarization task that generates multiple video summaries with different topics. To support the study of this new task, we first build a video benchmark dataset by collecting videos from various types of movies and annotating them with topic labels and frame-level importance scores. We then propose a multimodal Transformer model for topic-aware video summarization, which simultaneously predicts topic labels and generates topic-related summaries by adaptively fusing multimodal features extracted from the video. Experimental results show the effectiveness of our method.

Original language	English
Article number	109578
Journal	Pattern Recognition
Volume	140
DOIs	https://doi.org/10.1016/j.patcog.2023.109578
Publication status	Published - Aug 2023
Externally published	Yes

Keywords

Multimodal transformer
Topic-aware video summarization
Video summarization dataset

Cite this

@article{fc1c18b039524bfda4cc3b16e79616b9,
  title = "Topic-aware video summarization using multimodal transformer",
  abstract = "Video summarization aims to generate a short, compact summary that represents the original video. Existing methods mainly focus on extracting a general, objective synopsis that precisely summarizes the video content. However, in real scenarios, a video usually contains rich content covering multiple topics, and viewers may be interested in different aspects of the visual content, even within the same video. In this paper, we propose a novel topic-aware video summarization task that generates multiple video summaries with different topics. To support the study of this new task, we first build a video benchmark dataset by collecting videos from various types of movies and annotating them with topic labels and frame-level importance scores. We then propose a multimodal Transformer model for topic-aware video summarization, which simultaneously predicts topic labels and generates topic-related summaries by adaptively fusing multimodal features extracted from the video. Experimental results show the effectiveness of our method.",
  keywords = "Multimodal transformer, Topic-aware video summarization, Video summarization dataset",
  author = "Yubo Zhu and Wentian Zhao and Rui Hua and Xinxiao Wu",
  note = "Publisher Copyright: {\textcopyright} 2023 Elsevier Ltd",
  year = "2023",
  month = aug,
  doi = "10.1016/j.patcog.2023.109578",
  language = "English",
  volume = "140",
  journal = "Pattern Recognition",
  issn = "0031-3203",
  publisher = "Elsevier Ltd.",
}

TY  - JOUR
T1  - Topic-aware video summarization using multimodal transformer
AU  - Zhu, Yubo
AU  - Zhao, Wentian
AU  - Hua, Rui
AU  - Wu, Xinxiao
PY  - 2023/8
Y1  - 2023/8
AB  - Video summarization aims to generate a short, compact summary that represents the original video. Existing methods mainly focus on extracting a general, objective synopsis that precisely summarizes the video content. However, in real scenarios, a video usually contains rich content covering multiple topics, and viewers may be interested in different aspects of the visual content, even within the same video. In this paper, we propose a novel topic-aware video summarization task that generates multiple video summaries with different topics. To support the study of this new task, we first build a video benchmark dataset by collecting videos from various types of movies and annotating them with topic labels and frame-level importance scores. We then propose a multimodal Transformer model for topic-aware video summarization, which simultaneously predicts topic labels and generates topic-related summaries by adaptively fusing multimodal features extracted from the video. Experimental results show the effectiveness of our method.
KW  - Multimodal transformer
KW  - Topic-aware video summarization
KW  - Video summarization dataset
UR  - http://www.scopus.com/inward/record.url?scp=85151555885&partnerID=8YFLogxK
U2  - 10.1016/j.patcog.2023.109578
DO  - 10.1016/j.patcog.2023.109578
M3  - Article
AN  - SCOPUS:85151555885
SN  - 0031-3203
VL  - 140
JO  - Pattern Recognition
JF  - Pattern Recognition
M1  - 109578
ER  -
