SHTVS: Shot-level based Hierarchical Transformer for Video Summarization

Yubo An, Shenghui Zhao*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

In this paper, a Shot-level based Hierarchical Transformer for Video Summarization (SHTVS) is proposed for supervised video summarization. Unlike most existing methods, which employ bidirectional long short-term memory or use self-attention to replace certain components while keeping their overall structure in place, our method shows that a pure Transformer with video feature sequences as its input can achieve competitive performance in video summarization. In addition, to make better use of the multi-shot characteristic of a video, each video feature sequence is first split into shot-level feature sequences with kernel temporal segmentation, which are then fed into a shot-level Transformer encoder to learn shot-level representations. Finally, the shot-level representations and the original video feature sequence are integrated as input to the frame-level Transformer encoder, which predicts frame-level importance scores. Extensive experimental results on two benchmark datasets (SumMe and TVSum) demonstrate the effectiveness of our method.
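
The data flow described in the abstract can be sketched roughly as follows. This is an illustrative toy and not the authors' implementation: the shot boundaries are assumed to be already computed (for example by kernel temporal segmentation), each encoder is reduced to one single-head self-attention layer with random weights, and the mean pooling of each shot and the additive fusion with the frame features are assumptions, since the abstract does not say how the two levels are integrated. All function and variable names are hypothetical.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with a residual connection."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ v

def make_params(dim, rng):
    """Random query/key/value projections standing in for trained weights."""
    return [rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3)]

def hierarchical_scores(frames, shot_bounds, shot_params, frame_params, w_out):
    """frames: (T, D) features; shot_bounds: half-open (start, end) pairs covering 0..T."""
    shot_context = np.empty_like(frames)
    for start, end in shot_bounds:
        # Shot-level encoder: attention restricted to the frames of one shot.
        encoded = self_attention(frames[start:end], *shot_params)
        # Assumption: one vector per shot via mean pooling, copied to its frames.
        shot_context[start:end] = encoded.mean(axis=0)
    # Assumption: shot-level and frame-level information are fused by addition.
    fused = frames + shot_context
    # Frame-level encoder: attention over the whole video.
    hidden = self_attention(fused, *frame_params)
    # Linear head plus sigmoid gives one importance score per frame.
    return 1.0 / (1.0 + np.exp(-(hidden @ w_out)))

rng = np.random.default_rng(0)
T, D = 12, 8                               # toy sizes: 12 frames, 8-dim features
frames = rng.normal(size=(T, D))
shot_bounds = [(0, 5), (5, 9), (9, 12)]    # hypothetical segmentation output
scores = hierarchical_scores(
    frames, shot_bounds, make_params(D, rng), make_params(D, rng), rng.normal(size=D)
)
print(scores.shape)
```

In a supervised setting such scores would be trained against annotated frame-level importance; the weights here are random, so only the shapes and the two-level flow are meaningful.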

Original language: English
Title of host publication: ICIGP 2022 - Proceedings of the 2022 5th International Conference on Image and Graphics Processing
Publisher: Association for Computing Machinery
Pages: 268-274
Number of pages: 7
ISBN (Electronic): 9781450395465
Publication status: Published - 7 Jan 2022
Event: 5th International Conference on Image and Graphics Processing, ICIGP 2022 - Virtual, Online, China
Duration: 7 Jan 2022 to 9 Jan 2022

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 5th International Conference on Image and Graphics Processing, ICIGP 2022
Country/Territory: China
City: Virtual, Online
Period: 7/01/22 to 9/01/22

Keywords

  • Hierarchical Transformer
  • Sequence labeling
  • Shot-level
  • Video summarization
