VSS-Net: Visual Semantic Self-Mining Network for Video Summarization

Yunzuo Zhang*, Yameng Liu, Weili Kang, Ran Tao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)

Abstract

Video summarization, which aims to detect valuable segments in untrimmed videos, is a meaningful yet understudied topic. Previous methods primarily consider inter-frame and inter-shot temporal dependencies, which can be insufficient to pinpoint important content because of the limited valuable information they can learn. To address this limitation, we propose the Visual Semantic Self-mining Network (VSS-Net), a novel summarization framework motivated by the widespread success of cross-modality learning. VSS-Net first adopts a two-stream structure consisting of a Context Representation Graph (CRG) and a Video Semantics Encoder (VSE), which are jointly exploited to lay the groundwork for stronger content awareness. Specifically, the CRG is constructed with an edge-set strategy tailored to the hierarchical structure of videos, enriching visual features with local and non-local temporal cues from both temporal-order and visual-relationship perspectives. Meanwhile, by learning visual similarity across features, the VSE adaptively acquires an instructive video-level semantic representation of the input video in a coarse-to-fine manner. The two streams then converge in a Context-Semantics Interaction Layer (CSIL), which enables sophisticated information exchange between frame-level temporal cues and the video-level semantic representation, guaranteeing informative representations and boosting sensitivity to important segments. Finally, a prediction head predicts importance scores, followed by key shot selection. We evaluate the proposed framework on widely used benchmarks and demonstrate its effectiveness and superiority over state-of-the-art methods.
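The abstract's pipeline (graph-based context enrichment, video-level semantic pooling, context-semantics interaction, and a per-frame prediction head) can be sketched roughly as follows. This is a minimal illustrative stand-in, not the authors' implementation: all layer shapes, the neighbour window, the top-k similarity edges, and the gating used for the CSIL step are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_representation_graph(feats, window=2, k=3):
    """CRG stand-in: connect each frame to temporal neighbours
    (temporal-order edges) and to its k most similar frames
    (visual-relationship edges), then propagate features."""
    n = feats.shape[0]
    adj = np.zeros((n, n))
    for i in range(n):                       # local temporal-order edges
        lo, hi = max(0, i - window), min(n, i + window + 1)
        adj[i, lo:hi] = 1.0
    sim = feats @ feats.T                    # non-local similarity edges
    topk = np.argsort(sim, axis=1)[:, -k:]
    for i in range(n):
        adj[i, topk[i]] = 1.0
    adj /= adj.sum(axis=1, keepdims=True)    # row-normalise, then mix
    return adj @ feats

def video_semantics_encoder(feats):
    """VSE stand-in: similarity-weighted pooling of frame features
    into one video-level semantic vector."""
    centre = feats.mean(axis=0)
    w = softmax(feats @ centre)
    return (w[:, None] * feats).sum(axis=0)

def context_semantics_interaction(frame_feats, video_sem):
    """CSIL stand-in: gate each frame by its affinity to the
    video-level semantics and inject that semantic vector."""
    gate = softmax(frame_feats @ video_sem)
    return frame_feats + gate[:, None] * video_sem[None, :]

# Toy run: 10 frames with 8-dim features -> one importance score per frame.
frames = rng.normal(size=(10, 8))
ctx = context_representation_graph(frames)
sem = video_semantics_encoder(ctx)
fused = context_semantics_interaction(ctx, sem)
w_head = rng.normal(size=8)                  # toy prediction head
scores = 1.0 / (1.0 + np.exp(-(fused @ w_head)))
print(scores.shape)                          # (10,)
```

Key shot selection would then threshold or knapsack-select shots by these per-frame scores; that step is omitted here.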

Original language: English
Pages (from-to): 2775-2788
Number of pages: 14
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 4
DOIs
Publication status: Published - 1 Apr 2024

Keywords

  • Video summarization
  • information exchange
  • self-mining
  • semantic representation
  • temporal cues
