VSS-Net: Visual Semantic Self-Mining Network for Video Summarization

Yunzuo Zhang; Yameng Liu; Weili Kang; Ran Tao

doi:10.1109/TCSVT.2023.3312325

Yunzuo Zhang*, Yameng Liu, Weili Kang, Ran Tao

*Corresponding author for this work

School of Information and Electronics

Shijiazhuang Tiedao University

Research output: Contribution to journal › Article › peer-review

7 Citations (Scopus)

Abstract

Video summarization, with the target to detect valuable segments given untrimmed videos, is a meaningful yet understudied topic. Previous methods primarily consider inter-frame and inter-shot temporal dependencies, which might be insufficient to pinpoint important content due to limited valuable information that can be learned. To address this limitation, we elaborate on a Visual Semantic Self-mining Network (VSS-Net), a novel summarization framework motivated by the widespread success of cross-modality learning tasks. VSS-Net initially adopts a two-stream structure consisting of a Context Representation Graph (CRG) and a Video Semantics Encoder (VSE). They are jointly exploited to establish the groundwork for further boosting the capability of content awareness. Specifically, CRG is constructed using an edge-set strategy tailored to the hierarchical structure of videos, enriching visual features with local and non-local temporal cues from temporal order and visual relationship perspectives. Meanwhile, by learning visual similarity across features, VSE adaptively acquires an instructive video-level semantic representation of the input video from coarse to fine. Subsequently, the two streams converge in a Context-Semantics Interaction Layer (CSIL) to achieve sophisticated information exchange across frame-level temporal cues and video-level semantic representation, guaranteeing informative representations and boosting the sensitivity to important segments. Eventually, importance scores are predicted utilizing a prediction head, followed by key shot selection. We evaluate the proposed framework and demonstrate its effectiveness and superiority against state-of-the-art methods on the widely used benchmarks.
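The abstract ends with importance scores being turned into a summary through key shot selection, but it does not say how the selection is done. A minimal sketch of one common approach in the video summarization literature is given here: treat the choice of shots as a 0/1 knapsack problem under a length budget. The function name `select_key_shots`, the mean-score shot value, and the 15% default budget are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def select_key_shots(
    frame_scores: Sequence[float],
    shot_bounds: Sequence[Tuple[int, int]],
    budget_ratio: float = 0.15,  # assumed budget; not stated in the abstract
) -> List[int]:
    """Return indices of shots chosen by a 0/1 knapsack over shot lengths.

    frame_scores: predicted importance score for each frame.
    shot_bounds:  (start, end) frame indices per shot, end exclusive.
    """
    capacity = int(len(frame_scores) * budget_ratio)
    lengths = [end - start for start, end in shot_bounds]
    # Assumed shot value: mean of the frame scores inside the shot.
    values = [
        sum(frame_scores[start:end]) / max(end - start, 1)
        for start, end in shot_bounds
    ]

    n = len(shot_bounds)
    # best[i][c] = best total value using the first i shots within capacity c.
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        weight, value = lengths[i - 1], values[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]
            if weight <= c and best[i - 1][c - weight] + value > best[i][c]:
                best[i][c] = best[i - 1][c - weight] + value

    # Backtrack: a changed table entry means shot i-1 was taken.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)
```

For example, with four five-frame shots whose frames score 0.1, 0.9, 0.2, and 0.8 respectively and a budget of half the video, the sketch keeps the second and fourth shots.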

Original language	English
Pages (from-to)	2775-2788
Number of pages	14
Journal	IEEE Transactions on Circuits and Systems for Video Technology
Volume	34
Issue number	4
DOI	https://doi.org/10.1109/TCSVT.2023.3312325
Publication status	Published - 1 Apr 2024

Access to Document

10.1109/TCSVT.2023.3312325

Other files and links

Link to publication in Scopus

Cite this

@article{b3554d75e9854aabbfec9dd8738b303f,

title = "VSS-Net: Visual Semantic Self-Mining Network for Video Summarization",

abstract = "Video summarization, with the target to detect valuable segments given untrimmed videos, is a meaningful yet understudied topic. Previous methods primarily consider inter-frame and inter-shot temporal dependencies, which might be insufficient to pinpoint important content due to limited valuable information that can be learned. To address this limitation, we elaborate on a Visual Semantic Self-mining Network (VSS-Net), a novel summarization framework motivated by the widespread success of cross-modality learning tasks. VSS-Net initially adopts a two-stream structure consisting of a Context Representation Graph (CRG) and a Video Semantics Encoder (VSE). They are jointly exploited to establish the groundwork for further boosting the capability of content awareness. Specifically, CRG is constructed using an edge-set strategy tailored to the hierarchical structure of videos, enriching visual features with local and non-local temporal cues from temporal order and visual relationship perspectives. Meanwhile, by learning visual similarity across features, VSE adaptively acquires an instructive video-level semantic representation of the input video from coarse to fine. Subsequently, the two streams converge in a Context-Semantics Interaction Layer (CSIL) to achieve sophisticated information exchange across frame-level temporal cues and video-level semantic representation, guaranteeing informative representations and boosting the sensitivity to important segments. Eventually, importance scores are predicted utilizing a prediction head, followed by key shot selection. We evaluate the proposed framework and demonstrate its effectiveness and superiority against state-of-the-art methods on the widely used benchmarks.",

keywords = "Video summarization, information exchange, self-mining, semantic representation, temporal cues",

author = "Yunzuo Zhang and Yameng Liu and Weili Kang and Ran Tao",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.",

year = "2024",

month = apr,

day = "1",

doi = "10.1109/TCSVT.2023.3312325",

language = "English",

volume = "34",

pages = "2775--2788",

journal = "IEEE Transactions on Circuits and Systems for Video Technology",

issn = "1051-8215",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "4",

}

TY - JOUR

T1 - VSS-Net

T2 - Visual Semantic Self-Mining Network for Video Summarization

AU - Zhang, Yunzuo

AU - Liu, Yameng

AU - Kang, Weili

AU - Tao, Ran

PY - 2024/4/1

Y1 - 2024/4/1

AB - Video summarization, with the target to detect valuable segments given untrimmed videos, is a meaningful yet understudied topic. Previous methods primarily consider inter-frame and inter-shot temporal dependencies, which might be insufficient to pinpoint important content due to limited valuable information that can be learned. To address this limitation, we elaborate on a Visual Semantic Self-mining Network (VSS-Net), a novel summarization framework motivated by the widespread success of cross-modality learning tasks. VSS-Net initially adopts a two-stream structure consisting of a Context Representation Graph (CRG) and a Video Semantics Encoder (VSE). They are jointly exploited to establish the groundwork for further boosting the capability of content awareness. Specifically, CRG is constructed using an edge-set strategy tailored to the hierarchical structure of videos, enriching visual features with local and non-local temporal cues from temporal order and visual relationship perspectives. Meanwhile, by learning visual similarity across features, VSE adaptively acquires an instructive video-level semantic representation of the input video from coarse to fine. Subsequently, the two streams converge in a Context-Semantics Interaction Layer (CSIL) to achieve sophisticated information exchange across frame-level temporal cues and video-level semantic representation, guaranteeing informative representations and boosting the sensitivity to important segments. Eventually, importance scores are predicted utilizing a prediction head, followed by key shot selection. We evaluate the proposed framework and demonstrate its effectiveness and superiority against state-of-the-art methods on the widely used benchmarks.

KW - Video summarization

KW - information exchange

KW - self-mining

KW - semantic representation

KW - temporal cues

UR - http://www.scopus.com/inward/record.url?scp=85171543260&partnerID=8YFLogxK

U2 - 10.1109/TCSVT.2023.3312325

DO - 10.1109/TCSVT.2023.3312325

M3 - Article

AN - SCOPUS:85171543260

SN - 1051-8215

VL - 34

SP - 2775

EP - 2788

JO - IEEE Transactions on Circuits and Systems for Video Technology

JF - IEEE Transactions on Circuits and Systems for Video Technology

IS - 4

ER -
