Self-supervised Video Summarization Guided by Semantic Inverse Optimal Transport

Yutong Wang; Hongteng Xu; Dixin Luo

doi:10.1145/3581783.3612087

Yutong Wang, Hongteng Xu, Dixin Luo^*

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

Video summarization is a critical task in video analysis that aims to create a brief yet informative summary of the original video (i.e., a set of keyframes) while retaining its primary content. Supervised summarization methods rely on time-consuming keyframe labeling and thus often suffer from insufficient training data. In contrast, the performance of unsupervised summarization methods is often unsatisfactory due to the lack of semantically meaningful guidance on keyframe selection. In this study, we propose a novel self-supervised video summarization framework based on computational optimal transport techniques. Specifically, we generate textual descriptions from video shots and learn the projection from the textual embeddings to the visual ones, together with an optimal transport plan between them, by solving an inverse optimal transport problem. We propose an alternating optimization algorithm to solve this problem efficiently and design an effective mechanism in the algorithm to avoid trivial solutions. Given the optimal transport plan and the underlying distance between the projected textual embeddings and the visual ones, we synthesize pseudo-significance scores for video frames and leverage the scores as offline supervision to train a keyframe selector. Without subjective and error-prone manual annotations, the proposed framework surpasses previous unsupervised methods in producing high-quality results for generic and instructional video summarization tasks, and its performance is even comparable to that of supervised competitors. The code is available at https://github.com/Dixin-s-Lab/Video-Summary-IOT.
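
As a rough illustration of the scoring step described in the abstract, the following is a minimal sketch and not the authors' implementation (which is in the repository linked above). It assumes a fixed projection matrix W instead of learning it by alternating optimization. It uses a standard balanced entropic (Sinkhorn) solver instead of the unbalanced formulation listed in the keywords. The scoring rule, which weights the transported mass by one minus the normalized cost, is also an assumption. The names sinkhorn_plan and pseudo_scores are hypothetical.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=200):
    """Balanced entropic OT plan between histograms a (rows) and b (columns)."""
    K = np.exp(-cost / eps)              # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # rescale to match column marginals
        u = a / (K @ v)                  # rescale to match row marginals
    return u[:, None] * K * v[None, :]

def pseudo_scores(text_emb, frame_emb, W, eps=0.1):
    """Illustrative per-frame scores from a plan and a distance (assumed rule)."""
    projected = text_emb @ W             # map text embeddings into visual space
    diff = projected[:, None, :] - frame_emb[None, :, :]
    cost = (diff ** 2).sum(axis=-1)      # squared Euclidean distance
    cost = cost / cost.max()             # normalize costs to [0, 1]
    n, m = cost.shape
    a = np.full(n, 1.0 / n)              # uniform mass on text descriptions
    b = np.full(m, 1.0 / m)              # uniform mass on frames
    plan = sinkhorn_plan(cost, a, b, eps)
    # Assumption: a frame scores high when it receives mass at low cost.
    scores = (plan * (1.0 - cost)).sum(axis=0)
    return plan, scores / scores.max()

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))       # 4 synthetic shot descriptions
frame_emb = rng.normal(size=(10, 6))     # 10 synthetic frame features
W = rng.normal(size=(8, 6))              # fixed stand-in for the learned projection
plan, scores = pseudo_scores(text_emb, frame_emb, W)
print(plan.shape, scores.round(3))
```

In the paper's setting, scores of this kind would serve as offline targets for training a keyframe selector; here they are computed on random data only to show the shapes involved.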

Original language	English
Title of host publication	MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
Publisher	Association for Computing Machinery, Inc
Pages	6611-6622
Number of pages	12
ISBN (Electronic)	9798400701085
DOIs	https://doi.org/10.1145/3581783.3612087
Publication status	Published - 26 Oct 2023
Event	31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada
Duration	29 Oct 2023 → 3 Nov 2023

Publication series

Name	MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

Conference

Conference	31st ACM International Conference on Multimedia, MM 2023
Country/Territory	Canada
City	Ottawa
Period	29/10/23 → 3/11/23

Keywords

inverse optimal transport
self-supervised learning
semantic alignment
unbalanced Wasserstein distance
video summarization

Cite this

Wang, Y., Xu, H., & Luo, D. (2023). Self-supervised Video Summarization Guided by Semantic Inverse Optimal Transport. In MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia (pp. 6611-6622). (MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia). Association for Computing Machinery, Inc. https://doi.org/10.1145/3581783.3612087

@inproceedings{f0d3af185325456e9bd4c50055c12996,
title = "Self-supervised Video Summarization Guided by Semantic Inverse Optimal Transport",
abstract = "Video summarization is a critical task in video analysis that aims to create a brief yet informative summary of the original video (i.e., a set of keyframes) while retaining its primary content. Supervised summarization methods rely on time-consuming keyframe labeling and thus often suffer from insufficient training data. In contrast, the performance of unsupervised summarization methods is often unsatisfactory due to the lack of semantically meaningful guidance on keyframe selection. In this study, we propose a novel self-supervised video summarization framework based on computational optimal transport techniques. Specifically, we generate textual descriptions from video shots and learn the projection from the textual embeddings to the visual ones, together with an optimal transport plan between them, by solving an inverse optimal transport problem. We propose an alternating optimization algorithm to solve this problem efficiently and design an effective mechanism in the algorithm to avoid trivial solutions. Given the optimal transport plan and the underlying distance between the projected textual embeddings and the visual ones, we synthesize pseudo-significance scores for video frames and leverage the scores as offline supervision to train a keyframe selector. Without subjective and error-prone manual annotations, the proposed framework surpasses previous unsupervised methods in producing high-quality results for generic and instructional video summarization tasks, and its performance is even comparable to that of supervised competitors. The code is available at https://github.com/Dixin-s-Lab/Video-Summary-IOT.",
keywords = "inverse optimal transport, self-supervised learning, semantic alignment, unbalanced Wasserstein distance, video summarization",
author = "Yutong Wang and Hongteng Xu and Dixin Luo",
note = "Publisher Copyright: {\textcopyright} 2023 ACM.; 31st ACM International Conference on Multimedia, MM 2023; Conference date: 29-10-2023 Through 03-11-2023",
year = "2023",
month = oct,
day = "26",
doi = "10.1145/3581783.3612087",
language = "English",
series = "MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia",
publisher = "Association for Computing Machinery, Inc",
pages = "6611--6622",
booktitle = "MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia",
}

Wang, Y, Xu, H & Luo, D 2023, Self-supervised Video Summarization Guided by Semantic Inverse Optimal Transport. in MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia. MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia, Association for Computing Machinery, Inc, pp. 6611-6622, 31st ACM International Conference on Multimedia, MM 2023, Ottawa, Canada, 29/10/23. https://doi.org/10.1145/3581783.3612087

Self-supervised Video Summarization Guided by Semantic Inverse Optimal Transport. / Wang, Yutong; Xu, Hongteng; Luo, Dixin.
MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia. Association for Computing Machinery, Inc, 2023. p. 6611-6622 (MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia).

TY  - GEN
T1  - Self-supervised Video Summarization Guided by Semantic Inverse Optimal Transport
AU  - Wang, Yutong
AU  - Xu, Hongteng
AU  - Luo, Dixin
PY  - 2023/10/26
Y1  - 2023/10/26
AB  - Video summarization is a critical task in video analysis that aims to create a brief yet informative summary of the original video (i.e., a set of keyframes) while retaining its primary content. Supervised summarization methods rely on time-consuming keyframe labeling and thus often suffer from insufficient training data. In contrast, the performance of unsupervised summarization methods is often unsatisfactory due to the lack of semantically meaningful guidance on keyframe selection. In this study, we propose a novel self-supervised video summarization framework based on computational optimal transport techniques. Specifically, we generate textual descriptions from video shots and learn the projection from the textual embeddings to the visual ones, together with an optimal transport plan between them, by solving an inverse optimal transport problem. We propose an alternating optimization algorithm to solve this problem efficiently and design an effective mechanism in the algorithm to avoid trivial solutions. Given the optimal transport plan and the underlying distance between the projected textual embeddings and the visual ones, we synthesize pseudo-significance scores for video frames and leverage the scores as offline supervision to train a keyframe selector. Without subjective and error-prone manual annotations, the proposed framework surpasses previous unsupervised methods in producing high-quality results for generic and instructional video summarization tasks, and its performance is even comparable to that of supervised competitors. The code is available at https://github.com/Dixin-s-Lab/Video-Summary-IOT.
KW  - inverse optimal transport
KW  - self-supervised learning
KW  - semantic alignment
KW  - unbalanced Wasserstein distance
KW  - video summarization
UR  - http://www.scopus.com/inward/record.url?scp=85179547980&partnerID=8YFLogxK
U2  - 10.1145/3581783.3612087
DO  - 10.1145/3581783.3612087
M3  - Conference contribution
AN  - SCOPUS:85179547980
T3  - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
SP  - 6611
EP  - 6622
BT  - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PB  - Association for Computing Machinery, Inc
T2  - 31st ACM International Conference on Multimedia, MM 2023
Y2  - 29 October 2023 through 3 November 2023
ER  - 
