Self-supervised Video Summarization Guided by Semantic Inverse Optimal Transport

Yutong Wang, Hongteng Xu, Dixin Luo*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

2 Citations (Scopus)

Abstract

Video summarization is a critical task in video analysis that aims to create a brief yet informative summary of the original video (i.e., a set of keyframes) while retaining its primary content. Supervised summarization methods rely on time-consuming keyframe labeling and thus often suffer from the insufficiency issue of training data. In contrast, the performance of unsupervised summarization methods is often unsatisfactory due to the lack of semantically-meaningful guidance on the keyframe selection. In this study, we propose a novel self-supervised video summarization framework with the help of computational optimal transport techniques. Specifically, we generate textual descriptions from video shots and learn the projection from the textual embeddings to the visual ones together with an optimal transport plan between them via solving an inverse optimal transport problem. We propose an alternating optimization algorithm to solve this problem efficiently and design an effective mechanism in the algorithm to avoid trivial solutions. Given the optimal transport plan and the underlying distance between the projected textual embeddings and the visual ones, we synthesize pseudo-significance scores for video frames and leverage the scores as offline supervision to train a keyframe selector. Without subjective and error-prone manual annotations, the proposed framework surpasses previous unsupervised methods in producing high-quality results for generic and instructional video summarization tasks, whose performance even is comparable to those supervised competitors. The code is available at https://github.com/Dixin-s-Lab/Video-Summary-IOT.

Original languageEnglish
Title of host publicationMM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PublisherAssociation for Computing Machinery, Inc
Pages6611-6622
Number of pages12
ISBN (Electronic)9798400701085
DOIs
Publication statusPublished - 26 Oct 2023
Event31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada
Duration: 29 Oct 20233 Nov 2023

Publication series

NameMM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

Conference

Conference31st ACM International Conference on Multimedia, MM 2023
Country/TerritoryCanada
CityOttawa
Period29/10/233/11/23

Keywords

  • inverse optimal transport
  • self-supervised learning
  • semantic alignment
  • unbalanced wasserstein distance
  • video summarization

Fingerprint

Dive into the research topics of 'Self-supervised Video Summarization Guided by Semantic Inverse Optimal Transport'. Together they form a unique fingerprint.

Cite this

Wang, Y., Xu, H., & Luo, D. (2023). Self-supervised Video Summarization Guided by Semantic Inverse Optimal Transport. In MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia (pp. 6611-6622). (MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia). Association for Computing Machinery, Inc. https://doi.org/10.1145/3581783.3612087