Skip to main navigation Skip to search Skip to main content

Weakly-Supervised Temporal Action Alignment Driven by Unbalanced Spectral Fused Gromov-Wasserstein Distance

  • Dixin Luo
  • , Yutong Wang
  • , Angxiao Yue
  • , Hongteng Xu*
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Temporal action alignment aims at segmenting videos into clips and tagging each clip with a textual description, which is an important task of video semantic analysis. Most existing methods, however, rely on supervised learning to train their alignment models, whose applications are limited because of the common insufficiency issue of labeled videos. To mitigate this issue, we propose a weakly-supervised temporal action alignment method based on a novel computational optimal transport technique called unbalanced spectral fused Gromov-Wasserstein (US-FGW) distance. Instead of using videos with known clips and corresponding textual tags, our method just needs each training video to be associated with a set of (unsorted) texts while does not require the fine-grained correspondence between the frames and the texts. Given such weakly-supervised video-text pairs, our method trains the representation models of the video frames and the texts jointly in a probabilistic or deterministic autoencoding architecture and penalizes the US-FGW distance between the distribution of visual latent codes and that of textual latent codes. We compute the US-FGW distance efficiently by leveraging the Bregman ADMM algorithm. Furthermore, we generalize classic contrastive learning framework and reformulate it based on the proposed US-FGW distance, which provides a new viewpoint of contrastive learning for our problem. Experimental results show that our method and its variants outperform state-of-the-art weakly-supervised temporal action alignment methods, whose results are even comparable to those derived by supervised learning methods on some specific evaluation measurements. The code is available at https://github.com/hhhh1138/Temporal-Action-Alignment-USFGW.

Original languageEnglish
Title of host publicationMM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
PublisherAssociation for Computing Machinery, Inc
Pages728-739
Number of pages12
ISBN (Electronic)9781450392037
DOIs
Publication statusPublished - 10 Oct 2022
Event30th ACM International Conference on Multimedia, MM 2022 - Lisboa, Portugal
Duration: 10 Oct 202214 Oct 2022

Publication series

NameMM 2022 - Proceedings of the 30th ACM International Conference on Multimedia

Conference

Conference30th ACM International Conference on Multimedia, MM 2022
Country/TerritoryPortugal
CityLisboa
Period10/10/2214/10/22

Keywords

  • autoencoding
  • computational optimal transport
  • contrastive learning
  • temporal action alignment
  • weakly-supervised learning

Fingerprint

Dive into the research topics of 'Weakly-Supervised Temporal Action Alignment Driven by Unbalanced Spectral Fused Gromov-Wasserstein Distance'. Together they form a unique fingerprint.

Cite this