Spatio-Temporal Contrastive Learning for Compositional Action Recognition

Yezi Gong, Mingtao Pei*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Compositional action recognition is an important task in video understanding, but static bias severely limits how well models generalize. Existing models often rely too heavily on sensitive features in videos, such as object appearance and background morphology, instead of fully exploiting the true temporal features of the action, which leads to recognition errors on novel object-action combinations. To address this issue, this paper proposes a framework for compositional action recognition that uses spatio-temporal contrastive learning to construct a three-branch architecture separating appearance features from spatio-temporal features at the feature extraction stage. Through contrastive learning, the model is encouraged to contrast features that predict factual probabilities with features that predict biased probabilities, which reduces both direct and indirect reliance on sensitive features and improves recognition accuracy and generalization. Experimental results show that the method achieves state-of-the-art performance on the Something-Else dataset, validating its effectiveness for compositional action recognition. It also achieves results comparable or superior to state-of-the-art methods on standard action recognition datasets such as Something-Something-V2, UCF101, and HMDB51.
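
The abstract does not give the loss function or the branch definitions, so the following is only a minimal sketch of a generic InfoNCE-style contrastive loss, not the authors' implementation. It assumes, purely for illustration, that a spatio-temporal feature serves as the anchor, a second spatio-temporal feature of the same action serves as the positive, and appearance-branch (bias-prone) features serve as negatives. The names `l2_normalize`, `info_nce`, and `temperature` are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for a single anchor (illustrative assumption).

    anchor:    (d,)   e.g. a spatio-temporal feature
    positive:  (d,)   e.g. another spatio-temporal feature of the same action
    negatives: (n, d) e.g. appearance-branch features to be pushed away
    """
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    n = l2_normalize(negatives)
    # Cosine similarities scaled by temperature; index 0 holds the positive pair.
    logits = np.concatenate(([a @ p], n @ a)) / temperature
    logits = logits - logits.max()  # shift for numerical stability
    # Negative log-softmax probability of the positive pair.
    return float(np.log(np.exp(logits).sum()) - logits[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    motion = rng.normal(size=16)                        # hypothetical anchor
    motion_aug = motion + 0.05 * rng.normal(size=16)    # hypothetical positive
    appearance = rng.normal(size=(4, 16))               # hypothetical negatives
    print("loss:", info_nce(motion, motion_aug, appearance))
```

The loss is small when the anchor is closer to the positive than to every negative, and grows when a negative is the closest feature. How the paper's three branches map onto anchors, positives, and negatives is not stated in the abstract.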

Original language: English
Title of host publication: Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
Editors: Zhouchen Lin, Hongbin Zha, Ming-Ming Cheng, Ran He, Cheng-Lin Liu, Kurban Ubul, Wushouer Silamu, Jie Zhou
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 424-438
Number of pages: 15
ISBN (Print): 9789819785100
Publication status: Published - 2025
Event: 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024 - Urumqi, China
Duration: 18 Oct 2024 to 20 Oct 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15037 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
Country/Territory: China
City: Urumqi
Period: 18/10/24 to 20/10/24

Keywords

  • Compositional action recognition
  • Contrastive learning
  • Video understanding
