TY - GEN
T1 - Spatio-Temporal Contrastive Learning for Compositional Action Recognition
AU - Gong, Yezi
AU - Pei, Mingtao
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
N2 - Compositional action recognition is an important task in video understanding; however, static bias severely limits the generalization capability of models. Existing models often rely too heavily on sensitive features in videos, such as object appearance and background morphology, without fully exploiting true temporal action features, leading to recognition errors when faced with novel object-action combinations. To address this issue, this paper proposes a framework for compositional action recognition that uses spatio-temporal contrastive learning to construct a three-branch architecture, distinguishing appearance and spatio-temporal features at the feature extraction stage. Through contrastive learning, the model is encouraged to contrast features that predict factual probabilities with those that predict biased probabilities, thereby reducing both direct and indirect reliance on sensitive features and enhancing recognition accuracy and generalization. Experimental results show that this method achieves state-of-the-art performance on the Something-Else dataset, validating its effectiveness for compositional action recognition. Furthermore, it achieves results comparable or superior to state-of-the-art methods on standard action recognition datasets such as Something-Something-V2, UCF101, and HMDB51.
AB - Compositional action recognition is an important task in video understanding; however, static bias severely limits the generalization capability of models. Existing models often rely too heavily on sensitive features in videos, such as object appearance and background morphology, without fully exploiting true temporal action features, leading to recognition errors when faced with novel object-action combinations. To address this issue, this paper proposes a framework for compositional action recognition that uses spatio-temporal contrastive learning to construct a three-branch architecture, distinguishing appearance and spatio-temporal features at the feature extraction stage. Through contrastive learning, the model is encouraged to contrast features that predict factual probabilities with those that predict biased probabilities, thereby reducing both direct and indirect reliance on sensitive features and enhancing recognition accuracy and generalization. Experimental results show that this method achieves state-of-the-art performance on the Something-Else dataset, validating its effectiveness for compositional action recognition. Furthermore, it achieves results comparable or superior to state-of-the-art methods on standard action recognition datasets such as Something-Something-V2, UCF101, and HMDB51.
KW - Compositional action recognition
KW - Contrastive learning
KW - Video understanding
UR - http://www.scopus.com/inward/record.url?scp=85209185811&partnerID=8YFLogxK
U2 - 10.1007/978-981-97-8511-7_30
DO - 10.1007/978-981-97-8511-7_30
M3 - Conference contribution
AN - SCOPUS:85209185811
SN - 9789819785100
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 424
EP - 438
BT - Pattern Recognition and Computer Vision - 7th Chinese Conference, PRCV 2024, Proceedings
A2 - Lin, Zhouchen
A2 - Zha, Hongbin
A2 - Cheng, Ming-Ming
A2 - He, Ran
A2 - Liu, Cheng-Lin
A2 - Ubul, Kurban
A2 - Silamu, Wushouer
A2 - Zhou, Jie
PB - Springer Science and Business Media Deutschland GmbH
T2 - 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024
Y2 - 18 October 2024 through 20 October 2024
ER -