DANet: Semi-supervised differentiated auxiliaries guided network for video action recognition

Guangyu Gao*, Ziming Liu, Guangjun Zhang, Jinyang Li, A. K. Qin

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

12 Citations (Scopus)

Abstract

Video Action Recognition (ViAR) aims to identify the category of the human action observed in a given video. With the advent of Deep Learning (DL) techniques, remarkable performance breakthroughs have been achieved in this field. However, the success of most existing DL-based ViAR methods relies heavily on the availability of large amounts of annotated data, i.e., videos labeled with their action categories. In practice, obtaining such a large number of annotations is often difficult due to expensive labeling costs, which can lead to significant performance degradation for these methods. To address this issue, we propose an end-to-end semi-supervised Differentiated Auxiliary guided Network (DANet) to make the best use of a small number of annotated videos. In addition to standard supervised learning on the annotated videos, DANet exploits the knowledge of multiple pre-trained auxiliary networks to optimize the ViAR network in a self-supervised way on the same videos with their annotations removed. Given the tight connection between video action recognition and classical static image-based visual tasks, the abundant knowledge in pre-trained static-image models can be used to train the ViAR model. Specifically, DANet adopts a two-branch architecture comprising a target branch (the ViAR network) and an auxiliary branch of multiple auxiliary networks (i.e., diverse off-the-shelf models for relevant image tasks). Given a limited number of annotated videos, we train the target ViAR network end-to-end in a semi-supervised manner, i.e., with both a supervised cross-entropy loss on the annotated videos and per-auxiliary weighted self-supervised contrastive losses on the same videos without using their annotations. We further explore different weightings of the auxiliary networks' guidance to better reflect the varying relationships between the image-based models and the ViAR model. Finally, we conduct extensive experiments on several popular action recognition benchmarks in comparison with existing state-of-the-art methods; the results demonstrate the superiority of DANet over most of the compared methods. In particular, DANet clearly surpasses state-of-the-art ViAR methods even when trained with far fewer annotated videos.
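The abstract describes a combined objective: a supervised cross-entropy loss on the annotated videos plus per-auxiliary weighted contrastive losses that align the target network's embeddings with those of frozen auxiliary networks on the unlabeled videos. Below is a minimal PyTorch-style sketch of such an objective. It is an illustration only, not the authors' implementation; in particular, the InfoNCE form of the contrastive term, the embedding-alignment setup, and the example weights are assumptions.

```python
# Minimal sketch (NOT the authors' code) of a DANet-style combined loss:
# supervised cross-entropy on annotated videos plus per-auxiliary weighted
# contrastive losses against frozen auxiliary networks on unlabeled videos.
import torch
import torch.nn.functional as F

def info_nce(z_target, z_aux, temperature=0.1):
    """InfoNCE contrastive loss (assumed form): matching (video, auxiliary)
    embedding pairs in a batch are positives; all other pairs are negatives."""
    z_target = F.normalize(z_target, dim=1)
    z_aux = F.normalize(z_aux, dim=1)
    logits = z_target @ z_aux.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(z_target.size(0))       # positives on the diagonal
    return F.cross_entropy(logits, labels)

def danet_loss(logits_labeled, labels, z_unlabeled, aux_embeddings, aux_weights):
    """Combined semi-supervised objective.

    logits_labeled : (B_l, C) class scores for annotated videos
    labels         : (B_l,)   action labels
    z_unlabeled    : (B_u, D) target-branch embeddings of unlabeled videos
    aux_embeddings : list of (B_u, D) embeddings from K frozen auxiliary nets
    aux_weights    : list of K scalars weighting each auxiliary's guidance
    """
    sup = F.cross_entropy(logits_labeled, labels)
    self_sup = sum(w * info_nce(z_unlabeled, z_k)
                   for w, z_k in zip(aux_weights, aux_embeddings))
    return sup + self_sup

# Toy usage: random tensors stand in for real network outputs.
B_l, B_u, C, D, K = 4, 8, 10, 128, 3
loss = danet_loss(
    torch.randn(B_l, C, requires_grad=True),   # target-branch class scores
    torch.randint(0, C, (B_l,)),               # action labels
    torch.randn(B_u, D, requires_grad=True),   # target-branch embeddings
    [torch.randn(B_u, D) for _ in range(K)],   # frozen auxiliary embeddings
    [0.5, 0.3, 0.2],                           # per-auxiliary weights (assumed values)
)
loss.backward()  # gradients flow only into the target branch, as the
                 # auxiliary embeddings carry no gradient (frozen networks)
```

In the paper, the per-auxiliary weights reflect the differentiated guidance of each image-based model toward the ViAR task; the fixed values above are placeholders for whatever weighting scheme is learned or chosen.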

Original language: English
Pages (from-to): 121-131
Number of pages: 11
Journal: Neural Networks
Volume: 158
Publication status: Published - Jan 2023

Keywords

  • Action recognition
  • Contrastive loss
  • Semi-supervised learning
  • Unannotated video

