DANet: Semi-supervised differentiated auxiliaries guided network for video action recognition

Guangyu Gao*, Ziming Liu, Guangjun Zhang, Jinyang Li, A. K. Qin

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

7 Citations (Scopus)

Abstract

Video Action Recognition (ViAR) aims to identify the category of the human action observed in a given video. With the advent of Deep Learning (DL) techniques, noticeable performance breakthroughs have been achieved in this field. However, the success of most existing DL-based ViAR methods relies heavily on the availability of large amounts of annotated data, i.e., videos with corresponding action categories. In practice, obtaining the desired number of annotations is often difficult due to expensive labeling costs, which can lead to significant performance degradation for these methods. To address this issue, we propose an end-to-end semi-supervised Differentiated Auxiliary guided Network (DANet) to make the best use of a few annotated videos. In addition to common supervised learning on the few annotated videos, DANet also leverages the knowledge of multiple pre-trained auxiliary networks to optimize the ViAR network in a self-supervised way on the same data with the annotations removed. Given the tight connection between video action recognition and classical static image-based visual tasks, the abundant knowledge in pre-trained static image-based models can be used to train the ViAR model. Specifically, DANet is a two-branch architecture comprising a target branch of the ViAR network and an auxiliary branch of multiple auxiliary networks (i.e., diverse off-the-shelf models for relevant image tasks). Given a limited number of annotated videos, we train the target ViAR network end-to-end in a semi-supervised way, namely, with both the supervised cross-entropy loss on annotated videos and per-auxiliary weighted self-supervised contrastive losses on the same videos without using their annotations. In addition, we explore different weighted guidance from the auxiliary networks to the ViAR network to better reflect the different relationships between the image-based models and the ViAR model. Finally, we conduct extensive experiments on several popular action recognition benchmarks in comparison with existing state-of-the-art methods, and the results demonstrate the superiority of DANet over most of the compared methods. In particular, DANet clearly surpasses state-of-the-art ViAR methods even with far fewer annotated videos.
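To make the described objective concrete, below is a minimal PyTorch-style sketch of the training loss as characterized in the abstract: a supervised cross-entropy term on annotated clips plus per-auxiliary weighted contrastive terms against frozen pre-trained auxiliary networks. All names (`viar_net`, `aux_nets`, `aux_weights`), the assumption that the ViAR network returns both logits and an embedding, the shared embedding dimension, and the InfoNCE formulation are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of DANet's semi-supervised objective.
# Assumptions (not from the paper): viar_net(x) -> (logits, embedding);
# each frozen auxiliary projects clips to the same embedding dim D;
# the self-supervised loss is standard InfoNCE.
import torch
import torch.nn.functional as F

def info_nce(z_target, z_aux, temperature=0.07):
    """InfoNCE between target and auxiliary embeddings: matching
    (clip_i, auxiliary_i) pairs are positives, other in-batch pairs
    are negatives."""
    z_t = F.normalize(z_target, dim=1)            # (B, D)
    z_a = F.normalize(z_aux, dim=1)               # (B, D)
    logits = z_t @ z_a.t() / temperature          # (B, B) similarities
    labels = torch.arange(z_t.size(0), device=z_t.device)
    return F.cross_entropy(logits, labels)

def danet_loss(viar_net, aux_nets, aux_weights,
               labeled_clips, labels, unlabeled_clips):
    # Supervised branch: cross-entropy on the few annotated videos.
    logits, _ = viar_net(labeled_clips)
    loss = F.cross_entropy(logits, labels)

    # Auxiliary branch: align the ViAR embedding with each frozen,
    # pre-trained image-task model, with a per-auxiliary weight.
    _, z_target = viar_net(unlabeled_clips)
    for w, aux in zip(aux_weights, aux_nets):
        with torch.no_grad():                     # auxiliaries stay frozen
            z_aux = aux(unlabeled_clips)
        loss = loss + w * info_nce(z_target, z_aux)
    return loss
```

Only `viar_net` receives gradients here, which matches the abstract's description of pre-trained auxiliaries guiding the target network rather than being fine-tuned themselves.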

Original language: English
Pages (from-to): 121-131
Number of pages: 11
Journal: Neural Networks
Volume: 158
DOI
Publication status: Published - Jan 2023
