TY - JOUR
T1 - DANet
T2 - Semi-supervised differentiated auxiliaries guided network for video action recognition
AU - Gao, Guangyu
AU - Liu, Ziming
AU - Zhang, Guangjun
AU - Li, Jinyang
AU - Qin, A. K.
N1 - Publisher Copyright:
© 2022 Elsevier Ltd
PY - 2023/1
Y1 - 2023/1
N2 - Video Action Recognition (ViAR) aims to identify the category of the human action observed in a given video. With the advent of Deep Learning (DL) techniques, notable performance breakthroughs have been achieved in this field. However, the success of most existing DL-based ViAR methods relies heavily on large amounts of annotated data, i.e., videos labeled with their action categories. In practice, obtaining sufficient annotations is often difficult due to the high cost of labeling, which can lead to significant performance degradation for these methods. To address this issue, we propose an end-to-end semi-supervised Differentiated Auxiliary guided Network (DANet) that makes the best use of a few annotated videos. In addition to standard supervised learning on the few annotated videos, DANet exploits the knowledge of multiple pre-trained auxiliary networks to optimize the ViAR network in a self-supervised way on the same videos with their annotations removed. Given the tight connection between video action recognition and classical static image-based visual tasks, the abundant knowledge in pre-trained static image-based models can be used to train the ViAR model. Specifically, DANet is a two-branch architecture consisting of a target branch, i.e., the ViAR network, and an auxiliary branch of multiple auxiliary networks, i.e., diverse off-the-shelf models for relevant image tasks. Given a limited number of annotated videos, we train the target ViAR network end-to-end in a semi-supervised way, namely with both a supervised cross-entropy loss on the annotated videos and per-auxiliary weighted self-supervised contrastive losses on the same videos without using their annotations. Furthermore, we explore different weightings of the auxiliary networks' guidance to the ViAR network to better reflect the different relationships between the image-based models and the ViAR model. Finally, we conduct extensive experiments on several popular action recognition benchmarks in comparison with existing state-of-the-art methods, and the results demonstrate the superiority of DANet over most of the compared methods. In particular, DANet surpasses state-of-the-art ViAR methods even with far fewer annotated videos.
KW - Action recognition
KW - Contrastive loss
KW - Semi-supervised learning
KW - Unannotated video
UR - http://www.scopus.com/inward/record.url?scp=85145555397&partnerID=8YFLogxK
U2 - 10.1016/j.neunet.2022.11.009
DO - 10.1016/j.neunet.2022.11.009
M3 - Article
C2 - 36455427
AN - SCOPUS:85145555397
SN - 0893-6080
VL - 158
SP - 121
EP - 131
JO - Neural Networks
JF - Neural Networks
ER -