TY - JOUR
T1 - Content-Attention Representation by Factorized Action-Scene Network for Action Recognition
AU - Hou, Jingyi
AU - Wu, Xinxiao
AU - Sun, Yuchao
AU - Jia, Yunde
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2018/6
Y1 - 2018/6
N2 - In video action recognition, irrelevant motions in the background can greatly degrade the performance of recognizing the specific actions with which we are actually concerned. In this paper, a novel deep neural network, called the factorized action-scene network (FASNet), is proposed to encode and fuse the most relevant and informative semantic cues for action recognition. Specifically, we decompose the FASNet into two components. One is a newly designed encoding network, named the content attention network (CANet), which encodes local spatio-temporal features to learn action representations that are robust to the noise of irrelevant motions. The other is a fusion network, which integrates the pretrained CANet to fuse the encoded spatio-temporal features with contextual scene features extracted from the same video, learning more descriptive and discriminative action representations. Moreover, unlike existing deep learning based methods for generic action recognition, which apply the softmax loss function as the training guidance, we formulate two loss functions to guide the proposed model toward more specific action recognition tasks, i.e., a multi-label correlation loss for multi-label action recognition and a triplet loss for complex event detection. Extensive experiments on the Hollywood2 dataset and the TRECVID MEDTest 14 dataset show that our method achieves superior performance compared with state-of-the-art methods.
AB - In video action recognition, irrelevant motions in the background can greatly degrade the performance of recognizing the specific actions with which we are actually concerned. In this paper, a novel deep neural network, called the factorized action-scene network (FASNet), is proposed to encode and fuse the most relevant and informative semantic cues for action recognition. Specifically, we decompose the FASNet into two components. One is a newly designed encoding network, named the content attention network (CANet), which encodes local spatio-temporal features to learn action representations that are robust to the noise of irrelevant motions. The other is a fusion network, which integrates the pretrained CANet to fuse the encoded spatio-temporal features with contextual scene features extracted from the same video, learning more descriptive and discriminative action representations. Moreover, unlike existing deep learning based methods for generic action recognition, which apply the softmax loss function as the training guidance, we formulate two loss functions to guide the proposed model toward more specific action recognition tasks, i.e., a multi-label correlation loss for multi-label action recognition and a triplet loss for complex event detection. Extensive experiments on the Hollywood2 dataset and the TRECVID MEDTest 14 dataset show that our method achieves superior performance compared with state-of-the-art methods.
KW - Deep neural network
KW - complex event detection
KW - multi-label action recognition
UR - http://www.scopus.com/inward/record.url?scp=85033724652&partnerID=8YFLogxK
U2 - 10.1109/TMM.2017.2771462
DO - 10.1109/TMM.2017.2771462
M3 - Article
AN - SCOPUS:85033724652
SN - 1520-9210
VL - 20
SP - 1537
EP - 1547
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
IS - 6
ER -