TY - JOUR
T1 - Combining multiple deep cues for action recognition
AU - Wang, Ruiqi
AU - Wu, Xinxiao
N1 - Publisher Copyright:
© 2018, Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2019/4/1
Y1 - 2019/4/1
AB - In this paper, we propose a novel deep-learning-based framework that fuses multiple cues of action motion, objects, and scenes for complex action recognition. Since deep features achieve promising results, three deep representations are extracted to capture both the temporal and the contextual information of actions. In particular, for the action cue, we first adopt a deep detection model to detect persons frame by frame and then feed the deep representations of the persons into a Gated Recurrent Unit (GRU) model to generate the action features. Unlike existing deep action features, our feature can model the global dynamics of long human motions. The scene and object cues are represented by deep features pooled over all the frames of a video. Moreover, we introduce an ℓp-norm multiple kernel learning method that effectively combines the multiple deep representations of a video to learn robust action classifiers by capturing the contextual relationships among actions, objects, and scenes. Extensive experiments on two real-world action datasets (i.e., UCF101 and HMDB51) clearly demonstrate the effectiveness of our method.
KW - Action recognition
KW - Multiple deep cues
KW - ℓp-norm multiple kernel learning
UR - http://www.scopus.com/inward/record.url?scp=85053393371&partnerID=8YFLogxK
U2 - 10.1007/s11042-018-6509-0
DO - 10.1007/s11042-018-6509-0
M3 - Article
AN - SCOPUS:85053393371
SN - 1380-7501
VL - 78
SP - 9933
EP - 9950
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 8
ER -