TY - GEN
T1 - Multimedia event detection via deep spatial-temporal neural networks
AU - Hou, Jingyi
AU - Wu, Xinxiao
AU - Yu, Feiwu
AU - Jia, Yunde
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2016/8/25
Y1 - 2016/8/25
N2 - This paper proposes a novel multimedia event detection method using deep spatial-temporal neural networks built on deep Convolutional Neural Networks (CNNs). To fully exploit the motion and appearance information of events in videos, our networks contain two branches: a temporal neural network and a spatial neural network. The temporal neural network captures motion information using Recurrent Neural Networks with a variant of the gated recurrent unit. The spatial neural network captures object information using a deep CNN, encoding the CNN features as a bag of semantics for more discriminative representations. The temporal and spatial features complement each other for event detection in a fully coupled way. Finally, we employ generalized multiple kernel learning to effectively fuse these two types of heterogeneous and complementary features for event detection. Experiments on the TRECVID MEDTest 14 dataset show that our method outperforms the state of the art.
AB - This paper proposes a novel multimedia event detection method using deep spatial-temporal neural networks built on deep Convolutional Neural Networks (CNNs). To fully exploit the motion and appearance information of events in videos, our networks contain two branches: a temporal neural network and a spatial neural network. The temporal neural network captures motion information using Recurrent Neural Networks with a variant of the gated recurrent unit. The spatial neural network captures object information using a deep CNN, encoding the CNN features as a bag of semantics for more discriminative representations. The temporal and spatial features complement each other for event detection in a fully coupled way. Finally, we employ generalized multiple kernel learning to effectively fuse these two types of heterogeneous and complementary features for event detection. Experiments on the TRECVID MEDTest 14 dataset show that our method outperforms the state of the art.
KW - multimedia event detection
KW - recurrent neural networks
KW - spatial-temporal networks
UR - http://www.scopus.com/inward/record.url?scp=84987621313&partnerID=8YFLogxK
U2 - 10.1109/ICME.2016.7552981
DO - 10.1109/ICME.2016.7552981
M3 - Conference contribution
AN - SCOPUS:84987621313
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2016 IEEE International Conference on Multimedia and Expo, ICME 2016
PB - IEEE Computer Society
T2 - 2016 IEEE International Conference on Multimedia and Expo, ICME 2016
Y2 - 11 July 2016 through 15 July 2016
ER -