TY - GEN
T1 - 3D convolutional two-stream network for action recognition in videos
AU - Li, Min
AU - Qi, Yuezhu
AU - Yang, Jian
AU - Zhang, Yanfang
AU - Ren, Junxing
AU - Du, Hong
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/11
Y1 - 2019/11
N2 - In recent years, action recognition based on two-stream networks has developed rapidly. However, most existing methods describe incomplete and distorted video content because they extract features from cropped and warped frames or at the clip level. This paper proposes a deep-learning approach that preserves the complete contextual relation of temporal human actions in videos. The proposed architecture follows the two-stream network, with novel 3D Convolutional Networks (ConvNets) and a pyramid pooling layer, to form an end-to-end behavioral feature learning method. The 3D ConvNets extract video-level, spatial-temporal features from two input streams: the RGB images and the corresponding optical flow. The multi-scale pyramid pooling layer fixes the generated feature maps to a unified size regardless of the input video size. The final predictions result from the fused softmax scores of the two streams, weighted by a per-stream factor. Our experimental results suggest that the spatial stream performs slightly better than the temporal stream, and that the performance of the trained model is conditionally optimized. The proposed method is evaluated on two challenging video action datasets, UCF101 and HMDB51, where it achieves state-of-the-art performance above 96.1% on the UCF101 dataset.
AB - In recent years, action recognition based on two-stream networks has developed rapidly. However, most existing methods describe incomplete and distorted video content because they extract features from cropped and warped frames or at the clip level. This paper proposes a deep-learning approach that preserves the complete contextual relation of temporal human actions in videos. The proposed architecture follows the two-stream network, with novel 3D Convolutional Networks (ConvNets) and a pyramid pooling layer, to form an end-to-end behavioral feature learning method. The 3D ConvNets extract video-level, spatial-temporal features from two input streams: the RGB images and the corresponding optical flow. The multi-scale pyramid pooling layer fixes the generated feature maps to a unified size regardless of the input video size. The final predictions result from the fused softmax scores of the two streams, weighted by a per-stream factor. Our experimental results suggest that the spatial stream performs slightly better than the temporal stream, and that the performance of the trained model is conditionally optimized. The proposed method is evaluated on two challenging video action datasets, UCF101 and HMDB51, where it achieves state-of-the-art performance above 96.1% on the UCF101 dataset.
KW - 3D ConvNets
KW - Action recognition
KW - Pyramid pooling layer
KW - Video-level feature representation
UR - http://www.scopus.com/inward/record.url?scp=85081082840&partnerID=8YFLogxK
U2 - 10.1109/ICTAI.2019.00250
DO - 10.1109/ICTAI.2019.00250
M3 - Conference contribution
AN - SCOPUS:85081082840
T3 - Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI
SP - 1697
EP - 1701
BT - Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019
PB - IEEE Computer Society
T2 - 31st IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2019
Y2 - 4 November 2019 through 6 November 2019
ER -