3D convolutional two-stream network for action recognition in videos

Min Li; Yuezhu Qi; Jian Yang; Yanfang Zhang; Junxing Ren; Hong Du

doi:10.1109/ICTAI.2019.00250

School of Optics and Photonics

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Citations (Scopus)

Abstract

In recent years, action recognition based on two-stream networks has developed rapidly. However, most existing methods describe video content incompletely and with distortion, because they extract frame-level or clip-level features from cropped and warped inputs. This paper proposes a deep learning approach that preserves the complete temporal context of human actions in videos. The proposed architecture follows the two-stream design and combines a novel 3D convolutional network (3D ConvNet) with a pyramid pooling layer to form an end-to-end method for learning behavioral features. The 3D ConvNets extract video-level spatio-temporal features from two input streams: the RGB images and the corresponding optical flow. The multi-scale pyramid pooling layer maps the generated feature maps to a fixed size regardless of the input video size. The final predictions are obtained by fusing the softmax scores of the two streams, with each stream scaled by its weighting factor. Our experimental results favor the spatial stream slightly over the temporal stream, and the performance of the trained model is conditionally optimized. The proposed method is evaluated on two challenging video action datasets, UCF101 and HMDB51, and achieves state-of-the-art performance of above 96.1% on UCF101.
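The two mechanisms named in the abstract, multi-scale pyramid pooling to a fixed size and weighted fusion of per-stream softmax scores, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not give the pyramid levels, the pooling operator, or the fusion weights, so the levels (1, 2, 4), the use of max pooling over all three axes, the linear weighting, and every name below (bin_edges, pyramid_pool_3d, fuse_streams, spatial_weight) are assumptions chosen for illustration.

```python
import math


def bin_edges(size, bins):
    """Split range(size) into `bins` contiguous, non-empty windows.

    Windows may overlap when `size` is not a multiple of `bins`.
    """
    edges = []
    for i in range(bins):
        start = (i * size) // bins
        end = ((i + 1) * size + bins - 1) // bins  # ceiling division
        edges.append((start, end))
    return edges


def pyramid_pool_3d(volume, levels=(1, 2, 4)):
    """Max-pool one T x H x W feature map (nested lists) into a fixed-length list.

    Each level n divides every axis into n bins, so the output length is
    sum(n ** 3 for n in levels) no matter how large the input is.
    The levels are an assumption, not values taken from the paper.
    """
    t_size, h_size, w_size = len(volume), len(volume[0]), len(volume[0][0])
    pooled = []
    for n in levels:
        for t0, t1 in bin_edges(t_size, n):
            for h0, h1 in bin_edges(h_size, n):
                for w0, w1 in bin_edges(w_size, n):
                    pooled.append(max(
                        volume[t][h][w]
                        for t in range(t0, t1)
                        for h in range(h0, h1)
                        for w in range(w0, w1)
                    ))
    return pooled


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(spatial_logits, temporal_logits, spatial_weight):
    """Weighted sum of the two streams' softmax scores.

    `spatial_weight` lies in [0, 1]; the temporal stream receives the rest.
    The linear form of the fusion is an assumption.
    """
    spatial = softmax(spatial_logits)
    temporal = softmax(temporal_logits)
    w = spatial_weight
    return [w * s + (1.0 - w) * t for s, t in zip(spatial, temporal)]


def make_volume(t, h, w):
    """Hypothetical helper: a T x H x W map filled with increasing values."""
    return [[[float((ti * h + hi) * w + wi) for wi in range(w)]
             for hi in range(h)]
            for ti in range(t)]


if __name__ == "__main__":
    short_clip = pyramid_pool_3d(make_volume(5, 7, 9))
    long_clip = pyramid_pool_3d(make_volume(16, 12, 10))
    print(len(short_clip), len(long_clip))  # both have the same length

    # The weight 0.6 is an arbitrary example value, not a reported result.
    scores = fuse_streams([2.0, 1.0, 0.1], [0.5, 2.5, 0.3], spatial_weight=0.6)
    print(scores.index(max(scores)))  # index of the predicted class
```

Because the bin boundaries scale with the input, clips of different length and resolution yield descriptors of identical length, which is what lets a fully connected classifier follow the pooling layer without cropping or warping the video.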

Original language	English
Title of host publication	Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019
Publisher	IEEE Computer Society
Pages	1697-1701
Number of pages	5
ISBN (Electronic)	9781728137988
DOIs	https://doi.org/10.1109/ICTAI.2019.00250
Publication status	Published - Nov 2019
Event	31st IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2019 - Portland, United States
Duration	4 Nov 2019 → 6 Nov 2019

Publication series

Name	Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI
Volume	2019-November
ISSN (Print)	1082-3409

Conference

Conference	31st IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2019
Country/Territory	United States
City	Portland
Period	4/11/19 → 6/11/19

Keywords

3D ConvNets
Action recognition
Pyramid pooling layer
Video-level feature representation

Access to Document

10.1109/ICTAI.2019.00250

Cite this

Li, M., Qi, Y., Yang, J., Zhang, Y., Ren, J., & Du, H. (2019). 3D convolutional two-stream network for action recognition in videos. In Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019 (pp. 1697-1701). Article 8995257 (Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI; Vol. 2019-November). IEEE Computer Society. https://doi.org/10.1109/ICTAI.2019.00250

@inproceedings{32c710baec46444c9893412f8fcc07a9,

title = "3D convolutional two-stream network for action recognition in videos",

keywords = "3D ConvNets, Action recognition, Pyramid pooling layer, Video-level feature representation",

author = "Min Li and Yuezhu Qi and Jian Yang and Yanfang Zhang and Junxing Ren and Hong Du",

note = "Publisher Copyright: {\textcopyright} 2019 IEEE.; 31st IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2019 ; Conference date: 04-11-2019 Through 06-11-2019",

year = "2019",

month = nov,

doi = "10.1109/ICTAI.2019.00250",

language = "English",

series = "Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI",

publisher = "IEEE Computer Society",

pages = "1697--1701",

booktitle = "Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019",

address = "United States",

}

Li, M, Qi, Y, Yang, J, Zhang, Y, Ren, J & Du, H 2019, 3D convolutional two-stream network for action recognition in videos. in Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019., 8995257, Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI, vol. 2019-November, IEEE Computer Society, pp. 1697-1701, 31st IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2019, Portland, United States, 4/11/19. https://doi.org/10.1109/ICTAI.2019.00250

3D convolutional two-stream network for action recognition in videos. / Li, Min; Qi, Yuezhu; Yang, Jian et al.
Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019. IEEE Computer Society, 2019. p. 1697-1701 8995257 (Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI; Vol. 2019-November).

TY - GEN

T1 - 3D convolutional two-stream network for action recognition in videos

AU - Li, Min

AU - Qi, Yuezhu

AU - Yang, Jian

AU - Zhang, Yanfang

AU - Ren, Junxing

AU - Du, Hong

PY - 2019/11

Y1 - 2019/11

KW - 3D ConvNets

KW - Action recognition

KW - Pyramid pooling layer

KW - Video-level feature representation

UR - http://www.scopus.com/inward/record.url?scp=85081082840&partnerID=8YFLogxK

U2 - 10.1109/ICTAI.2019.00250

DO - 10.1109/ICTAI.2019.00250

M3 - Conference contribution

AN - SCOPUS:85081082840

T3 - Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI

SP - 1697

EP - 1701

BT - Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019

PB - IEEE Computer Society

T2 - 31st IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2019

Y2 - 4 November 2019 through 6 November 2019

ER -

Li M, Qi Y, Yang J, Zhang Y, Ren J, Du H. 3D convolutional two-stream network for action recognition in videos. In Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019. IEEE Computer Society. 2019. p. 1697-1701. 8995257. (Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI). doi: 10.1109/ICTAI.2019.00250
