3D convolutional two-stream network for action recognition in videos

Min Li; Yuezhu Qi; Jian Yang; Yanfang Zhang; Junxing Ren; Hong Du

doi:10.1109/ICTAI.2019.00250

School of Optoelectronics

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

4 Citations (Scopus)

Abstract

In recent years, action recognition based on two-stream networks has developed rapidly. However, most existing methods describe video content incompletely and with distortion, because they extract features at the frame or clip level from cropped and warped inputs. This paper proposes a deep learning approach that preserves the complete contextual relations of temporal human actions in videos. The proposed architecture follows the two-stream design, combining novel 3D Convolutional Networks (ConvNets) with a pyramid pooling layer to form an end-to-end behavioral feature learning method. The 3D ConvNets extract video-level spatio-temporal features from two input streams: the RGB images and the corresponding optical flow. The multi-scale pyramid pooling layer maps the generated feature maps to a unified size regardless of the input video size. The final predictions are obtained by fusing the softmax scores of the two streams, subject to the weighting factor of each stream. Our experimental results suggest weighting the spatial stream slightly higher than the temporal stream, under which the performance of the trained model is conditionally optimized. The proposed method is evaluated on two challenging video action datasets, UCF101 and HMDB51, and achieves state-of-the-art accuracy above 96.1% on UCF101.
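The two mechanisms the abstract names, multi-scale pyramid pooling to a fixed size and weighted fusion of the two streams' softmax scores, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the pyramid levels (1, 2, 4), the spatial weight of 0.6, and the use of 2D max pooling on a single feature map (rather than spatio-temporal pooling inside a 3D ConvNet) are all assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(spatial_logits, temporal_logits, spatial_weight=0.6):
    """Weighted average of the softmax scores of the two streams.

    spatial_weight is a hypothetical value in [0, 1]; the temporal
    stream receives the remaining 1 - spatial_weight.
    """
    if not 0.0 <= spatial_weight <= 1.0:
        raise ValueError("spatial_weight must lie in [0, 1]")
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same classes")
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return [
        spatial_weight * s + (1.0 - spatial_weight) * t
        for s, t in zip(p_spatial, p_temporal)
    ]


def pyramid_max_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map into an n x n grid for each level n.

    The output length is sum(n * n for n in levels), whatever the
    input height and width are (assumed: a non-empty rectangular map).
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin edges use floor for the start and ceil for the end,
            # so every bin covers at least one row and one column.
            r0 = (i * height) // n
            r1 = ((i + 1) * height + n - 1) // n
            for j in range(n):
                c0 = (j * width) // n
                c1 = ((j + 1) * width + n - 1) // n
                pooled.append(
                    max(
                        feature_map[r][c]
                        for r in range(r0, r1)
                        for c in range(c0, c1)
                    )
                )
    return pooled


# Usage: two feature maps of different sizes yield vectors of equal length,
# and the fused scores select the predicted class.
small_map = [[float(r * 3 + c) for c in range(3)] for r in range(3)]
large_map = [[float(r * 7 + c) for c in range(7)] for r in range(5)]
fused = fuse_streams([2.0, 1.0, 0.1], [0.5, 2.5, 0.3], spatial_weight=0.6)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because each stream's softmax scores sum to one and the two weights sum to one, the fused scores remain a valid probability distribution, so the weighting factor can be tuned without any renormalization.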

Original language	English
Title of host publication	Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019
Publisher	IEEE Computer Society
Pages	1697-1701
Number of pages	5
ISBN (electronic)	9781728137988
DOI	https://doi.org/10.1109/ICTAI.2019.00250
Publication status	Published - Nov 2019
Event	31st IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2019 - Portland, United States. Duration: 4 Nov 2019 → 6 Nov 2019

Publication series

Name	Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI
Volume	2019-November
ISSN (print)	1082-3409

Conference

Conference	31st IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2019
Country/Territory	United States
City	Portland
Period	4/11/19 → 6/11/19

Access to document

10.1109/ICTAI.2019.00250

Cite this

Li, M., Qi, Y., Yang, J., Zhang, Y., Ren, J., & Du, H. (2019). 3D convolutional two-stream network for action recognition in videos. In Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019 (pp. 1697-1701). Article 8995257 (Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI; Vol. 2019-November). IEEE Computer Society. https://doi.org/10.1109/ICTAI.2019.00250

@inproceedings{32c710baec46444c9893412f8fcc07a9,

title = "3D convolutional two-stream network for action recognition in videos",

keywords = "3D ConvNets, Action recognition, Pyramid pooling layer, Video-level feature representation",

author = "Min Li and Yuezhu Qi and Jian Yang and Yanfang Zhang and Junxing Ren and Hong Du",

note = "Publisher Copyright: {\textcopyright} 2019 IEEE.; 31st IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2019 ; Conference date: 04-11-2019 Through 06-11-2019",

year = "2019",

month = nov,

doi = "10.1109/ICTAI.2019.00250",

language = "English",

series = "Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI",

publisher = "IEEE Computer Society",

pages = "1697--1701",

booktitle = "Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019",

address = "United States",

}

Li, M, Qi, Y, Yang, J, Zhang, Y, Ren, J & Du, H 2019, 3D convolutional two-stream network for action recognition in videos. in Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019., 8995257, Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI, vol. 2019-November, IEEE Computer Society, pp. 1697-1701, 31st IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2019, Portland, United States, 4/11/19. https://doi.org/10.1109/ICTAI.2019.00250

3D convolutional two-stream network for action recognition in videos. / Li, Min; Qi, Yuezhu; Yang, Jian et al.
Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019. IEEE Computer Society, 2019. p. 1697-1701 8995257 (Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI; Vol. 2019-November).

TY - GEN

T1 - 3D convolutional two-stream network for action recognition in videos

AU - Li, Min

AU - Qi, Yuezhu

AU - Yang, Jian

AU - Zhang, Yanfang

AU - Ren, Junxing

AU - Du, Hong

PY - 2019/11

Y1 - 2019/11

KW - 3D ConvNets

KW - Action recognition

KW - Pyramid pooling layer

KW - Video-level feature representation

UR - http://www.scopus.com/inward/record.url?scp=85081082840&partnerID=8YFLogxK

U2 - 10.1109/ICTAI.2019.00250

DO - 10.1109/ICTAI.2019.00250

M3 - Conference contribution

AN - SCOPUS:85081082840

T3 - Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI

SP - 1697

EP - 1701

BT - Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019

PB - IEEE Computer Society

T2 - 31st IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2019

Y2 - 4 November 2019 through 6 November 2019

ER -

Li M, Qi Y, Yang J, Zhang Y, Ren J, Du H. 3D convolutional two-stream network for action recognition in videos. In Proceedings - IEEE 31st International Conference on Tools with Artificial Intelligence, ICTAI 2019. IEEE Computer Society. 2019. p. 1697-1701. 8995257. (Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI). doi: 10.1109/ICTAI.2019.00250
