Spatiotemporal pyramid pooling in 3D convolutional neural networks for action recognition

Cheng Cheng, Pin Lv, Bing Su

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Citations (Scopus)

Abstract

Deep 3-dimensional convolutional networks (3D ConvNets) trained on large-scale video datasets have achieved promising results on action recognition. This paper improves their performance by introducing spatiotemporal pyramid pooling. Specifically, we propose a spatiotemporal pyramid pooling layer to handle the temporal variations of video sequences. Based on this layer, we develop a new network architecture, called STPP-net, by incorporating the layer into 3D ConvNets. The proposed network is robust to spatial and temporal variations of human actions and generates a fixed-dimensional representation regardless of video size or scale. We show that the new architecture outperforms the original 3D ConvNets by a large margin on three large-scale video classification/action recognition benchmarks: HMDB51, UCF101, and Kinetics.

Original language: English
Title of host publication: 2018 IEEE International Conference on Image Processing, ICIP 2018 - Proceedings
Publisher: IEEE Computer Society
Pages: 3468-3472
Number of pages: 5
ISBN (Electronic): 9781479970612
DOI: https://doi.org/10.1109/ICIP.2018.8451625
Publication status: Published - 29 Aug 2018
Externally published: Yes
Event: 25th IEEE International Conference on Image Processing, ICIP 2018 - Athens, Greece
Duration: 7 Oct 2018 to 10 Oct 2018

Publication series

Name: Proceedings - International Conference on Image Processing, ICIP
ISSN (Print): 1522-4880

Conference

Conference: 25th IEEE International Conference on Image Processing, ICIP 2018
Country/Territory: Greece
City: Athens
Period: 7/10/18 to 10/10/18

Keywords

  • 3D Convolutional Neural Networks
  • Spatiotemporal Pyramid Pooling
  • Video Recognition

Cite this

Cheng, C., Lv, P., & Su, B. (2018). Spatiotemporal pyramid pooling in 3D convolutional neural networks for action recognition. In 2018 IEEE International Conference on Image Processing, ICIP 2018 - Proceedings (pp. 3468-3472). Article 8451625 (Proceedings - International Conference on Image Processing, ICIP). IEEE Computer Society. https://doi.org/10.1109/ICIP.2018.8451625