Spatiotemporal pyramid pooling in 3D convolutional neural networks for action recognition

Cheng Cheng; Pin Lv; Bing Su

doi:10.1109/ICIP.2018.8451625

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

9 Citations (Scopus)

Abstract

Deep 3-dimensional convolutional networks (3D ConvNets) trained on large-scale video datasets have achieved promising results on action recognition. This paper improves their performance by incorporating spatiotemporal pyramid pooling. Specifically, we propose a spatiotemporal pyramid pooling layer to handle the temporal variations of video sequences. Based on this layer, we develop a new network architecture, called STPP-net, by integrating it into 3D ConvNets. The proposed network is robust to spatial and temporal variations of human actions and can generate a fixed-dimensional representation regardless of video size or scale. We show that our new network architecture outperforms the original 3D ConvNets by a large margin on three large-scale video classification/action recognition benchmarks: HMDB51, UCF101, and Kinetics.

Original language	English
Title of host publication	2018 IEEE International Conference on Image Processing, ICIP 2018 - Proceedings
Publisher	IEEE Computer Society
Pages	3468-3472
Number of pages	5
ISBN (Electronic)	9781479970612
DOIs	https://doi.org/10.1109/ICIP.2018.8451625
Publication status	Published - 29 Aug 2018
Externally published	Yes
Event	25th IEEE International Conference on Image Processing, ICIP 2018 - Athens, Greece
Duration	7 Oct 2018 → 10 Oct 2018

Publication series

Name	Proceedings - International Conference on Image Processing, ICIP
ISSN (Print)	1522-4880

Conference

Conference	25th IEEE International Conference on Image Processing, ICIP 2018
Country/Territory	Greece
City	Athens
Period	7/10/18 → 10/10/18

Keywords

3D Convolutional Neural Networks
Spatiotemporal Pyramid Pooling
Video Recognition

Cite this

Cheng, C., Lv, P., & Su, B. (2018). Spatiotemporal pyramid pooling in 3D convolutional neural networks for action recognition. In 2018 IEEE International Conference on Image Processing, ICIP 2018 - Proceedings (pp. 3468-3472). Article 8451625 (Proceedings - International Conference on Image Processing, ICIP). IEEE Computer Society. https://doi.org/10.1109/ICIP.2018.8451625

@inproceedings{2618feb96f404b78aa59f09882832615,
  title = "Spatiotemporal pyramid pooling in 3D convolutional neural networks for action recognition",
  abstract = "Deep 3-dimensional convolutional networks (3D ConvNets) trained on large scale video datasets have achieved promising results on action recognition. This paper improves their performance by taking into account the spatiotemporal pyramid pooling. Specifically, we propose the spatiotemporal pyramid pooling layer to tackle the temporal variations of video sequences. Based on this layer, we develop a new network architecture, called STPP-net, by incorporating it with 3D ConvNets. The proposed network is robust to spatial and temporal variation of human actions and can generate a fixed-dimensional representation regardless of video size/scale. We show that our new network architecture outperforms the original 3D ConvNets by a large margin on three large-scale video classification/action recognition benchmarks including HMDB51, UCF101, and Kinetics.",
  keywords = "3D Convolutional Neural Networks, Spatiotemporal Pyramid Pooling, Video Recognition",
  author = "Cheng Cheng and Pin Lv and Bing Su",
  note = "Publisher Copyright: {\textcopyright} 2018 IEEE.; 25th IEEE International Conference on Image Processing, ICIP 2018 ; Conference date: 07-10-2018 Through 10-10-2018",
  year = "2018",
  month = aug,
  day = "29",
  doi = "10.1109/ICIP.2018.8451625",
  language = "English",
  series = "Proceedings - International Conference on Image Processing, ICIP",
  publisher = "IEEE Computer Society",
  pages = "3468--3472",
  booktitle = "2018 IEEE International Conference on Image Processing, ICIP 2018 - Proceedings",
  address = "United States",
}

TY - GEN

T1 - Spatiotemporal pyramid pooling in 3D convolutional neural networks for action recognition

AU - Cheng, Cheng

AU - Lv, Pin

AU - Su, Bing

PY - 2018/8/29

Y1 - 2018/8/29

N2 - Deep 3-dimensional convolutional networks (3D ConvNets) trained on large scale video datasets have achieved promising results on action recognition. This paper improves their performance by taking into account the spatiotemporal pyramid pooling. Specifically, we propose the spatiotemporal pyramid pooling layer to tackle the temporal variations of video sequences. Based on this layer, we develop a new network architecture, called STPP-net, by incorporating it with 3D ConvNets. The proposed network is robust to spatial and temporal variation of human actions and can generate a fixed-dimensional representation regardless of video size/scale. We show that our new network architecture outperforms the original 3D ConvNets by a large margin on three large-scale video classification/action recognition benchmarks including HMDB51, UCF101, and Kinetics.

KW - 3D Convolutional Neural Networks

KW - Spatiotemporal Pyramid Pooling

KW - Video Recognition

UR - http://www.scopus.com/inward/record.url?scp=85062911341&partnerID=8YFLogxK

U2 - 10.1109/ICIP.2018.8451625

DO - 10.1109/ICIP.2018.8451625

M3 - Conference contribution

AN - SCOPUS:85062911341

T3 - Proceedings - International Conference on Image Processing, ICIP

SP - 3468

EP - 3472

BT - 2018 IEEE International Conference on Image Processing, ICIP 2018 - Proceedings

PB - IEEE Computer Society

T2 - 25th IEEE International Conference on Image Processing, ICIP 2018

Y2 - 7 October 2018 through 10 October 2018

ER -
