Abstract
Human action recognition based on deep-learning methods has received increasing attention and developed rapidly. However, current methods suffer from drawbacks such as the confusion caused by convolving over time and space independently, the ability to process only short sequences, and modeling restricted to a single temporal scale. The key to precisely classifying actions is to capture appearance and motion throughout entire videos. To this end, a multi-branch spatial-temporal network (MSTN) is proposed, consisting of a multi-branch deep network and a long-term feature (LTF) layer. The benefits of the proposed MSTN include: (a) the multi-branch spatial-temporal network encodes spatial and temporal information simultaneously, and (b) the LTF layer aggregates a video-level representation over multiple temporal scales. Evaluations on two action datasets and comparisons with several state-of-the-art approaches demonstrate the effectiveness of the proposed network.
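The paper itself specifies the LTF layer's design; purely as an illustration of the general idea of aggregating frame features over multiple temporal scales into a single video-level representation, here is a minimal NumPy sketch (the function name, the segment counts, and the mean-pooling scheme are assumptions for illustration, not the authors' implementation):

```python
import numpy as np

def multiscale_temporal_pool(frame_feats, scales=(1, 2, 4)):
    """Aggregate per-frame features into one video-level vector.

    frame_feats: array of shape (T, D), one D-dim feature per frame.
    scales: for each scale s, split the timeline into s equal segments
            and mean-pool each segment (a stand-in for multi-scale
            temporal aggregation; not the paper's actual LTF layer).
    Returns a vector of length D * sum(scales).
    """
    T, D = frame_feats.shape
    parts = []
    for s in scales:
        # Segment boundaries for s equal temporal chunks.
        bounds = np.linspace(0, T, s + 1).astype(int)
        for i in range(s):
            segment = frame_feats[bounds[i]:bounds[i + 1]]
            parts.append(segment.mean(axis=0))
    return np.concatenate(parts)

# Example: 16 frames of 8-dim features -> (1 + 2 + 4) * 8 = 56-dim descriptor
feats = np.random.randn(16, 8)
video_vec = multiscale_temporal_pool(feats)
print(video_vec.shape)  # (56,)
```

At scale 1 this reduces to global average pooling over the whole clip, while the finer scales preserve coarse temporal ordering, which is the intuition behind representing a video at multiple temporal granularities.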
Original language | English
---|---
Article number | 8832232
Pages (from-to) | 1556-1560
Number of pages | 5
Journal | IEEE Signal Processing Letters
Volume | 26
Issue number | 10
DOIs |
Publication status | Published - Oct 2019
Keywords
- Action recognition
- deep learning
- long-term feature layer
- spatial-temporal network