Multi-Branch Spatial-Temporal Network for Action Recognition

Yingying Wang, Wei Li*, Ran Tao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

13 Citations (Scopus)

Abstract

Human action recognition based on deep-learning methods has received increasing attention and developed rapidly. However, current methods suffer from several limitations: convolving over time and space independently confuses appearance and motion cues, only short sequences are processed, and modeling is restricted to a single temporal scale. The key to precisely classifying actions is to capture appearance and motion throughout entire videos. To this end, a multi-branch spatial-temporal network (MSTN) is proposed, consisting of a multi-branch deep network and a long-term feature (LTF) layer. The benefits of the proposed MSTN are twofold: (a) the multi-branch spatial-temporal network encodes spatial and temporal information simultaneously, and (b) the LTF layer aggregates a video-level representation over multiple temporal scales. Evaluations on two action datasets and comparisons with several state-of-the-art approaches demonstrate the effectiveness of the proposed network.
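The abstract describes the architecture only at a high level. As a rough illustration of the two ideas it names, parallel spatial and temporal branches plus multi-scale temporal aggregation, the PyTorch sketch below is a toy approximation and not the authors' MSTN: the class name MSTNSketch, the branch definitions, and the ltf_pool helper are hypothetical stand-ins, and the actual network and LTF layer in the paper differ.

```python
# Illustrative sketch only -- not the paper's released code. Assumes video
# input shaped (batch, channels, time, height, width) and uses 3D convolutions
# to stand in for the spatial and temporal branches described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MSTNSketch(nn.Module):
    """Toy two-branch network with multi-scale temporal pooling."""

    def __init__(self, num_classes: int = 101, width: int = 32,
                 scales=(1, 2, 4)):
        super().__init__()
        # Spatial branch: convolves within each frame (kernel size 1 on the time axis).
        self.spatial = nn.Sequential(
            nn.Conv3d(3, width, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(width), nn.ReLU(inplace=True),
        )
        # Temporal branch: convolves across frames (kernel size 3 on the time axis).
        self.temporal = nn.Sequential(
            nn.Conv3d(3, width, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.BatchNorm3d(width), nn.ReLU(inplace=True),
        )
        self.scales = scales
        # Classifier over features from both branches, pooled at every scale.
        self.fc = nn.Linear(2 * width * sum(scales), num_classes)

    def ltf_pool(self, x: torch.Tensor) -> torch.Tensor:
        """Hypothetical long-term-feature pooling: average spatially and keep
        s temporal segments per scale, then concatenate across scales."""
        feats = []
        for s in self.scales:
            pooled = F.adaptive_avg_pool3d(x, (s, 1, 1))  # (B, C, s, 1, 1)
            feats.append(pooled.flatten(1))               # (B, C * s)
        return torch.cat(feats, dim=1)                    # (B, C * sum(scales))

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, 3, T, H, W); branch outputs are concatenated on channels.
        fused = torch.cat([self.spatial(video), self.temporal(video)], dim=1)
        return self.fc(self.ltf_pool(fused))


if __name__ == "__main__":
    clip = torch.randn(2, 3, 16, 112, 112)  # two clips of 16 RGB frames
    logits = MSTNSketch()(clip)
    print(logits.shape)                     # torch.Size([2, 101])
```

The point of the sketch is the shape of the computation: each branch sees the whole clip, and the multi-scale pooling step yields a fixed-length, video-level feature regardless of clip length.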

Original language: English
Article number: 8832232
Pages (from-to): 1556-1560
Number of pages: 5
Journal: IEEE Signal Processing Letters
Volume: 26
Issue: 10
DOI
Publication status: Published - Oct 2019
