Temporal Memory Network towards Real-Time Video Understanding

Ziming Liu, Jinyang Li, Guangyu Gao*, Alex K. Qin

*Corresponding author of this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Action recognition is a basic task in video understanding. Although action recognition has achieved impressive performance on static image-based tasks (e.g., Stanford40) with deep learning, real-time video-based action recognition remains challenging due to the high complexity and computation cost of video. Motivated by humans' ability to recognize actions at a short glance, we propose the fast, lightweight Temporal Memory Network (TMNet) for real-time video action recognition. TMNet uses a self-supervised structure to explore both spatial and temporal information from a single video frame. It consists of three main parts: a base backbone, a regression branch, and a classification branch. Specifically, the base backbone is a shallow 2D CNN that produces the video's initial feature sequence. The classification branch is based on existing successful video recognition models (e.g., TSN, I3D). To let TMNet learn spatial-temporal information at lower cost, we add a self-supervised regression branch, which is a lightweight 2D CNN taking only one frame as input. In the training stage, the input to TMNet is a video sequence; the classification branch, combined with the base backbone, learns the sequence's spatial-temporal feature. Meanwhile, the self-supervised regression branch learns to predict the same spatial-temporal feature under the supervision of the classification branch's output, taking as input a single-frame feature sampled from the encoded video sequence. In this way, the regression branch is forced to infer the temporal information of adjacent frames from one frame, so at inference time TMNet needs only a single frame to predict each video's spatial-temporal information. As a result, TMNet achieves real-time action recognition with better accuracy by extracting temporal information from a static image. Extensive ablation experiments demonstrate that TMNet offers a good trade-off between accuracy and speed.
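The training scheme described in the abstract can be sketched as follows. This is a minimal, hypothetical numpy illustration, not the paper's implementation: the backbone, branch functions, feature dimensions, and the TSN-style average consensus are all placeholder assumptions. It only shows the data flow — the classification branch aggregates per-frame features into a sequence-level (teacher) feature, while the regression branch must regress that same feature from one sampled frame.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_IN, D_FEAT = 8, 64, 128  # frames, input dim, feature dim (arbitrary)

def backbone(frames):
    # Stand-in for the shallow 2D CNN backbone: a fixed linear map
    # applied per frame (hypothetical, for illustration only).
    W = np.full((D_IN, D_FEAT), 0.01)
    return frames @ W  # shape (T, D_FEAT)

def classification_branch(seq_feats):
    # Aggregate the T per-frame features into one spatial-temporal
    # feature; average consensus is assumed as a TSN-like stand-in.
    return seq_feats.mean(axis=0)  # shape (D_FEAT,)

def regression_branch(one_feat):
    # Lightweight head that predicts the sequence-level feature
    # from a single frame's feature (placeholder identity map).
    return one_feat @ np.eye(D_FEAT)

frames = rng.standard_normal((T, D_IN))
seq_feats = backbone(frames)
target = classification_branch(seq_feats)   # teacher: sequence feature
idx = int(rng.integers(T))                  # sample one frame's feature
pred = regression_branch(seq_feats[idx])    # student: single-frame prediction
reg_loss = float(np.mean((pred - target) ** 2))  # self-supervised regression loss
```

At inference, only `backbone` and `regression_branch` would run on a single frame, which is what makes the approach cheap enough for real time.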

Original language: English
Article number: 9288798
Pages (from-to): 223837-223847
Number of pages: 11
Journal: IEEE Access
Volume: 8
DOI
Publication status: Published - 2020
