TY - JOUR
T1 - Temporal Memory Network towards Real-Time Video Understanding
AU - Liu, Ziming
AU - Li, Jinyang
AU - Gao, Guangyu
AU - Qin, Alex K.
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2020
Y1 - 2020
N2 - Action recognition is a fundamental task in video understanding. Although deep learning has achieved impressive performance on static image-based action recognition (e.g., Stanford40), real-time video-based action recognition remains challenging due to the high complexity and computational cost of video. Motivated by the human ability to recognize an action from only a short glance, we propose the fast, lightweight Temporal Memory Network (TMNet) for real-time video action recognition. TMNet uses a self-supervised structure to explore both spatial and temporal information from a single video frame. TMNet consists of three main parts: the base backbone, the regression branch, and the classification branch. Specifically, the base backbone is a shallow 2D CNN that extracts the video's initial feature sequence. The classification branch is based on existing successful video recognition models (e.g., TSN, I3D). To let TMNet learn spatial-temporal information at a lower cost, we add a self-supervised regression branch, built on a lightweight 2D CNN that takes only one frame as input. In the training stage, TMNet takes a video sequence as input; the classification branch, together with the base backbone, learns the sequence's spatial-temporal feature. Meanwhile, the self-supervised regression branch learns to reproduce the same spatial-temporal feature under the supervision of the classification branch's output, taking as input a single-frame feature sampled from the encoded video sequence. In this way, the regression branch is forced to infer the temporal information of adjacent frames from a single frame. Therefore, in the inference stage, TMNet needs only one frame to predict each video's spatial-temporal information. As a result, TMNet achieves real-time action recognition with improved accuracy by extracting temporal information from a static image. Extensive ablation experiments demonstrate that TMNet achieves a good trade-off between accuracy and speed.
AB - Action recognition is a fundamental task in video understanding. Although deep learning has achieved impressive performance on static image-based action recognition (e.g., Stanford40), real-time video-based action recognition remains challenging due to the high complexity and computational cost of video. Motivated by the human ability to recognize an action from only a short glance, we propose the fast, lightweight Temporal Memory Network (TMNet) for real-time video action recognition. TMNet uses a self-supervised structure to explore both spatial and temporal information from a single video frame. TMNet consists of three main parts: the base backbone, the regression branch, and the classification branch. Specifically, the base backbone is a shallow 2D CNN that extracts the video's initial feature sequence. The classification branch is based on existing successful video recognition models (e.g., TSN, I3D). To let TMNet learn spatial-temporal information at a lower cost, we add a self-supervised regression branch, built on a lightweight 2D CNN that takes only one frame as input. In the training stage, TMNet takes a video sequence as input; the classification branch, together with the base backbone, learns the sequence's spatial-temporal feature. Meanwhile, the self-supervised regression branch learns to reproduce the same spatial-temporal feature under the supervision of the classification branch's output, taking as input a single-frame feature sampled from the encoded video sequence. In this way, the regression branch is forced to infer the temporal information of adjacent frames from a single frame. Therefore, in the inference stage, TMNet needs only one frame to predict each video's spatial-temporal information. As a result, TMNet achieves real-time action recognition with improved accuracy by extracting temporal information from a static image. Extensive ablation experiments demonstrate that TMNet achieves a good trade-off between accuracy and speed.
KW - Video action recognition
KW - real-time video understanding
KW - spatial-temporal feature
UR - http://www.scopus.com/inward/record.url?scp=85097954565&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2020.3043386
DO - 10.1109/ACCESS.2020.3043386
M3 - Article
AN - SCOPUS:85097954565
SN - 2169-3536
VL - 8
SP - 223837
EP - 223847
JO - IEEE Access
JF - IEEE Access
M1 - 9288798
ER -
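
Editor's note: to make the training scheme described in the abstract concrete, the sketch below gives a minimal PyTorch rendering of a TMNet-style model: a shallow 2D CNN backbone, a TSN-style classification branch with average consensus over frames, and a lightweight regression branch that maps one sampled frame feature onto the sequence-level spatial-temporal feature. All layer sizes, the average-consensus aggregation, the MSE regression loss, and the lambda_reg weight are illustrative assumptions; the abstract does not specify these details.

# Hedged sketch of a TMNet-style training step, assuming a TSN-style
# average consensus and an MSE regression loss (not specified in the paper).
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class TMNetSketch(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int = 256):
        super().__init__()
        # Base backbone: shallow 2D CNN applied to each frame independently.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> (B*T, feat_dim)
        )
        # Classification branch: average consensus over frames plus a linear
        # classifier, standing in for a TSN/I3D-based branch.
        self.classifier = nn.Linear(feat_dim, num_classes)
        # Self-supervised regression branch: lightweight head mapping a
        # single-frame feature to the sequence-level spatial-temporal feature.
        self.regressor = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, clip: torch.Tensor):
        # clip: (B, T, 3, H, W) video sequence.
        b, t = clip.shape[:2]
        frame_feats = self.backbone(clip.flatten(0, 1)).view(b, t, -1)
        seq_feat = frame_feats.mean(dim=1)                  # spatial-temporal feature
        logits = self.classifier(seq_feat)
        # Sample one encoded frame feature and regress the sequence feature.
        idx = random.randrange(t)
        pred_feat = self.regressor(frame_feats[:, idx])
        return logits, seq_feat, pred_feat

def training_step(model, clip, labels, lambda_reg: float = 1.0):
    logits, seq_feat, pred_feat = model(clip)
    cls_loss = F.cross_entropy(logits, labels)
    # The regression branch is supervised by the classification branch's
    # output; detach so the target is not pulled toward the prediction.
    reg_loss = F.mse_loss(pred_feat, seq_feat.detach())
    return cls_loss + lambda_reg * reg_loss

@torch.no_grad()
def infer_single_frame(model, frame):
    # Inference path: one frame -> backbone feature -> regression branch ->
    # predicted spatial-temporal feature -> classifier.
    feat = model.backbone(frame)                            # (B, feat_dim)
    return model.classifier(model.regressor(feat))

At inference, only infer_single_frame runs, which is what allows a model of this shape to classify a video from a single frame in real time.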