Lightweight Multiscale Spatiotemporal Locally Connected Graph Convolutional Networks for Single Human Motion Forecasting

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Human motion forecasting is an important and challenging task in many computer vision application domains. Recent work concentrates on exploiting the temporal processing ability of recurrent neural networks (RNNs) to achieve smooth and reliable results in short-term prediction. However, as evidenced by previous works, RNNs suffer from error accumulation, leading to unreliable results. In this paper, we propose a simple feed-forward deep neural network for motion prediction that takes into account temporal smoothness between frames and spatial dependencies between human body joints. We design Lightweight Multiscale Spatiotemporal Locally Connected Graph Convolutional Networks (MST-LCGCN) for single human motion forecasting to implicitly establish the spatiotemporal dependencies of human movement, where different scales fuse dynamically during training. The entire model is action-agnostic and follows an encoder-decoder framework. The encoder consists of temporal GCNs (TGCNs) to capture motion features between frames and locally connected spatial GCNs (SGCNs) to extract spatial structure among joints. The decoder uses temporal convolution networks (TCNs) to maintain extensibility for long-term prediction. Extensive experiments show that our approach outperforms previous methods on the Human3.6M and CMU Mocap datasets while requiring far fewer parameters.

Note to Practitioners: Accuracy and real-time performance are the two most significant evaluation factors for the challenge of human motion forecasting. Existing methods tend to use models with a huge number of parameters, sacrificing operation speed to obtain a small increase in accuracy. However, in practical scenarios, the slowdown in speed makes predictions meaningless. Therefore, we propose a lightweight MST-LCGCN network to learn human action patterns over time. To obtain higher accuracy, we extract features from both the spatial and temporal dimensions to capture more information; to obtain faster operation speed, we design our network to reduce unnecessary depth as much as possible. We demonstrate the advantages of our model in terms of efficiency and accuracy through extensive quantitative and qualitative experiments on two datasets. Our network can help robots avoid obstacles in advance and compensate for network delays, and we plan to apply it to real-world scenarios in the future.
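The abstract describes spatial GCN layers operating over the skeleton graph of body joints. As a rough illustration only (the shapes, the toy chain skeleton, and all variable names below are our assumptions, not the authors' implementation), one spatial graph-convolution step can be sketched as neighbor aggregation over a normalized adjacency matrix followed by a learned linear projection:

```python
import numpy as np

# Hypothetical shapes: J joints, F_IN input features, F_OUT output features.
J, F_IN, F_OUT = 5, 3, 4

# Toy skeleton adjacency with self-loops: a simple chain of 5 joints.
A = np.eye(J)
for i in range(J - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

# Symmetric normalization D^{-1/2} A D^{-1/2}, common in GCNs.
d = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(d, d))

rng = np.random.default_rng(0)
W = rng.standard_normal((F_IN, F_OUT))   # learnable weights (random stand-in)
X = rng.standard_normal((J, F_IN))       # joint features for one frame

# One spatial GCN layer: aggregate neighbor features, project, apply ReLU.
H = np.maximum(A_norm @ X @ W, 0.0)
print(H.shape)  # (5, 4)
```

In the paper's full model, such spatial layers are combined with temporal GCNs in the encoder and temporal convolutions in the decoder; this sketch covers only the single-frame spatial aggregation idea.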

Original language: English
Pages (from-to): 1-10
Number of pages: 10
Journal: IEEE Transactions on Automation Science and Engineering
DOIs
Publication status: Accepted/In press - 2023

Keywords

  • Convolution
  • Data mining
  • Feature extraction
  • Forecasting
  • GCN
  • Human motion forecasting
  • Kinematics
  • Predictive models
  • Spatiotemporal phenomena
  • lightweight
  • locally connected
  • multiscale
  • spatiotemporal
