Dual-stream structured graph convolution network for skeleton-based action recognition

Chunyan Xu; Rong Liu; Tong Zhang; Zhen Cui; Jian Yang; Chunlong Hu

doi:10.1145/3450410

Corresponding author: Tong Zhang

Research output: Contribution to journal › Article › peer-review

10 Citations (Scopus)

Abstract

In this work, we propose a dual-stream structured graph convolution network (DS-SGCN) for skeleton-based action recognition. The spatio-temporal coordinates and appearance contexts of the skeletal joints are jointly integrated into graph convolution learning on both the video and skeleton modalities. To represent the skeletal graph of discrete joints effectively, we design a structured graph convolution module that encodes partitioned body parts along with their dynamic interactions in the spatio-temporal sequence. Specifically, we build a set of structured intra-part graphs, each of which represents a distinct body part (e.g., left arm, right leg, head). An inter-part graph is then constructed to model the dynamic interactions across body parts: each node corresponds to one of the intra-part graphs, and an edge between two nodes expresses the relationship between those parts in human movement. We apply graph convolution to both the intra- and inter-part graphs to capture the inherent characteristics and the dynamic interactions of human action, respectively. After integrating the intra- and inter-part levels of spatial context/coordinate cues, convolutional filtering is applied across time slices to capture the temporal dynamics of human motion. Finally, we fuse the two streams of graph convolution responses to predict the action category in an end-to-end fashion. Comprehensive experiments on five single/multi-modal benchmark datasets (NTU RGB+D 60, NTU RGB+D 120, MSR-Daily 3D, N-UCLA, and HDM05) demonstrate that the proposed DS-SGCN framework achieves encouraging performance on the skeleton-based action recognition task.
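The intra-part / inter-part idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 7-joint skeleton, the part partition, the bone list, the weight-free mean aggregation, the mean pooling, and the fixed temporal kernel are all assumptions chosen for illustration, and the appearance stream and the final fusion step are omitted.

```python
"""Toy sketch of intra-part and inter-part graph aggregation (illustrative only)."""

# Hypothetical 7-joint skeleton partitioned into three body parts.
PARTS = {
    "head": [0],
    "left_arm": [1, 2, 3],
    "right_arm": [4, 5, 6],
}
# Hypothetical bones: undirected edges between joint indices.
BONES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6)]


def mean_aggregate(features, neighbors):
    """Weight-free stand-in for graph convolution: average a node with its neighbors."""
    out = {}
    for node, feat in features.items():
        group = [feat] + [features[n] for n in neighbors.get(node, [])]
        out[node] = [sum(vals) / len(group) for vals in zip(*group)]
    return out


def intra_part_neighbors(part_joints, bones):
    """Keep only the bones whose two endpoints lie inside the same part."""
    inside = set(part_joints)
    nbrs = {j: [] for j in part_joints}
    for a, b in bones:
        if a in inside and b in inside:
            nbrs[a].append(b)
            nbrs[b].append(a)
    return nbrs


def inter_part_neighbors(parts, bones):
    """Connect two part nodes when at least one bone crosses between them."""
    owner = {j: name for name, joints in parts.items() for j in joints}
    nbrs = {name: [] for name in parts}
    for a, b in bones:
        pa, pb = owner[a], owner[b]
        if pa != pb and pb not in nbrs[pa]:
            nbrs[pa].append(pb)
            nbrs[pb].append(pa)
    return nbrs


def structured_step(joint_feats, parts=PARTS, bones=BONES):
    """Intra-part aggregation, pooling to part nodes, then inter-part aggregation."""
    part_feats = {}
    for name, joints in parts.items():
        local = {j: joint_feats[j] for j in joints}
        local = mean_aggregate(local, intra_part_neighbors(joints, bones))
        # Mean-pool the joints of a part into a single part-node feature.
        part_feats[name] = [sum(v) / len(joints) for v in zip(*local.values())]
    return mean_aggregate(part_feats, inter_part_neighbors(parts, bones))


def temporal_filter(frames, kernel=(0.25, 0.5, 0.25)):
    """Valid 1-D convolution over time slices, applied per part and per dimension."""
    k = len(kernel)
    out = []
    for t in range(len(frames) - k + 1):
        window = frames[t:t + k]
        out.append({
            name: [sum(w * f[name][d] for w, f in zip(kernel, window))
                   for d in range(len(window[0][name]))]
            for name in window[0]
        })
    return out


if __name__ == "__main__":
    # One-dimensional toy feature per joint: the joint index itself.
    feats = {j: [float(j)] for j in range(7)}
    frame = structured_step(feats)
    print(frame)
    print(temporal_filter([frame, frame, frame]))
```

A learned model would replace the plain averaging with trainable weights and would run a second stream on appearance features before fusing the two sets of responses; the sketch only shows how the part-level graph is derived from the joint-level one.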

Original language	English
Article number	3450410
Journal	ACM Transactions on Multimedia Computing, Communications and Applications
Volume	17
Issue number	4
DOIs	https://doi.org/10.1145/3450410
Publication status	Published - Nov 2021
Externally published	Yes

Keywords

Action recognition
Dual-stream structured graph convolution
Graph convolution network

Cite this

Xu, C., Liu, R., Zhang, T., Cui, Z., Yang, J., & Hu, C. (2021). Dual-stream structured graph convolution network for skeleton-based action recognition. ACM Transactions on Multimedia Computing, Communications and Applications, 17(4), Article 3450410. https://doi.org/10.1145/3450410

@article{25e311f67d9544ab90cff5d48c5aa22c,

title = "Dual-stream structured graph convolution network for skeleton-based action recognition",

abstract = "In this work, we propose a dual-stream structured graph convolution network (DS-SGCN) to solve the skeleton-based action recognition problem. The spatio-temporal coordinates and appearance contexts of the skeletal joints are jointly integrated into the graph convolution learning process on both the video and skeleton modalities. To effectively represent the skeletal graph of discrete joints, we create a structured graph convolution module specifically designed to encode partitioned body parts along with their dynamic interactions in the spatio-temporal sequence. In more detail, we build a set of structured intra-part graphs, each of which can be adopted to represent a distinctive body part (e.g., left arm, right leg, head). The inter-part graph is then constructed to model the dynamic interactions across different body parts; here each node corresponds to an intra-part graph built above, while an edge between two nodes is used to express these internal relationships of human movement. We implement the graph convolution learning on both intra- and inter-part graphs in order to obtain the inherent characteristics and dynamic interactions, respectively, of human action. After integrating the intra- and inter-levels of spatial context/coordinate cues, a convolution filtering process is conducted on time slices to capture these temporal dynamics of human motion. Finally, we fuse two streams of graph convolution responses in order to predict the category information of human action in an end-to-end fashion. Comprehensive experiments on five single/multi-modal benchmark datasets (including NTU RGB+D 60, NTU RGB+D 120, MSR-Daily 3D, N-UCLA, and HDM05) demonstrate that the proposed DS-SGCN framework achieves encouraging performance on the skeleton-based action recognition task.",

keywords = "Action recognition, Dual-stream structured graph convolution, Graph convolution network",

author = "Chunyan Xu and Rong Liu and Tong Zhang and Zhen Cui and Jian Yang and Chunlong Hu",

note = "Publisher Copyright: {\textcopyright} 2021 Association for Computing Machinery.",

year = "2021",

month = nov,

doi = "10.1145/3450410",

language = "English",

volume = "17",

journal = "ACM Transactions on Multimedia Computing, Communications and Applications",

issn = "1551-6857",

publisher = "Association for Computing Machinery (ACM)",

number = "4",

}

TY - JOUR

T1 - Dual-stream structured graph convolution network for skeleton-based action recognition

AU - Xu, Chunyan

AU - Liu, Rong

AU - Zhang, Tong

AU - Cui, Zhen

AU - Yang, Jian

AU - Hu, Chunlong

PY - 2021/11

Y1 - 2021/11

AB - In this work, we propose a dual-stream structured graph convolution network (DS-SGCN) to solve the skeleton-based action recognition problem. The spatio-temporal coordinates and appearance contexts of the skeletal joints are jointly integrated into the graph convolution learning process on both the video and skeleton modalities. To effectively represent the skeletal graph of discrete joints, we create a structured graph convolution module specifically designed to encode partitioned body parts along with their dynamic interactions in the spatio-temporal sequence. In more detail, we build a set of structured intra-part graphs, each of which can be adopted to represent a distinctive body part (e.g., left arm, right leg, head). The inter-part graph is then constructed to model the dynamic interactions across different body parts; here each node corresponds to an intra-part graph built above, while an edge between two nodes is used to express these internal relationships of human movement. We implement the graph convolution learning on both intra- and inter-part graphs in order to obtain the inherent characteristics and dynamic interactions, respectively, of human action. After integrating the intra- and inter-levels of spatial context/coordinate cues, a convolution filtering process is conducted on time slices to capture these temporal dynamics of human motion. Finally, we fuse two streams of graph convolution responses in order to predict the category information of human action in an end-to-end fashion. Comprehensive experiments on five single/multi-modal benchmark datasets (including NTU RGB+D 60, NTU RGB+D 120, MSR-Daily 3D, N-UCLA, and HDM05) demonstrate that the proposed DS-SGCN framework achieves encouraging performance on the skeleton-based action recognition task.

KW - Action recognition

KW - Dual-stream structured graph convolution

KW - Graph convolution network

UR - http://www.scopus.com/inward/record.url?scp=85123316823&partnerID=8YFLogxK

U2 - 10.1145/3450410

DO - 10.1145/3450410

M3 - Article

AN - SCOPUS:85123316823

SN - 1551-6857

VL - 17

JO - ACM Transactions on Multimedia Computing, Communications and Applications

JF - ACM Transactions on Multimedia Computing, Communications and Applications

IS - 4

M1 - 3450410

ER -
