TY - CONF
T1 - Skeletal Spatial-Temporal Semantics Guided Homogeneous-Heterogeneous Multimodal Network for Action Recognition
AU - Zhang, Chenwei
AU - Hu, Yuxuan
AU - Yang, Min
AU - Li, Chengming
AU - Hu, Xiping
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/10/26
Y1 - 2023/10/26
AB - Action recognition research has gained significant attention, with two dominant unimodal approaches: skeleton-based and RGB video-based. While the former is known for its robustness in complex backgrounds, the latter provides rich environmental information useful for context-based analysis. However, the fusion of these two modalities remains an open challenge. In this paper, we propose a Spatial Transformer & Selective Temporal encoder (ST&ST) for skeleton-based action recognition, built from two modules: a Reranking-Enhanced Dynamic Mask Transformer (RE-DMT) and a Selective Kernel Temporal Convolution (SK-TC). The RE-DMT captures global spatial features, while its dynamic mask and reranking strategies reduce redundancy. The SK-TC captures both long-term and short-term temporal features and enables their adaptive fusion. Furthermore, we propose a two-phase Homogeneous-Heterogeneous Multimodal Network (HHMNet) for multi-modal action recognition. In the first phase, contrastive learning achieves implicit semantic fusion across the four homogeneous skeletal modalities (joint, bone, etc.). In the second phase, the heterogeneous modalities (skeleton & RGB video) are fused at three levels: model, feature, and decision. At the model level, the powerful skeleton-based model from the first phase provides explicit attention guidance to the RGB video-based model. At the feature level, multi-part contrastive learning enables semantic distillation between the heterogeneous modalities. At the decision level, ensemble learning combines their outputs for final action recognition. We evaluate the proposed ST&ST-guided HHMNet on the NTU RGB+D 60 & 120 and NW-UCLA datasets and show that it achieves state-of-the-art performance in both skeleton-based and multi-modal action recognition.
KW - action recognition
KW - contrastive learning
KW - multi-modal
UR - http://www.scopus.com/inward/record.url?scp=85179557071&partnerID=8YFLogxK
DO - 10.1145/3581783.3612560
M3 - Conference contribution
AN - SCOPUS:85179557071
T3 - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
SP - 3657
EP - 3666
BT - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc.
T2 - 31st ACM International Conference on Multimedia, MM 2023
Y2 - 29 October 2023 through 3 November 2023
ER -