Skeletal Spatial-Temporal Semantics Guided Homogeneous-Heterogeneous Multimodal Network for Action Recognition

Chenwei Zhang, Yuxuan Hu, Min Yang, Chengming Li*, Xiping Hu*

*Corresponding authors of this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-reviewed

1 Citation (Scopus)

Abstract

Action recognition research has gained significant attention with two dominant unimodal approaches: skeleton-based and RGB video-based. While the former is known for its robustness in complex backgrounds, the latter provides rich environmental information useful for context-based analysis. However, the fusion of these two modalities remains an open challenge. In this paper, we propose a Spatial Transformer & Selective Temporal encoder (ST&ST) for skeleton-based action recognition by constructing two modules: a Reranking-Enhanced Dynamic Mask Transformer (RE-DMT) and a Selective Kernel Temporal Convolution (SK-TC). The RE-DMT captures global spatial features, while its dynamic mask and reranking strategies reduce redundancy. The SK-TC captures both long-term and short-term temporal features and fuses them adaptively. Furthermore, we propose a two-phase Homogeneous-Heterogeneous Multimodal Network (HHMNet) for multi-modal action recognition. In the first phase, contrastive learning achieves implicit semantic fusion across the four homogeneous skeletal modalities (joint, bone, etc.). In the second phase, the heterogeneous modalities (skeleton & RGB video) are fused at three levels: model, feature, and decision. At the model level, the powerful skeleton-based model from the previous phase provides explicit attention guidance to the RGB video-based model. At the feature level, multi-part contrastive learning enables semantic distillation between heterogeneous modalities. At the decision level, ensemble learning combines their outputs for final action recognition. We evaluate the proposed ST&ST-guided HHMNet on the NTU RGB+D 60 & 120 and NW-UCLA datasets and demonstrate that it achieves state-of-the-art performance in both skeleton-based and multi-modal action recognition.
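The adaptive fusion of short- and long-term temporal features described for the SK-TC resembles the selective-kernel gating of Selective Kernel Networks. The sketch below illustrates that general idea on skeleton feature maps; the module name SKTemporalConv, the kernel sizes, the reduction ratio, and the (N, C, T, V) tensor layout are illustrative assumptions, not the authors' implementation.

# Minimal sketch of selective-kernel-style temporal fusion on skeleton features.
# Assumptions: PyTorch, input shaped (batch, channels, frames, joints).
import torch
import torch.nn as nn


class SKTemporalConv(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Short-term branch: small temporal kernel.
        self.short = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        # Long-term branch: larger dilated temporal kernel.
        self.long = nn.Conv2d(channels, channels, kernel_size=(5, 1),
                              padding=(4, 0), dilation=(2, 1))
        hidden = max(channels // reduction, 8)
        # Attention producing per-channel weights for the two branches.
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2 * channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, V) = (batch, channels, frames, joints)
        s, l = self.short(x), self.long(x)
        # Summarize the combined response over time and joints.
        ctx = (s + l).mean(dim=(2, 3))                    # (N, C)
        w = self.fc(ctx).view(x.size(0), 2, -1)           # (N, 2, C)
        w = torch.softmax(w, dim=1)                       # branch-wise weights
        w_s = w[:, 0].unsqueeze(-1).unsqueeze(-1)         # (N, C, 1, 1)
        w_l = w[:, 1].unsqueeze(-1).unsqueeze(-1)
        # Adaptive fusion of short- and long-term branches.
        return w_s * s + w_l * l


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 25)   # e.g., 25 joints per NTU skeleton
    print(SKTemporalConv(64)(feats).shape)   # torch.Size([2, 64, 32, 25])

The branch-wise softmax ensures the two temporal receptive fields compete per channel, so the network can emphasize short- or long-range motion cues depending on the input.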

Original language: English
Title of host publication: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 3657-3666
Number of pages: 10
ISBN (electronic): 9798400701085
DOI
Publication status: Published - 26 Oct 2023
Externally published: Yes
Event: 31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada
Duration: 29 Oct 2023 → 3 Nov 2023

Publication series

Name: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

Conference

Conference: 31st ACM International Conference on Multimedia, MM 2023
Country/Territory: Canada
City: Ottawa
Period: 29/10/23 → 3/11/23
