TY - JOUR
T1 - MoChat
T2 - Joints-Grouped Spatio-Temporal Grounding Multimodal Large Language Model for Multi-Turn Motion Comprehension and Description
AU - Mo, Jiawei
AU - Chen, Yixuan
AU - Lin, Rifen
AU - Ni, Yongkang
AU - Liang, Feng
AU - Zeng, Min
AU - Hu, Xiping
AU - Li, Min
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2025
Y1 - 2025
N2 - Despite continuous advancements in deep learning for understanding human motion, existing models often struggle to accurately identify action timing and specific body parts, typically supporting only single-round interaction. This limitation is particularly pronounced in home exercise monitoring, neurological disorder assessment, and rehabilitation, where precise motion analysis is crucial for ensuring exercise efficacy, detecting early signs of neurological conditions, and guiding personalized recovery programs. In this paper, we propose MoChat, a multimodal large language model capable of spatio-temporal grounding of human motion and multi-turn dialogue understanding. To achieve this, we first group spatial features in skeleton frames according to human anatomical structures and process them through a Joints-Grouped Skeleton Encoder. The encoder's outputs are fused with large language model embeddings to generate spatio-aware representations. A cross-attention-based Regression Head module is then designed to align hidden-layer embeddings with skeletal sequence embeddings, enabling precise temporal grounding. Furthermore, we develop a pipeline for the temporal grounding task to extract timestamps from skeleton-text pairs and construct multi-turn instruction dialogues for the spatial grounding task. Finally, various task instructions are generated for joint training. Experimental results demonstrate that MoChat achieves state-of-the-art performance across multiple metrics in motion understanding tasks, making it the first model capable of fine-grained spatio-temporal grounding of human motion.
AB - Despite continuous advancements in deep learning for understanding human motion, existing models often struggle to accurately identify action timing and specific body parts, typically supporting only single-round interaction. This limitation is particularly pronounced in home exercise monitoring, neurological disorder assessment, and rehabilitation, where precise motion analysis is crucial for ensuring exercise efficacy, detecting early signs of neurological conditions, and guiding personalized recovery programs. In this paper, we propose MoChat, a multimodal large language model capable of spatio-temporal grounding of human motion and multi-turn dialogue understanding. To achieve this, we first group spatial features in skeleton frames according to human anatomical structures and process them through a Joints-Grouped Skeleton Encoder. The encoder's outputs are fused with large language model embeddings to generate spatio-aware representations. A cross-attention-based Regression Head module is then designed to align hidden-layer embeddings with skeletal sequence embeddings, enabling precise temporal grounding. Furthermore, we develop a pipeline for the temporal grounding task to extract timestamps from skeleton-text pairs and construct multi-turn instruction dialogues for the spatial grounding task. Finally, various task instructions are generated for joint training. Experimental results demonstrate that MoChat achieves state-of-the-art performance across multiple metrics in motion understanding tasks, making it the first model capable of fine-grained spatio-temporal grounding of human motion.
KW - Large language model
KW - motion analysis
KW - multimodal
KW - skeleton
KW - spatiotemporal phenomena
UR - https://www.scopus.com/pages/publications/105021422891
U2 - 10.1109/JBHI.2025.3631045
DO - 10.1109/JBHI.2025.3631045
M3 - Article
C2 - 41212709
AN - SCOPUS:105021422891
SN - 2168-2194
JO - IEEE Journal of Biomedical and Health Informatics
JF - IEEE Journal of Biomedical and Health Informatics
ER -