MoChat: Joints-Grouped Spatio-Temporal Grounding Multimodal Large Language Model for Multi-Turn Motion Comprehension and Description

  • Jiawei Mo
  • Yixuan Chen
  • Rifen Lin
  • Yongkang Ni
  • Feng Liang
  • Min Zeng
  • Xiping Hu
  • Min Li*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Despite continuous advancements in deep learning for understanding human motion, existing models often struggle to accurately identify action timing and specific body parts, and typically support only single-round interaction. This limitation is particularly pronounced in home exercise monitoring, neurological disorder assessment, and rehabilitation, where precise motion analysis is crucial for ensuring exercise efficacy, detecting early signs of neurological conditions, and guiding personalized recovery programs. In this paper, we propose MoChat, a multimodal large language model capable of spatio-temporal grounding of human motion and multi-turn dialogue understanding. To achieve this, we first group spatial features in skeleton frames according to human anatomical structures and process them through a Joints-Grouped Skeleton Encoder. The encoder's outputs are fused with large language model embeddings to generate spatio-aware representations. A cross-attention-based Regression Head module is then designed to align hidden-layer embeddings with skeletal sequence embeddings, enabling precise temporal grounding. Furthermore, we develop a pipeline for the temporal grounding task to extract timestamps from skeleton-text pairs, and construct multi-turn instruction dialogues for the spatial grounding task. Finally, various task instructions are generated for joint training. Experimental results demonstrate that MoChat achieves state-of-the-art performance across multiple metrics in motion understanding tasks, making it the first model capable of fine-grained spatio-temporal grounding of human motion.
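The core idea of the Joints-Grouped Skeleton Encoder — partitioning skeleton joints by anatomical region and encoding each group separately before fusion — can be illustrated with a minimal sketch. The grouping below (torso, arms, legs over a 22-joint skeleton), the group indices, and the random linear projections are all illustrative assumptions, not the paper's actual architecture or joint layout:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical anatomical grouping of a 22-joint skeleton
# (indices are illustrative, not taken from the paper).
JOINT_GROUPS = {
    "torso":     [0, 3, 6, 9, 12, 15],
    "left_arm":  [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg":  [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}
HIDDEN = 64

# One stand-in linear projection per group (a real encoder would learn these).
PROJ = {name: rng.standard_normal((len(idx) * 3, HIDDEN))
        for name, idx in JOINT_GROUPS.items()}

def joints_grouped_encode(skeleton):
    """skeleton: (T, 22, 3) array of per-frame 3D joint positions.
    Returns (T, 5 * HIDDEN): per-group features encoded separately,
    then concatenated along the feature axis."""
    frames = skeleton.shape[0]
    feats = [skeleton[:, idx, :].reshape(frames, -1) @ PROJ[name]
             for name, idx in JOINT_GROUPS.items()]
    return np.concatenate(feats, axis=-1)

frames = rng.standard_normal((16, 22, 3))   # 16 frames of dummy motion
print(joints_grouped_encode(frames).shape)  # (16, 320)
```

Grouping before encoding keeps each body part's features spatially coherent, which is what lets the downstream language model ground descriptions to specific limbs rather than to the whole pose vector.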

Original language: English
Journal: IEEE Journal of Biomedical and Health Informatics
DOIs
Publication status: Accepted/In press - 2025
Externally published: Yes

Keywords

  • Large language model
  • motion analysis
  • multimodal
  • skeleton
  • spatiotemporal phenomena

