TY - JOUR
T1 - MoViSense
T2 - Multiview Spatiotemporal Transformer for 3-D Human Kinematics Sensing
AU - Zheng, Dezhi
AU - Yang, Zhiyi
AU - Liu, Wenfeng
AU - Liang, Xiao
AU - Hu, Chun
AU - Ma, Kang
N1 - Publisher Copyright:
© 2001-2012 IEEE.
PY - 2026/5/1
Y1 - 2026/5/1
N2 - Human pose estimation has wide applications in health monitoring, disease diagnosis, and motion rehabilitation. These applications rely on motion assessment using kinematic parameters, which can be obtained from the coordinates of human body key points. Most current noncontact key point measurement methods rely on multiple cameras to reconstruct the human body in realistic multiperson interaction and occlusion scenarios. However, sparse camera configurations often limit pose estimation accuracy, while increasing the number of viewpoints to support high-accuracy kinematic analysis reduces efficiency. In this study, MoViSense is proposed for 3-D human pose estimation and kinematic analysis under sparse camera configurations by exploiting spatial and temporal continuity. Based on the Transformer, the encoder integrates a multiscale gated feedforward (MSFF) module to enhance spatial representations and cross-view alignment, while a dynamic history fusion-deformable multiscale attention (DHF-DMA) module uses the temporal continuity of human motion to improve robustness under occlusion. In addition, a biomechanical constraint mechanism (BCM) enforces bone-length consistency, and a motion kinetics extractor (MKE) converts the estimated 3-D key points into interpretable kinematic parameters. Experiments on the CMU Panoptic dataset show that MoViSense achieves an AP50 of 86.49 and an MPJPE of 27.85 mm, outperforming other representative methods under sparse camera configurations. The relative deviation (RD) of stride length was 5.37%, and the RD of cadence was 1.45%.
KW - Kinematic analysis
KW - Transformer
KW - motion assessment
KW - multiview 3-D human pose estimation
UR - https://www.scopus.com/pages/publications/105039184950
U2 - 10.1109/JSEN.2026.3682936
DO - 10.1109/JSEN.2026.3682936
M3 - Article
AN - SCOPUS:105039184950
SN - 1530-437X
VL - 26
SP - 16027
EP - 16036
JO - IEEE Sensors Journal
JF - IEEE Sensors Journal
IS - 10
ER -