TY - JOUR
T1 - AITEPose
T2 - Learning an End-to-End Monocular 3D Human Pose Estimator via Auxiliary-Information-Driven Training Enhancement
AU - Xie, Bowei
AU - Liu, Geyuan
AU - Deng, Fang
AU - Lu, Maobin
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - 3D human pose estimation (3DHPE) from a single monocular RGB image is fundamental to many image-related fields, such as virtual reality, motion analysis, and human-computer interaction. To improve estimation accuracy, existing works typically integrate complex networks or divide monocular 3DHPE into multiple stages. However, complicating the estimation process to improve accuracy sacrifices estimation speed and limits practical application. To alleviate this, we propose AITEPose, an end-to-end model that achieves higher monocular 3DHPE accuracy with a simpler model structure. Specifically, inspired by online knowledge distillation, we design an Auxiliary-Information-Driven Training Enhancement (AITE) framework. In the AITE framework, an adjustment network is introduced during training between the prediction network and the loss function to incorporate auxiliary information and enhance the training process. Notably, the adjustment network is constructed from a novel cascaded Disturbance-Correction Module (DCM), which adjusts predicted poses toward more accurate results based on ground-truth bone lengths. Both AITE and the DCM are employed only during training, thereby improving training outcomes without complicating the inference process. The AITEPose model achieves state-of-the-art performance for single-frame monocular 3DHPE on the most comprehensive dataset, Human3.6M. To further validate the effectiveness of AITE and the DCM, we design a monocular 2DHPE model, AITEPose2D, and conduct extensive ablation experiments on the COCO2017 dataset, demonstrating the robustness and generalizability of the proposed AITEPose.
KW - monocular 3D human pose estimation
KW - training enhancement
UR - http://www.scopus.com/inward/record.url?scp=105005540534&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2025.3570967
DO - 10.1109/TCSVT.2025.3570967
M3 - Article
AN - SCOPUS:105005540534
SN - 1051-8215
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
ER -