AITEPose: Learning an End-to-End Monocular 3D Human Pose Estimator via Auxiliary-Information-Driven Training Enhancement

Bowei Xie, Geyuan Liu, Fang Deng, Maobin Lu*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

3D human pose estimation (3DHPE) from a single monocular RGB image is fundamental to many image-related fields, such as virtual reality, motion analysis, and human-computer interaction. To improve estimation accuracy, existing works typically integrate complex networks or divide monocular 3DHPE into multiple stages. However, complicating the estimation process in this way sacrifices estimation speed and limits applicability. To alleviate this, we propose AITEPose, an end-to-end model that achieves higher monocular 3DHPE accuracy with a simpler model structure. Specifically, inspired by online knowledge distillation, we design an Auxiliary-Information-Driven Training Enhancement (AITE) framework. In the AITE framework, an adjustment network is introduced between the prediction network and the loss function during training to incorporate auxiliary information and enhance the training process. Notably, the adjustment network is constructed from a novel cascaded Disturbance-Correction Module (DCM), which adjusts the poses based on ground-truth bone lengths to obtain more accurate results. Both AITE and DCM are employed only during training, thereby improving training outcomes without complicating the inference process. AITEPose achieves state-of-the-art performance for single-frame monocular 3DHPE on Human3.6M, the most comprehensive dataset. To further validate the effectiveness of AITE and DCM, we design a monocular 2DHPE model, AITEPose2D, and conduct extensive ablation experiments on the COCO2017 dataset, demonstrating the robustness and generalizability of the proposed approach.
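The abstract does not describe the internals of the DCM, so the following is only a loose illustration of what "adjusting poses based on ground-truth bone lengths" could look like in its simplest form. It is not the authors' method: it swaps in plain bone-length rescaling along a kinematic tree, where each predicted bone keeps its direction but takes the ground-truth length. The 5-joint skeleton, the `PARENTS` array, and both function names are hypothetical and chosen for illustration.

```python
import numpy as np

# Hypothetical 5-joint skeleton: PARENTS[j] is the parent of joint j, -1 marks the root.
# Assumption: joints are ordered so that every parent precedes its children.
PARENTS = [-1, 0, 1, 0, 3]


def bone_lengths(pose, parents):
    """Return the length of the bone ending at each joint (0 for the root)."""
    pose = np.asarray(pose, dtype=float)
    lengths = np.zeros(len(parents))
    for j, p in enumerate(parents):
        if p >= 0:
            lengths[j] = np.linalg.norm(pose[j] - pose[p])
    return lengths


def rescale_to_bone_lengths(pred, target_lengths, parents, eps=1e-8):
    """Keep each predicted bone direction but set its length to the target value.

    This is simple bone-length rescaling, not the DCM from the paper.
    """
    pred = np.asarray(pred, dtype=float)
    adjusted = pred.copy()
    for j, p in enumerate(parents):
        if p < 0:
            continue  # the root joint stays where it was predicted
        direction = pred[j] - pred[p]
        direction = direction / (np.linalg.norm(direction) + eps)
        # adjusted[p] is already final because parents come before children
        adjusted[j] = adjusted[p] + target_lengths[j] * direction
    return adjusted


# Synthetic data for illustration only: a toy ground-truth pose plus noise.
gt = np.array([[0, 0, 0], [0, 1, 0], [0, 2, 0], [1, 0, 0], [2, 0, 0]], dtype=float)
rng = np.random.default_rng(0)
pred = gt + 0.1 * rng.standard_normal(gt.shape)

adjusted = rescale_to_bone_lengths(pred, bone_lengths(gt, PARENTS), PARENTS)
```

Because ground-truth bone lengths are only available for training data, a step of this kind would naturally sit between the prediction network and the loss during training and be dropped at inference, which matches the training-only role the abstract assigns to AITE and DCM.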

Original language: English
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Publication status: Accepted/In press - 2025

Keywords

  • monocular 3D human pose estimation
  • training enhancement
