TY - JOUR
T1 - Self-supervised method for 3D human pose estimation with consistent shape and viewpoint factorization
AU - Ma, Zhichao
AU - Li, Kan
AU - Li, Yang
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2023/2
Y1 - 2023/2
N2 - 3D human pose estimation from monocular images has shown great success due to sophisticated deep network architectures and large 3D human pose datasets. However, it remains an open problem when such datasets are unavailable, because estimating 3D human poses from monocular images is an ill-posed inverse problem. In this work, we propose a novel self-supervised method that effectively trains a 3D human pose estimation network without any extra 3D pose annotations. Unlike the commonly used GAN-based techniques, our method overcomes the projection ambiguity problem by fully disentangling the camera viewpoint information from the 3D human shape. Specifically, we design a factorization network to predict the coefficients of the canonical 3D human pose and the camera viewpoint in two separate channels. Here, we represent the canonical 3D human pose as a combination of pose bases from a dictionary. To guarantee consistent factorization, we design a simple yet effective loss function that takes advantage of multi-view information. In addition, to generate robust canonical reconstructions from the 3D pose coefficients, we exploit the underlying 3D geometry of human poses to learn a novel hierarchical dictionary from 2D poses. The hierarchical dictionary is more expressive of 3D poses than a traditional single-level dictionary. We comprehensively evaluate the proposed method on two public 3D human pose datasets, Human3.6M and MPI-INF-3DHP. The experimental results show that our method maximally disentangles 3D human shapes from camera viewpoints and reconstructs 3D human poses accurately. Moreover, our method achieves state-of-the-art results compared with recent weakly and self-supervised methods.
AB - 3D human pose estimation from monocular images has shown great success due to sophisticated deep network architectures and large 3D human pose datasets. However, it remains an open problem when such datasets are unavailable, because estimating 3D human poses from monocular images is an ill-posed inverse problem. In this work, we propose a novel self-supervised method that effectively trains a 3D human pose estimation network without any extra 3D pose annotations. Unlike the commonly used GAN-based techniques, our method overcomes the projection ambiguity problem by fully disentangling the camera viewpoint information from the 3D human shape. Specifically, we design a factorization network to predict the coefficients of the canonical 3D human pose and the camera viewpoint in two separate channels. Here, we represent the canonical 3D human pose as a combination of pose bases from a dictionary. To guarantee consistent factorization, we design a simple yet effective loss function that takes advantage of multi-view information. In addition, to generate robust canonical reconstructions from the 3D pose coefficients, we exploit the underlying 3D geometry of human poses to learn a novel hierarchical dictionary from 2D poses. The hierarchical dictionary is more expressive of 3D poses than a traditional single-level dictionary. We comprehensively evaluate the proposed method on two public 3D human pose datasets, Human3.6M and MPI-INF-3DHP. The experimental results show that our method maximally disentangles 3D human shapes from camera viewpoints and reconstructs 3D human poses accurately. Moreover, our method achieves state-of-the-art results compared with recent weakly and self-supervised methods.
KW - 3D pose estimation
KW - Consistent factorization
KW - Hierarchical dictionary
KW - Self-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85131323448&partnerID=8YFLogxK
U2 - 10.1007/s10489-022-03714-x
DO - 10.1007/s10489-022-03714-x
M3 - Article
AN - SCOPUS:85131323448
SN - 0924-669X
VL - 53
SP - 3864
EP - 3876
JO - Applied Intelligence
JF - Applied Intelligence
IS - 4
ER -