Exploring temporal consistency for human pose estimation in videos

Yang Li, Kan Li*, Xinxin Wang, Richard Yi Da Xu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

15 Citations (Scopus)

Abstract

In this paper, we introduce a method that exploits temporal information for estimating human poses in videos. Current state-of-the-art methods that use temporal information fall into two major branches. The first comprises model-based methods, which capture temporal information entirely through a learnable function such as an RNN or 3D convolution. However, these methods are limited in exploring temporal consistency, which is essential for estimating human joint positions in videos. The second comprises posterior-enhancement methods, in which an independent post-processing step (e.g., optical flow) is applied to refine the predictions. However, operations such as optical flow estimation are susceptible to occlusion and motion blur, which adversely affect the final performance. We propose a novel Temporal Consistency Exploration (TCE) module to address both shortcomings. Compared with previous approaches, the TCE module is more efficient because it captures temporal consistency at the feature level, without post-processing or computing extra optical flow. Further, to capture the rich spatial context in video data, we design a multi-scale TCE that explores temporal consistency at multiple spatial scales. Finally, we design a video-based pose estimation network built on an encoder-decoder architecture and extended with the powerful multi-scale TCE module. We comprehensively evaluate the proposed model on two video datasets, Sub-JHMDB and Penn, and it achieves state-of-the-art performance on both.
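To make the feature-level idea concrete, below is a minimal sketch, not the authors' TCE design: it only illustrates how per-frame backbone features could be fused with their temporal neighbours directly in feature space, with no optical-flow post-processing. The module name, layer sizes, and fusion scheme are assumptions for illustration; the actual TCE and multi-scale TCE differ in detail.

```python
# Hypothetical feature-level temporal fusion sketch (PyTorch), assuming
# per-frame feature maps of shape (batch, time, channels, height, width)
# produced by a shared single-frame backbone.
import torch
import torch.nn as nn


class FeatureLevelTemporalFusion(nn.Module):
    """Refine each frame's features using the previous and next frames'
    features, so predicted joint heatmaps vary smoothly across the video."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W). Build edge-replicated neighbour features.
        prev = torch.cat([feats[:, :1], feats[:, :-1]], dim=1)  # frame t-1
        nxt = torch.cat([feats[:, 1:], feats[:, -1:]], dim=1)   # frame t+1
        b, t, c, h, w = feats.shape
        stacked = torch.cat([prev, feats, nxt], dim=2).view(b * t, 3 * c, h, w)
        # Residual refinement keeps each frame's original evidence.
        return feats + self.fuse(stacked).view(b, t, c, h, w)


if __name__ == "__main__":
    block = FeatureLevelTemporalFusion(channels=64)
    clip_feats = torch.randn(2, 5, 64, 32, 32)  # 2 clips, 5 frames each
    print(block(clip_feats).shape)              # torch.Size([2, 5, 64, 32, 32])
```

In this sketch, the temporal fusion happens before the decoder produces heatmaps, which is what distinguishes the feature-level approach from optical-flow-based posterior enhancement; a multi-scale variant would apply such fusion at several spatial resolutions of the encoder.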

Original language: English
Article number: 107258
Journal: Pattern Recognition
Volume: 103
DOI
Publication status: Published - Jul 2020
