Learning Joint Structure for Human Pose Estimation

Authors:
Shenming Feng
School of Electronics and Information Technology, Sun Yat-Sen University, Guangzhou, Guangdong, People's Republic of China

Haifeng Hu (ORCID: 0000-0002-4884-323X)
School of Electronics and Information Technology, Sun Yat-Sen University, Guangzhou, Guangdong, People's Republic of China

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 3, Article No. 85, pp. 1–17. https://doi.org/10.1145/3392302

Published: 05 July 2020

Abstract

Recently, tremendous progress has been achieved on human pose estimation with the development of convolutional neural networks (CNNs). However, current methods still struggle with severe occlusion, back views, and large pose variation because they do not consider the spatial relationships between different joints, which can provide strong cues for localizing hidden keypoints. In this work, we design a Structural Pose Network (SPN) to take full advantage of joint structure for human pose estimation in unconstrained environments. Specifically, the proposed model is composed of two subnets: a Structure Residual Network (SRN) and a Structure Improving Network (SIN). Given an input image, SRN first captures rich joint structure as priors through a multi-branch feature extraction module, followed by an hourglass network with pyramid residual units that enlarges the receptive field and extracts further structural feature representations. SIN, based on coordinate regression, optimizes the spatial relationship of different joints via the attention mechanism, thus refining the initial prediction from SRN. In addition, we propose a novel structure-consistency constraint, which maintains the structural consistency between the joints and body parts by estimating whether the joints are located in their corresponding parts. At the same time, an online hard regions mining (OHRM) strategy is introduced to drive the network to pay corresponding attention to different body parts. The experimental results on three challenging datasets show that our method outperforms other state-of-the-art algorithms.
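
The abstract names two ideas without giving their formulation: a structure-consistency constraint (checking whether each joint lies in its corresponding body part) and online hard regions mining. The following is only a minimal sketch of what such checks could look like, not the authors' implementation. Everything in it is an assumption made for illustration: body parts are modeled as axis-aligned boxes, the function names `joint_in_part`, `structure_consistency_penalty`, and `hard_region_loss` are hypothetical, and the mining rule (averaging the `top_k` largest per-part losses) is borrowed from the general idea of online hard example mining.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]
# Assumed part representation: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def joint_in_part(joint: Point, box: Box) -> bool:
    """Return True if the joint lies inside (or on the border of) the part box."""
    x, y = joint
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def structure_consistency_penalty(
    joints: Dict[str, Point],
    part_boxes: Dict[str, Box],
    joint_to_part: Dict[str, str],
) -> float:
    """Fraction of joints that fall outside their assigned body part."""
    if not joints:
        return 0.0
    violations = sum(
        1
        for name, point in joints.items()
        if not joint_in_part(point, part_boxes[joint_to_part[name]])
    )
    return violations / len(joints)


def hard_region_loss(part_losses: Dict[str, float], top_k: int) -> float:
    """Average the losses of the top_k hardest (highest-loss) body parts."""
    if top_k <= 0 or not part_losses:
        raise ValueError("need at least one part and top_k >= 1")
    hardest = sorted(part_losses.values(), reverse=True)[:top_k]
    return sum(hardest) / len(hardest)


# Hypothetical example data: one arm region and two predicted joints.
joints = {"left_elbow": (12.0, 30.0), "left_wrist": (40.0, 55.0)}
part_boxes = {"left_arm": (5.0, 20.0, 25.0, 60.0)}
joint_to_part = {"left_elbow": "left_arm", "left_wrist": "left_arm"}

# The wrist lies outside the arm box, so one of two joints is inconsistent.
penalty = structure_consistency_penalty(joints, part_boxes, joint_to_part)
print(penalty)  # 0.5

# Only the two hardest parts (left_arm and right_leg) contribute to the loss.
part_losses = {"head": 0.125, "torso": 0.25, "left_arm": 1.0, "right_leg": 0.5}
print(hard_region_loss(part_losses, top_k=2))  # 0.75
```

In a trainable network the hard inside/outside test would have to be replaced by a differentiable surrogate, and the part regions would come from predictions or annotations rather than fixed boxes; the sketch only shows the bookkeeping.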

Index Terms

Learning Joint Structure for Human Pose Estimation
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision representations
        Image representations
    2. Natural language processing
      1. Natural language generation
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory

Published in
ACM Transactions on Multimedia Computing, Communications, and Applications Volume 16, Issue 3
August 2020
364 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3409646
Editor:
Alberto Del Bimbo
University of Firenze, Italy
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 5 July 2020
- Online AM: 7 May 2020
- Accepted: 1 April 2020
- Revised: 1 February 2020
- Received: 1 June 2019
Author Tags
Learning joint structure
heatmap and offset estimation
single person pose estimation
structural consistency
Qualifiers
- research-article
- Research
- Refereed
