Abstract
Convolutional neural networks (CNNs) have recently driven tremendous progress in human pose estimation. However, current methods still struggle with severe occlusion, back views, and large pose variation because they do not consider the spatial relationships between joints, which provide strong cues for localizing hidden keypoints. In this work, we design a Structural Pose Network (SPN) that takes full advantage of joint structure for human pose estimation in unconstrained environments. Specifically, the proposed model is composed of two subnets: a Structure Residual Network (SRN) and a Structure Improving Network (SIN). Given an input image, SRN first captures rich joint structure as priors through a multi-branch feature extraction module, followed by an hourglass network with pyramid residual units that enlarges the receptive field and yields further structural feature representations. SIN, based on coordinate regression, optimizes the spatial relationships among joints via an attention mechanism, thus refining the initial prediction from SRN. In addition, we propose a novel structure-consistency constraint, which maintains the structural consistency between joints and body parts by estimating whether each joint is located in its corresponding part. At the same time, an online hard regions mining (OHRM) strategy is introduced to drive the network to pay appropriate attention to different body parts. Experimental results on three challenging datasets show that our method outperforms other state-of-the-art algorithms.
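The abstract does not give the formulation of the structure-consistency constraint or of OHRM, so the following is only a minimal illustrative sketch of the two ideas, not the authors' method. It assumes binary body-part masks, a hand-written joint-to-part mapping, and a top-k selection rule for hard regions; the names `joint_in_part`, `consistency_penalty`, and `hard_region_weights` are all hypothetical.

```python
from typing import Dict, Sequence, Tuple

Point = Tuple[int, int]          # (x, y) pixel coordinates
Mask = Sequence[Sequence[int]]   # binary part mask, indexed as mask[y][x]


def joint_in_part(joint: Point, mask: Mask) -> bool:
    """Return True if the joint lies on a foreground pixel of the part mask."""
    x, y = joint
    if y < 0 or y >= len(mask) or x < 0 or x >= len(mask[0]):
        return False  # joints outside the image cannot be inside the part
    return mask[y][x] == 1


def consistency_penalty(joints: Dict[str, Point],
                        part_masks: Dict[str, Mask],
                        joint_to_part: Dict[str, str]) -> float:
    """Fraction of joints that fall outside their assigned body part.

    Assumption: a simple miss rate stands in for the paper's constraint.
    """
    if not joints:
        return 0.0
    misses = sum(
        not joint_in_part(joints[name], part_masks[joint_to_part[name]])
        for name in joints
    )
    return misses / len(joints)


def hard_region_weights(part_losses: Dict[str, float],
                        top_k: int) -> Dict[str, float]:
    """Give weight 1.0 to the top_k highest-loss parts and 0.0 to the rest.

    Assumption: top-k selection is one common form of online hard mining;
    the paper may weight regions differently.
    """
    hardest = sorted(part_losses, key=part_losses.get, reverse=True)[:top_k]
    return {part: (1.0 if part in hardest else 0.0) for part in part_losses}
```

In a training loop, a penalty of this kind would be added to the keypoint loss, and the per-part weights would be recomputed on every batch so that the parts with the largest current error dominate the gradient.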
Index Terms
- Learning Joint Structure for Human Pose Estimation