research-article

Learning Joint Structure for Human Pose Estimation

Published: 05 July 2020

Abstract

Recently, tremendous progress has been achieved on human pose estimation with the development of convolutional neural networks (CNNs). However, current methods still struggle with severe occlusion, back views, and large pose variation because they do not consider the spatial relationships between different joints, which can provide strong cues for localizing hidden keypoints. In this work, we design a Structural Pose Network (SPN) to take full advantage of joint structure for human pose estimation in unconstrained environments. Specifically, the proposed model is composed of two subnets: a Structure Residual Network (SRN) and a Structure Improving Network (SIN). Given an input image, SRN first captures rich joint structure as priors through a multi-branch feature extraction module, followed by an hourglass network with pyramid residual units that enlarges the receptive field and further obtains structural feature representations. SIN, based on coordinate regression, optimizes the spatial relationships between different joints via an attention mechanism, thus refining the initial prediction from SRN. In addition, we propose a novel structure-consistency constraint, which maintains the structural consistency between the joints and body parts by estimating whether the joints are located in their corresponding parts. At the same time, an online hard regions mining (OHRM) strategy is introduced to drive the network to pay corresponding attention to different body parts. The experimental results on three challenging datasets show that our method outperforms other state-of-the-art algorithms.
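The abstract does not say how the structure-consistency constraint or the OHRM strategy is formulated, so the following is only an illustrative sketch of the two ideas and not the authors' implementation. It assumes that each body part is approximated by an axis-aligned box, that consistency is scored as the fraction of joints falling outside their assigned part, and that hard-region mining keeps only the `top_k` parts with the largest loss. All function names, part names, and parameters here are hypothetical.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), an assumed part shape


def joint_in_part(joint: Point, part_box: Box) -> bool:
    """Return True if the joint lies inside the rectangular part region."""
    x, y = joint
    x_min, y_min, x_max, y_max = part_box
    return x_min <= x <= x_max and y_min <= y <= y_max


def structure_consistency_penalty(
    joints: Dict[str, Point],
    part_boxes: Dict[str, Box],
    joint_to_part: Dict[str, str],
) -> float:
    """Fraction of joints that fall outside their assigned part (assumed scoring rule)."""
    misses = sum(
        0 if joint_in_part(joints[name], part_boxes[part]) else 1
        for name, part in joint_to_part.items()
    )
    return misses / len(joint_to_part)


def hard_region_loss(part_losses: Dict[str, float], top_k: int) -> float:
    """Average the losses of the top_k hardest parts only (assumed mining rule)."""
    hardest = sorted(part_losses.values(), reverse=True)[:top_k]
    return sum(hardest) / len(hardest)


# Toy example: the wrist lies inside its part box, the elbow does not.
joints = {"left_wrist": (12.0, 30.0), "left_elbow": (40.0, 40.0)}
part_boxes = {
    "left_lower_arm": (10.0, 20.0, 20.0, 40.0),
    "left_upper_arm": (15.0, 5.0, 25.0, 25.0),
}
joint_to_part = {"left_wrist": "left_lower_arm", "left_elbow": "left_upper_arm"}

penalty = structure_consistency_penalty(joints, part_boxes, joint_to_part)
mined = hard_region_loss({"head": 0.25, "arms": 1.0, "legs": 0.5}, top_k=2)
```

In this toy setup one of the two joints violates its part assignment, and only the two parts with the largest losses contribute to the mined loss, so the easiest part (the head) is ignored for that step.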

Published in

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 16, Issue 3
August 2020
364 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3409646

Copyright © 2020 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 5 July 2020
• Online AM: 7 May 2020
• Accepted: 1 April 2020
• Revised: 1 February 2020
• Received: 1 June 2019

Qualifiers

• research-article
• Research
• Refereed