A survey of human pose estimation: The body parts parsing based methods

https://doi.org/10.1016/j.jvcir.2015.06.013

Highlights

  • A summary of human pose estimation methods from recent years.

  • An overview of the traditional human pose estimation methods.

  • Methods are presented within a two-stage framework.

  • Comprehensive comparisons are given for the open source methods.

Abstract

Estimating human pose from videos and image sequences is not only an important computer vision problem but also plays a critical role in many real-world applications. The main challenges for human pose estimation are variations in body pose, complicated backgrounds and depth ambiguities. To address these problems, considerable research effort has been devoted to the related fields. In this survey, we focus on recent advances in vision-based human pose estimation. We first present a general framework for human pose estimation and then go through the latest technical progress at each stage. Finally, we discuss the limitations of existing approaches and outline future directions to be explored.

Introduction

Human pose estimation (HPE) is the process of inferring 2D or 3D human body part positions from still images or videos. Conventional HPE methods usually employ extra hardware devices to capture human poses and construct a human skeleton from the captured body joints. These methods are either expensive or inefficient. During the past decade, considerable research effort has been devoted to the HPE problem in the computer vision domain.
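
As a rough illustration of what constructing a skeleton from captured body joints can look like as a data structure, the sketch below stores 2D joint positions and a list of bones (pairs of joints), then computes each bone's length. The joint names, coordinates, bone list and the function `bone_lengths` are all hypothetical and are not taken from any method covered in this survey.

```python
import math

# Hypothetical 2D joint positions in pixel coordinates (x, y).
JOINTS = {
    "head": (100.0, 20.0),
    "neck": (100.0, 50.0),
    "left_shoulder": (80.0, 50.0),
    "left_elbow": (80.0, 90.0),
    "hip": (100.0, 130.0),
}

# Hypothetical skeleton: each bone connects two named joints.
BONES = [
    ("head", "neck"),
    ("neck", "left_shoulder"),
    ("left_shoulder", "left_elbow"),
    ("neck", "hip"),
]


def bone_lengths(joints, bones):
    """Return a dict mapping each bone (a, b) to its Euclidean length."""
    lengths = {}
    for a, b in bones:
        (xa, ya), (xb, yb) = joints[a], joints[b]
        lengths[(a, b)] = math.hypot(xb - xa, yb - ya)
    return lengths


if __name__ == "__main__":
    for bone, length in bone_lengths(JOINTS, BONES).items():
        print(bone, length)
```

A 3D skeleton would use the same structure with (x, y, z) tuples and a three-argument distance.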

Although previous studies have investigated human body part configuration, human body detection and human motion [1], a survey summarizing the most recent progress on body pose estimation is still lacking. In this survey, we mainly review the recent advances in vision-based human pose estimation. Human pose estimation touches nearly all the human-related problems in computer vision, ranging from whole-body pose parsing to detailed body part localization. As it is hard to cover all these fields within a single survey, we mainly focus on the body part parsing methods. For better comparison of different body part parsing methods, we divide them into four categories: 2D single person parsing in images, 2D multi-person parsing in images, 2D single person parsing in videos, and 3D single person parsing in images and videos. Moreover, we discuss the limitations of the existing approaches and outline future trends.

Human pose estimation techniques have become increasingly mature over the past decades. Because they are of great interest to many domains, new applications constantly emerge as the technology evolves. Human pose estimation is not only an important computer vision problem but also plays a critical role in a variety of real-world applications, including the following.

Video Surveillance. Video surveillance aims at tracking and monitoring the locations and motions of pedestrians in special circumstances. It is the earliest application area in which HPE technologies were used. Common settings include supermarkets and airport passageways.

Human–Computer Interaction (HCI). Advanced human–computer interaction systems based on human pose estimation have developed rapidly. In these systems, instructions can be analyzed accurately by capturing human body poses. In recent years, intelligent driving has emerged as a novel practical application.

Digital Entertainment. Digital entertainment, including computer games, computer animation and films, has become a huge industry and an active domain in recent years. For instance, people enjoy playing games driven by body sensors. Also, in the pre-production of the special effects for the movie Avatar [2], actors wore special equipment to animate the activities of the Avatars.

Medical Imaging. Human pose estimation has been widely used in automated medical applications. For example, HPE can help doctors check patients’ activities through remote monitoring, which greatly simplifies the therapeutic process.

Sports Scenes. In sports news and live broadcasts, human pose estimation is employed to track athletes’ locations and activities. Moreover, the estimated poses can be used to analyze the detailed movements of their actions.

Other applications include the military, children’s mental development, virtual reality, and so on. The related application fields of HPE are shown in Fig. 1.

In recent years, various devices and commercial systems have been released alongside HPE technology, including the Microsoft Kinect sensor [3], [4], Leap Motion [5], body-mounted cameras [6], 3D laser scanners [7] and infrared light sources [8]. These commercial systems have quite different implementation principles and application fields, as shown in Table 1.

Section snippets

Related surveys and overview

During the last decade, several surveys have been published to summarize the related work on human pose estimation. 3D HPE has attracted considerable attention in computer vision. For instance, Hen and Paramesran [13] summarize single camera 3D pose estimation from images, and Sminchisescu [14] aims to reconstruct 3D human poses from monocular video sequences. Wearable equipment makes it possible to estimate depth in motion capture; Helten et al. [15] review the depth camera based motion

Preprocessing work

The preprocessing stage for HPE includes camera calibration, foreground segmentation and human body detection. In this section we review the recent advances in these techniques.
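
Of the three preprocessing steps, foreground segmentation is the simplest to sketch. The example below is a minimal background-subtraction sketch, assuming a static camera and grayscale frames stored as nested lists of intensities. The function names `foreground_mask` and `bounding_box` and the threshold value are illustrative assumptions, not a technique reviewed in the survey; practical systems use far more robust background models.

```python
def foreground_mask(frame, background, threshold=25):
    """Mark a pixel as foreground (1) when it differs from the
    background image by more than `threshold`, else background (0)."""
    if len(frame) != len(background):
        raise ValueError("frame and background must have the same height")
    mask = []
    for frame_row, bg_row in zip(frame, background):
        if len(frame_row) != len(bg_row):
            raise ValueError("frame and background must have the same width")
        mask.append([1 if abs(p - b) > threshold else 0
                     for p, b in zip(frame_row, bg_row)])
    return mask


def bounding_box(mask):
    """Return (min_row, min_col, max_row, max_col) of the foreground
    pixels, or None when the mask contains no foreground."""
    coords = [(r, c) for r, row in enumerate(mask)
              for c, value in enumerate(row) if value]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))


# Hypothetical 2x3 grayscale images: an empty scene and a frame in
# which a bright object occupies the middle column.
background = [[10, 10, 10], [10, 10, 10]]
frame = [[12, 200, 10], [10, 180, 11]]

mask = foreground_mask(frame, background)
box = bounding_box(mask)
```

The bounding box of the mask gives a crude candidate region that a separate human body detector could then examine.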

Body parts parsing

Body parts parsing aims at locating different body parts in images and is the most important step in human pose estimation. In this section, we review the recent technical advances in parsing human body parts.

Body parsing methods vary from 2D body parsing to 3D body parsing, and from images to videos. For clarity, we divide these methods into four subcategories, which are single person parsing in single 2D images, single person parsing in 3D images/videos, single

Datasets

Due to the large variations across scenes, it is difficult to build a universal dataset for evaluating human pose estimation. Instead, researchers have created many datasets to evaluate their proposed techniques on specific tasks, which makes a fair comparison of different algorithms even harder.

We summarize the current publicly available datasets in Table 5. The HumanEva [90] dataset is made of a number of images capturing the synchronized people performing the

Future work and conclusion

Because it involves challenges that span most of the important topics in the computer vision domain, estimating human poses from images and videos remains hard. This survey summarizes the recent research efforts on this problem.

However, these technologies are still limited, especially for irregular poses. One future trend is to explore unsupervised or semi-supervised learning in body parts parsing. Over-segmentation is useful for preserving contour information, which makes it a promising preprocessing technique.

Acknowledgments

The authors thank the reviewers for their extensive and informative comments, which helped improve this manuscript. This work was supported in part by the National Natural Science Foundation of China under Grant 61103105 and the National High Technology Research and Development Program of China (2013AA040601).

References (104)

  • T. Shiratori, H.S. Park, L. Sigal, Y. Sheikh, J.K. Hodgins, Motion capture from body-mounted cameras, in: ACM SIGGRAPH...
  • N. Werghi, Segmentation and modeling of full human body shape from 3-d scan data: a survey, IEEE Trans. Syst. Man Cybern. Part C (2007)
  • A. Boyali, M. Kavakli, J. Twamley, Real time six degree of freedom pose estimation using infrared light sources and...
  • J. Tong et al., Scanning 3d full human bodies using kinects, IEEE Trans. Visual Comput. Graphics (2012)
  • J. Palacios et al., Human–computer interaction based on hand gestures using rgb-d sensors, Sensors (2013)
  • F. Weichert et al., Analysis of the accuracy and robustness of the leap motion controller, Sensors (2013)
  • F. Anderson et al., Lean on Wii: Physical rehabilitation with virtual reality and Wii peripherals, Annu. Rev. CyberTherapy Telemedicine (2010)
  • H.Y. Wooi, P. Raveendran, Single camera 3d human pose estimation: a review of current techniques, in: International...
  • C. Sminchisescu, 3d human motion analysis in monocular video techniques and challenges, in: Proceedings of the IEEE...
  • T. Helten, A. Baak, M. Müller, C. Theobalt, Full-body human motion capture from monocular depth images, in:...
  • M. Eichner et al., 2d articulated human pose estimation and retrieval in (almost) unconstrained still images, Int. J. Comput. Vision (2012)
  • Y. Yang et al., Articulated human detection with flexible mixtures of parts, IEEE Trans. Pattern Anal. Mach. Intell. (2013)
  • R. Klette, G. Tee, Understanding human motion: a historic review, in: Human Motion, vol. 36, 2008, pp....
  • J. Aggarwal, M. Ryoo, Human activity analysis: a review 43(3) (2011)...
  • M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, J. Gall, A survey on human motion analysis from depth data, in:...
  • A. Toshev, C. Szegedy, Deeppose: human pose estimation via deep neural networks, in: 2014 IEEE Conference on Computer...
  • W. Ouyang, X. Chu, X. Wang, Multi-source deep learning for human pose estimation, in: 2014 IEEE Conference on Computer...
  • C. Stoll, N. Hasler, J. Gall, H.-P. Seidel, C. Theobalt, Fast articulated motion tracking using a sums of gaussians...
  • G. Juergen, Y. Angela, L.J.V. Gool, 2D Action recognition serves 3D human pose estimation, in: European Conference on...
  • J. Shotton, A.W. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake, Real-time human pose...
  • C. Ennis, R. McDonnell, C. O’Sullivan, Seeing is believing: body motion dominates in multisensory conversations, in:...
  • A. Mykhaylo, R. Stefan, B. Schiele, Pictorial structures revisited: People detection and articulated pose estimation,...
  • B. Sapp, D. Weiss, B. Taskar, Parsing human motion with stretchable models, in: IEEE Conference on Computer Vision and...
  • M. Eichner, V. Ferrari, We are family: joint pose estimation of multiple persons, in: European Conference on Computer...
  • J. Kim, K. Grauman, Boundary preserving dense local regions, in: IEEE Conference on Computer Vision and Pattern...
  • C. Guillot, M. Taron, P. Sayd, Q.-C. Pham, C. Tilmant, J.-M. Lavest, Background subtraction adapted to ptz cameras by...
  • Y.J. Lee, J. Kim, K. Grauman, Key-segments for video object segmentation, in: IEEE International Conference on Computer...
  • P. Anestis, F. Vittorio, Fast object segmentation in unconstrained video, in: Proceedings of the International...
  • H. Wang, D. Koller, Multi-level inference by relaxed dual decomposition for human pose segmentation, in: IEEE...
  • J. Puwein, L. Ballan, R. Ziegler, M. Pollefeys, Foreground consistent human pose estimation using branch and bound, in:...
  • D. Stavens, S. Thrun, Unsupervised learning of invariant features using video, in: IEEE Conference on Computer Vision...
  • B. Sapp, A. Toshev, B. Taskar, Cascaded models for articulated pose estimation, in: IEEE Conference on Computer Vision...
  • X. Chen, A.L. Yuille, Articulated pose estimation by a graphical model with image dependent pairwise relations,...
  • J. Tompson, A. Jain, Y. LeCun, C. Bregler, Joint training of a convolutional network and a graphical model for human...
  • L.D. Bourdev, J. Malik, Poselets: Body part detectors trained using 3d human pose annotations, in: IEEE International...
  • F. Wang, Y. Li, Beyond physical connections: Tree models in human pose estimation, in: IEEE Conference on Computer...
  • Y. Wang, D. Tran, Z. Liao, Learning hierarchical poselets for human parsing, in: IEEE Conference on Computer Vision and...
  • D. Tran, Y. Wang, D.A. Forsyth, Human parsing with a cascade of hierarchical poselet based pruners, in: International...
  • P. Srinivasan, J. Shi, Bottom-up recognition and parsing of the human body, in: A.L. Yuille, S.C. Zhu, D. Cremers, Y....

This paper has been recommended for acceptance by M.T. Sun.