A comprehensive survey on 2D multi-person pose estimation methods

https://doi.org/10.1016/j.engappai.2021.104260

Highlights

  • Illustrate and analyze popular multi-person pose estimation methods.

  • Compare the pros and cons of popular methods.

  • Introduce commonly used datasets, evaluation metrics, and open-source systems.

  • Summarize the development of multi-person pose estimation.

Abstract

Human pose estimation is a fundamental yet challenging computer vision task that has been studied intensively in recent years. As a basic task in computer vision, multi-person pose estimation is the core component of many practical applications. This paper extensively reviews recent work on multi-person pose estimation. Specifically, we illustrate and analyze popular methods in detail and compare their pros and cons to fill the gaps left by other surveys. In addition, the commonly used datasets, evaluation metrics, and open-source systems are introduced. Finally, we summarize the development of multi-person pose estimation frameworks and discuss research trends.

Introduction

Human pose estimation, a fundamental yet challenging computer vision task, aims to recover body postures from images. Multi-person Pose Estimation (MPE) captures the poses of multiple persons simultaneously and is one of the most important subfields of human pose estimation. It can be applied to many fields such as action recognition (Chéron et al., 2015), person re-identification (Qian et al., 2018), pedestrian tracking (Andriluka et al., 2010), animation production and digital film-making (Siarohin et al., 2019), virtual reality (Pavlakos et al., 2019), human–computer interaction (Weidenbacher et al., 2006), video surveillance (Hattori et al., 2018), self-driving (Murphy-Chutorian et al., 2007), sports motion analysis (Martinez et al., 2017), etc.

The rapid development of deep learning has brought remarkable achievements in computer vision and has been introduced into various fields, e.g., image classification (Krizhevsky et al., 2012), object detection (Ren et al., 2015, Redmon et al., 2016), game playing (Mnih et al., 2015), robotic navigation (Mirowski et al., 2018), etc. MPE has also benefited greatly from this development. Although there are some existing surveys (Perez-Sala et al., 2014, Gong et al., 2016, Sarafianos et al., 2016, Presti and Cascia, 2016, Wang et al., 2018, Chen et al., 2020) on human pose estimation, a survey summarizing the most recent MPE achievements in detail is still lacking. Perez-Sala et al. (2014), Aggarwal and Cai (1999), Gavrila (1999), Ji and Liu (2010), Holte et al. (2012) and Moeslund et al., 2006, Moeslund and Granum, 2001 introduced early model-based works that model the human body as pictorial structures. Chen et al. (2013) and Wang et al. (2018) reviewed works on human motion analysis with depth imagery. Sarafianos et al. (2016) and Presti and Cascia (2016) discussed 3D human action classification using MPE. Chen et al. (2020) presented monocular human pose estimation from the single-person perspective to the multi-person perspective and from 2D pose estimation to 3D pose estimation. However, none of these works provides a comprehensive retrospective on MPE. Furthermore, existing surveys do not cover state-of-the-art methods such as Zhang et al. (2019b), Sun et al. (2018a) and Cheng et al. (2020). Our survey gives a clear and holistic view of MPE. We analyze the characteristics of these methods, capture the subtle differences between them, and highlight their advantages and motivations. In addition, the commonly used datasets, metrics, and open-source systems are also given.

The remainder of this paper is organized as follows. Section 2 introduces the different categories of multi-person pose estimation methods. In this paper, we introduce recent works from two aspects: two-stage MPE and one-stage MPE. In Sections 3 and 4, two-stage MPE methods, including top-down and bottom-up approaches, are illustrated systematically and comprehensively according to their research motivations, such as multi-scale feature learning, data preprocessing, quantization error, etc. Section 5 presents research on one-stage MPE, which attempts to strike a balance between speed and accuracy. The commonly used datasets and metrics for MPE are illustrated in Section 6, and several open-source systems are given in Section 7. Finally, we summarize two-stage and one-stage MPE methods and discuss future research directions to inspire new ideas in the MPE field.


Categories of multi-person pose estimation

Generally, MPE methods can be divided into three categories according to different classification criteria: (1) deep learning-based vs. model-based, (2) two-stage vs. one-stage, (3) graph-free vs. graph matching-based.

Deep Learning-based vs. Model-based: The difference between deep learning-based and model-based methods is whether we define an explicit hand-crafted model to estimate human poses or not. In Fig. 1(a), traditional model-based methods (Andriluka et al., 2009, Fischler and

Two-stage: top-down methods

Fig. 2(a) shows that the top-down methods typically use a person detector to find person instances in each image and then perform single-person pose estimation for each person instance. In this section, we will introduce the top-down methods from five aspects according to their research motivations: target representation, quantization error, multi-scale feature learning, data preprocessing and non-maximum suppression.
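The detect-then-estimate control flow described above can be sketched as follows. The detector and single-person network here are hypothetical stand-ins (a real system would use, e.g., Faster R-CNN and a heatmap-based pose network), so only the crop, per-person estimation, and translation back to image coordinates are meaningful:

```python
import numpy as np

def detect_persons(image):
    # Stand-in person detector: returns bounding boxes as (x, y, w, h).
    # A real top-down system would run an object detector here.
    h, w = image.shape[:2]
    return [(0, 0, w // 2, h), (w // 2, 0, w - w // 2, h)]

def estimate_single_pose(crop, num_joints=17):
    # Stand-in single-person pose network: returns one (x, y) per joint
    # in crop coordinates. A real system would run a CNN on the crop.
    h, w = crop.shape[:2]
    return np.stack([np.full(num_joints, w / 2.0),
                     np.linspace(0, h - 1, num_joints)], axis=1)

def top_down_pose(image):
    """Stage 1: detect persons; stage 2: estimate one pose per detection."""
    poses = []
    for (x, y, w, h) in detect_persons(image):
        crop = image[y:y + h, x:x + w]
        joints = estimate_single_pose(crop)
        joints += np.array([x, y], dtype=float)  # back to image coordinates
        poses.append(joints)
    return poses

poses = top_down_pose(np.zeros((256, 192, 3)))
```

Because stage 2 runs once per detection, the cost of top-down methods grows linearly with the number of persons in the image.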

Two-stage: bottom-up methods

Unlike top-down methods, bottom-up methods only need a single network forward pass to regress all keypoints and then assign these joints to different human bodies (see Fig. 2(b)). This section introduces bottom-up methods from two aspects: joint parsing and example imbalance.
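The joint-parsing step can be illustrated with a minimal greedy matcher in the spirit of the part-affinity-field association of Cao et al. (2017). The `greedy_match` function and its score matrix are illustrative assumptions, not any author's implementation; real systems derive the scores by integrating a learned affinity field along each candidate limb:

```python
def greedy_match(scores):
    """Greedily connect joint candidates of two adjacent joint types.

    scores[i][j] is an affinity score between candidate i of the first
    joint type (e.g. elbows) and candidate j of the second (e.g. wrists).
    Returns (i, j) pairs, taking the highest-scoring free pair each time.
    """
    pairs = sorted(((s, i, j)
                    for i, row in enumerate(scores)
                    for j, s in enumerate(row)),
                   reverse=True)
    used_i, used_j, matches = set(), set(), []
    for s, i, j in pairs:
        if i not in used_i and j not in used_j:
            matches.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return matches

# Two elbows, two wrists: elbow 0 pairs strongly with wrist 1, and
# elbow 1 with wrist 0, as when two people stand side by side.
matches = greedy_match([[0.1, 0.9],
                        [0.8, 0.2]])  # → [(0, 1), (1, 0)]
```

Chaining such pairwise matches over the limb tree of the skeleton assembles full per-person poses from the joint candidates of a single forward pass.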

One-stage methods

One-stage methods, which inherit the strength of two-stage methods and overcome their shortcomings, predict joint candidates and group assignments simultaneously, e.g. associative embedding (Newell et al., 2017), MultiPoseNet (Kocabas et al., 2018), etc.

This section introduces one-stage models based on the joint assignment mechanism and classifies them into bounding box-based, embedding-based and offset-based methods as depicted in Fig. 15. Bounding box-based methods assign joints according to
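Embedding-based assignment can be sketched in a simplified, associative-embedding style (Newell et al., 2017): the network predicts a scalar tag alongside each joint, and joints are grouped with the person whose mean tag is closest. The function, data layout, and threshold below are illustrative assumptions, not the original method's implementation:

```python
def group_by_tags(detections, threshold=0.5):
    """Group detected joints into persons by embedding tags.

    detections: list of (joint_type, x, y, tag). Each joint joins the
    existing person whose mean tag is within `threshold` of its own
    tag; otherwise it starts a new person.
    """
    persons = []  # each person: {"joints": [...], "tags": [...]}
    for joint_type, x, y, tag in detections:
        best, best_dist = None, threshold
        for p in persons:
            mean_tag = sum(p["tags"]) / len(p["tags"])
            d = abs(tag - mean_tag)
            if d < best_dist:
                best, best_dist = p, d
        if best is None:
            best = {"joints": [], "tags": []}
            persons.append(best)
        best["joints"].append((joint_type, x, y))
        best["tags"].append(tag)
    return persons

# Two people: tags near 0.1 belong to one, tags near 2.0 to the other.
dets = [("nose", 10, 10, 0.1), ("nose", 50, 12, 2.0),
        ("neck", 11, 20, 0.2), ("neck", 49, 22, 1.9)]
groups = group_by_tags(dets)
```

The training loss in associative embedding pulls tags of the same person together and pushes tags of different persons apart, which is what makes this nearest-tag grouping viable at inference time.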

Datasets

Datasets play an important role in MPE. Various datasets have been released for pose estimation research, such as Leeds Sports Poses (LSP), Frames Labeled In Cinema (FLIC), Max Planck Institute Informatik (MPII), Common Objects in Context (COCO), AI Challenger (AIC), etc. LSP and FLIC are single-person datasets, while MPII, COCO and AIC are multi-person datasets. Top-down methods can also utilize existing single-person datasets. In this section, 2D pose estimation datasets and metrics are
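The standard COCO keypoint metric is built on Object Keypoint Similarity (OKS), which scores a predicted pose against the ground truth as a scale-normalized, per-joint Gaussian similarity averaged over labeled joints. The sketch below follows the published OKS definition; the `k` values passed in the example are illustrative placeholders, not the official COCO per-keypoint constants:

```python
import numpy as np

def oks(pred, gt, visible, area, k):
    """Object Keypoint Similarity as used in COCO keypoint evaluation.

    pred, gt: (N, 2) keypoint coordinates; visible: (N,) booleans for
    labeled joints; area: ground-truth object segment area (scale^2);
    k: (N,) per-keypoint falloff constants.
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    d2 = np.sum((pred - gt) ** 2, axis=1)      # squared joint distances
    e = d2 / (2.0 * area * np.asarray(k) ** 2)  # scale-normalized error
    vis = np.asarray(visible, bool)
    return float(np.mean(np.exp(-e[vis])))     # average over labeled joints

# A perfect prediction scores 1.0 regardless of object scale.
perfect = oks([[10, 10], [20, 30]], [[10, 10], [20, 30]],
              [True, True], area=400.0, k=[0.079, 0.072])
```

Average Precision is then computed by thresholding OKS (e.g. at 0.50 to 0.95), playing the same role that box IoU plays in object detection.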

Open-source systems

In order to apply MPE methods to real-world applications, several open-source MPE systems have been released, such as OpenPose, Mask R-CNN, AlphaPose, etc.

OpenPose, the first real-time MPE system, is developed and maintained by Carnegie Mellon University (CMU) and detects various body parts (135 keypoints) efficiently. It supports 2D/3D keypoint detection, camera calibration and single-person

Conclusion and future research directions

Multi-person pose estimation is a hot research field of computer vision and attracts researchers from both companies and universities all over the world. This paper reviews recent popular MPE methods from two aspects: two-stage and one-stage. Two-stage methods can be divided into top-down and bottom-up methods. In Section 3, we discuss top-down approaches from different perspectives such as target representation, quantization error, multi-scale feature learning, data preprocessing and non-maximum

CRediT authorship contribution statement

Chen Wang: Writing - original draft. Feng Zhang: Writing - review & editing. Shuzhi Sam Ge: Resources, Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (U1813202, 61773093), the National Key R & D Program of China (2018-YFC0831800), Research Programs of Sichuan Science and Technology Department, China (17ZDYF3184), Shenzhen and Hong Kong Joint Innovation Project under Grant (SGLH20161209-145252406) and Important Science and Technology Innovation Projects in Chengdu, China (2018-YF08-00039-GX).

References (78)

  • Andriluka, M., Roth, S., Schiele, B., 2009. Pictorial structures revisited: People detection and articulated pose...
  • Andriluka, M., Roth, S., Schiele, B., 2010. Monocular 3D pose estimation and tracking by detection. In: Proceedings of...
  • Cao, Z., Simon, T., Wei, S., Sheikh, Y., 2017. Realtime multi-person 2d pose estimation using part affinity fields. In:...
  • Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J., 2018. Cascaded pyramid network for multi-person pose...
  • Cheng, B., Xiao, B., Wang, J., Shi, H., Huang, T.S., Zhang, L., 2020. HigherHRNet: scale-aware representation learning...
  • Chéron, G., Laptev, I., Schmid, C., 2015. P-CNN: pose-based CNN features for action recognition. In: Proceedings of the...
  • Fan, X., Zheng, K., Lin, Y., Wang, S., 2015. Combining local appearance and holistic view: Dual-Source Deep Neural...
  • Fang, H., Xie, S., Tai, Y., Lu, C., 2017. RMPE: regional multi-person pose estimation. In: Proceedings of the IEEE...
  • Fischler, M.A., et al., 1973. The representation and matching of pictorial structures. IEEE Trans. Comput.
  • Gong, W., et al., 2016. Human pose estimation from monocular images: a comprehensive survey. Sensors.
  • Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., Bengio, Y., 2014....
  • Hattori, H., et al., 2018. Synthesizing a scene-specific pedestrian detector and pose estimator for static video surveillance - can we learn pedestrian detectors and pose estimators without real data? Int. J. Comput. Vis.
  • He, K., Gkioxari, G., Dollár, P., Girshick, R.B., 2017. Mask R-CNN. In: Proceedings of the IEEE International...
  • Holte, M.B., et al., 2012. Human pose estimation and activity recognition from multi-view videos: comparative explorations of recent developments. IEEE J. Sel. Top. Signal Process.
  • Huang, J., et al., 2019. The devil is in the details: delving into unbiased data processing for human pose estimation.
  • Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., Schiele, B., 2016. DeeperCut: a deeper, stronger, and...
  • Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K., 2015. Spatial transformer networks. In: Proceedings of the...
  • Ji, X., et al., 2010. Advances in view-invariant human motion analysis: a review. IEEE Trans. Syst. Man Cybern. C.
  • Johnson, S., Everingham, M., 2010. Clustered pose and nonlinear appearance models for human pose estimation. In:...
  • Kocabas, M., Karagoz, S., Akbas, E., 2018. MultiPoseNet: fast multi-person pose estimation using pose residual network....
  • Kreiss, S., Bertoni, L., Alahi, A., 2019. PifPaf: composite fields for human pose estimation. In: Proceedings of the...
  • Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In:...
  • Law, H., Deng, J., 2018. CornerNet: detecting objects as paired keypoints. In: Proceedings of the 15th European...
  • Law, H., et al., 2020. CornerNet: detecting objects as paired keypoints. Int. J. Comput. Vis.
  • Li, J., Su, W., Wang, Z., 2020. Simple pose: rethinking and improving a bottom-up approach for multi-person pose...
  • Li, J., Wang, C., Zhu, H., Mao, Y., Fang, H.-S., Lu, C., 2019. Crowdpose: efficient crowded scenes pose estimation and...
  • Lin, T., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J., 2017. Feature pyramid networks for object...
  • Lin, T., et al., 2020. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell.
  • Lin, T., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO:...
