A comprehensive survey on 2D multi-person pose estimation methods
Introduction
Human pose estimation, a fundamental yet challenging detection task, aims to recover body postures from images. Multi-person Pose Estimation (MPE) captures the poses of multiple persons simultaneously and is one of the most important subfields of human pose estimation. It can be applied to many fields, such as action recognition (Chéron et al., 2015), person re-identification (Qian et al., 2018), pedestrian tracking (Andriluka et al., 2010), animation production and digital film-making (Siarohin et al., 2019), virtual reality (Pavlakos et al., 2019), human–computer interaction (Weidenbacher et al., 2006), video surveillance (Hattori et al., 2018), self-driving (Murphy-Chutorian et al., 2007), and sports motion analysis (Martinez et al., 2017).
The rapid development of deep learning has brought remarkable achievements in computer vision and has been introduced into various fields, e.g., image classification (Krizhevsky et al., 2012), object detection (Ren et al., 2015, Redmon et al., 2016), gameplay (Mnih et al., 2015), and robotic navigation (Mirowski et al., 2018). MPE has also benefited greatly from this development. Although some surveys on human pose estimation exist (Perez-Sala et al., 2014, Gong et al., 2016, Sarafianos et al., 2016, Presti and Cascia, 2016, Wang et al., 2018, Chen et al., 2020), a survey that summarizes the most recent MPE achievements in detail is still lacking. Perez-Sala et al. (2014), Aggarwal and Cai (1999), Gavrila (1999), Ji and Liu (2010), Holte et al. (2012), Moeslund and Granum (2001) and Moeslund et al. (2006) introduced early model-based works that model the human body as pictorial structures. Chen et al. (2013) and Wang et al. (2018) reviewed works on human motion analysis with depth imagery. Sarafianos et al. (2016) and Presti and Cascia (2016) reviewed 3D human pose estimation and 3D skeleton-based action classification, respectively. Chen et al. (2020) presented monocular human pose estimation from the single-person to the multi-person perspective and from 2D to 3D pose estimation. However, none of these works provides a comprehensive retrospective on MPE, and existing surveys do not cover state-of-the-art methods such as Zhang et al. (2019b), Sun et al. (2018a) and Cheng et al. (2020). Our survey gives a clear and holistic view of MPE: we analyze the characteristics of these methods, capture the subtle differences between them, and highlight their advantages and motivations. In addition, the commonly used datasets, metrics, and open-source systems are also presented.
The remainder of this paper is organized as follows. Section 2 introduces the different categories of multi-person pose estimation methods; in this paper, we introduce recent works from two aspects: two-stage MPE and one-stage MPE. In Sections 3 and 4, two-stage MPE methods, including top-down and bottom-up approaches, are illustrated systematically and comprehensively according to their research motivations, such as multi-scale feature learning, data preprocessing, and quantization error. Section 5 presents research on one-stage MPE, which attempts to strike a balance between speed and accuracy. The commonly used datasets and metrics for MPE are described in Section 6, and several open-source systems are presented in Section 7. Finally, we summarize two-stage and one-stage MPE methods and discuss future research directions to inspire new ideas in the MPE field.
Categories of multi-person pose estimation
Generally, MPE methods can be divided into three categories according to different classification criteria: (1) deep learning-based vs. model-based, (2) two-stage vs. one-stage, (3) graph-free vs. graph matching-based.
Deep Learning-based vs. Model-based: The difference between deep learning-based and model-based methods is whether we define an explicit hand-crafted model to estimate human poses or not. In Fig. 1(a), traditional model-based methods (Andriluka et al., 2009, Fischler and
Two-stage: top-down methods
Fig. 2(a) shows that top-down methods typically use a person detector to find person instances in an image and then perform single-person pose estimation on each instance. In this section, we introduce top-down methods from five aspects according to their research motivations: target representation, quantization error, multi-scale feature learning, data preprocessing and non-maximum suppression.
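The detect-then-estimate pipeline described above can be sketched in a few lines. This is a minimal illustration, not the pipeline of any specific method: `detect_persons` and `estimate_single_pose` are hypothetical stand-ins for a trained person detector and single-person pose network, and the only logic shown is the cropping and the mapping of keypoints from crop coordinates back to image coordinates.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]          # (x, y, w, h) person bounding box
Pose = List[Tuple[float, float, float]]  # (x, y, confidence) per keypoint


def top_down_pose_estimation(image,
                             detect_persons: Callable,
                             estimate_single_pose: Callable) -> List[Pose]:
    """Stage 1: detect person instances; Stage 2: estimate one pose per crop."""
    poses = []
    for x, y, w, h in detect_persons(image):
        # crop the detected person region from the image (row-major list of rows)
        crop = [row[x:x + w] for row in image[y:y + h]]
        local = estimate_single_pose(crop)
        # map keypoints from crop coordinates back to full-image coordinates
        poses.append([(kx + x, ky + y, c) for kx, ky, c in local])
    return poses
```

Because the pose network runs once per detected person, the cost of this scheme grows linearly with the number of people in the image, which is one motivation for the bottom-up and one-stage methods discussed later.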
Two-stage: bottom-up methods
Unlike top-down methods, bottom-up methods require only a single network forward pass to predict all keypoints, which are then assigned to different human bodies (see Fig. 2(b)). This section introduces bottom-up methods from two aspects: joint parsing and example imbalance.
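The joint-parsing step can be illustrated with a toy greedy matcher. This is a simplified sketch, not the grouping algorithm of any particular paper: it assumes candidate joints of two adjacent types (e.g. elbows and wrists) and a hypothetical pairwise `score` function (in part-affinity-field methods such a score comes from integrating the field along the candidate limb), and connects pairs in descending score order so each joint is used at most once.

```python
def greedy_limb_matching(joints_a, joints_b, score):
    """Greedily connect candidate joints of two adjacent types
    (e.g. elbow -> wrist) by descending pairwise score."""
    pairs = sorted(
        ((score(a, b), i, j)
         for i, a in enumerate(joints_a)
         for j, b in enumerate(joints_b)),
        reverse=True)
    used_a, used_b, limbs = set(), set(), []
    for s, i, j in pairs:
        # each joint may belong to only one person; skip non-positive scores
        if i not in used_a and j not in used_b and s > 0:
            used_a.add(i)
            used_b.add(j)
            limbs.append((i, j))
    return limbs
```

Chaining such matches over the whole kinematic tree assembles full skeletons from the single set of detected keypoints, which is why bottom-up runtime is largely independent of the number of people.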
One-stage methods
One-stage methods, which inherit the strength of two-stage methods and overcome their shortcomings, predict joint candidates and group assignments simultaneously, e.g. associative embedding (Newell et al., 2017), MultiPoseNet (Kocabas et al., 2018), etc.
This section introduces one-stage models based on the joint assignment mechanism and classifies them into bounding box-based, embedding-based and offset-based methods as depicted in Fig. 15. Bounding box-based methods assign joints according to
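The embedding-based grouping idea of associative embedding (Newell et al., 2017) can be sketched as follows. This is a minimal illustration under simplifying assumptions: each detected joint carries a scalar tag, and a joint is attached to the existing person whose mean tag is closest (within a hypothetical threshold `tau`) and who does not yet have that joint type; the actual method learns the tags end-to-end and uses a more careful matching.

```python
def group_by_tags(detections, tau=0.5):
    """detections: list of (joint_type, x, y, tag).
    Group joints into people by scalar-tag proximity."""
    people = []  # each person: {"joints": {type: (x, y)}, "tags": [...]}
    for jtype, x, y, tag in detections:
        best, best_d = None, tau
        for p in people:
            mean_tag = sum(p["tags"]) / len(p["tags"])
            d = abs(tag - mean_tag)
            # a person can have each joint type at most once
            if jtype not in p["joints"] and d <= best_d:
                best, best_d = p, d
        if best is None:  # no compatible person: start a new one
            best = {"joints": {}, "tags": []}
            people.append(best)
        best["joints"][jtype] = (x, y)
        best["tags"].append(tag)
    return people
```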
Datasets
Datasets play an important role in MPE. Various datasets have been released for pose estimation research, such as Leeds Sports Poses (LSP), Frames Labeled In Cinema (FLIC), Max Planck Institut Informatik (MPII), Common Objects in Context (COCO), and AI Challenger (AIC). LSP and FLIC are single-person datasets, while MPII, COCO and AIC are multi-person datasets; top-down methods can also utilize existing single-person datasets. In this section, 2D pose estimation datasets and metrics are
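Among the metrics used with these datasets, COCO evaluates predictions with Object Keypoint Similarity (OKS), defined per object as a Gaussian of the keypoint distances: OKS = Σᵢ exp(−dᵢ² / (2 s² kᵢ²)) δ(vᵢ > 0) / Σᵢ δ(vᵢ > 0), where dᵢ is the Euclidean distance between a predicted and ground-truth keypoint, s² the object's scale (segment area), kᵢ a per-keypoint constant, and vᵢ the visibility label. A simplified scalar implementation, for illustration only (the official `cocoeval` code handles many objects and thresholds at once):

```python
import math


def oks(pred, gt, visible, k, area):
    """Object Keypoint Similarity for one object.
    pred/gt: (x, y) keypoint lists; visible: labeled-keypoint flags;
    k: per-keypoint falloff constants; area: object scale s^2."""
    num, den = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visible, k):
        if v:  # only keypoints labeled in the ground truth count
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            num += math.exp(-d2 / (2.0 * area * ki ** 2))
            den += 1
    return num / den if den else 0.0
```

OKS lies in [0, 1], reaching 1 for a perfect prediction, and plays the role that IoU plays in object detection: average precision is computed over a range of OKS thresholds.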
Open-source systems
In order to apply MPE methods to real-world applications, several open-source MPE systems have been released, such as OpenPose, Mask R-CNN, AlphaPose, etc.
OpenPose, the first real-time MPE system, is developed and maintained by Carnegie Mellon University (CMU) and detects various body parts (135 keypoints) efficiently. It supports 2D/3D keypoint detection, camera calibration and single-person
Conclusion and future research directions
Multi-person pose estimation is a hot research field in computer vision that attracts researchers from both companies and universities all over the world. This paper reviews recent popular MPE methods from two aspects: two-stage and one-stage. Two-stage methods can be divided into top-down and bottom-up methods. In Section 3, we discuss top-down approaches from different perspectives, such as target representation, quantization error, multi-scale feature learning, data preprocessing and non-maximum suppression.
CRediT authorship contribution statement
Chen Wang: Writing - original draft. Feng Zhang: Writing - review & editing. Shuzhi Sam Ge: Resources, Supervision, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (U1813202, 61773093), the National Key R & D Program of China (2018-YFC0831800), Research Programs of Sichuan Science and Technology Department, China (17ZDYF3184), Shenzhen and Hong Kong Joint Innovation Project under Grant (SGLH20161209-145252406) and Important Science and Technology Innovation Projects in Chengdu, China (2018-YF08-00039-GX).
References (78)
- Aggarwal, J.K., Cai, Q., 1999. Human motion analysis: a review. Comput. Vis. Image Underst.
- Chen, Y., Tian, Y., He, M., 2020. Monocular human pose estimation: a survey of deep learning-based methods. Comput. Vis. Image Underst.
- Chen, L., Wei, H., Ferryman, J., 2013. A survey of human motion analysis using depth imagery. Pattern Recognit. Lett.
- Gavrila, D.M., 1999. The visual analysis of human movement: a survey. Comput. Vis. Image Underst.
- Luvizon, D.C., Tabia, H., Picard, D., 2019. Human pose regression by combining indirect part detection and contextual information. Comput. Graph.
- Moeslund, T.B., Granum, E., 2001. A survey of computer vision-based human motion capture. Comput. Vis. Image Underst.
- Moeslund, T.B., Hilton, A., Krüger, V., 2006. A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Underst.
- Sarafianos, N., Boteanu, B., Ionescu, B., Kakadiaris, I.A., 2016. 3D human pose estimation: a review of the literature and analysis of covariates. Comput. Vis. Image Underst.
- Wang, P., Li, W., Ogunbona, P., Wan, J., Escalera, S., 2018. RGB-D-based human motion recognition with deep learning: a survey. Comput. Vis. Image Underst.
- Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B., 2014. 2D human pose estimation: new benchmark and state of the...