Scale-aware heatmap representation for human pose estimation
Introduction
Human pose estimation, which aims to detect and localize all human anatomical keypoints in images or video sequences, is one of the core tasks in computer vision. It has attracted considerable research interest because of its importance in applications such as human action recognition and pedestrian tracking. Traditional methods [1], [2] are mostly based on graphical models or pictorial structures, but they are fragile in complicated situations (e.g., occlusion). With the rapid development of deep learning, convolutional neural networks [3], [4] have brought substantial progress to this field.
Current approaches to multi-person pose estimation can be divided into top-down and bottom-up methods. A top-down method first obtains person instances with an object detector and then performs single-person pose estimation on each instance. A bottom-up method first detects all keypoints in the image and then groups them into the corresponding people. Top-down methods usually achieve higher accuracy because they crop the sub-image containing a single person and resize it to a fixed scale before pose estimation. However, their performance depends heavily on the quality of the object detector, and their runtime grows almost linearly with the number of people in the image. In contrast, bottom-up methods can achieve real-time inference regardless of the number of people and perform better in more complicated situations (e.g., crowd scenarios).
Heatmap representation is widely used to encode the location of keypoints. An illustration of a standard heatmap is shown in Fig. 1(b). A heatmap can be defined as a confidence map generated by a Gaussian kernel centered at each annotated keypoint, where the heat value indicates the probability that the target keypoint is present at that position. Correspondingly, keypoint positions can be obtained by identifying the local maxima of the heatmap. Compared with prior approaches [3], a heatmap contains more spatial information near the keypoints, which boosts the performance of convolutional neural networks. Therefore, most mainstream methods (e.g., HigherHRNet [5], RSN [6], DarkPose [7]) use heatmaps as the default configuration. However, little attention has been paid to the scale variation problem in heatmap generation. The standard heatmap adopts a fixed Gaussian kernel for all types of keypoints, which may cause confusion because different types of keypoints have distinct scales. As shown in Fig. 1(b), two types of keypoints with different sizes (such as the eye and the hip) are rendered identically in the heatmap, yet their influence on adjacent keypoints is completely different. It is therefore unreasonable to generate heatmaps with the same Gaussian kernel for every type of keypoint.
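As a minimal sketch of the standard heatmap described above, the following pure-Python code renders one fixed-variance Gaussian per annotated keypoint and recovers keypoint positions as local maxima. The function names, the default `sigma`, the use of `max` to merge overlapping Gaussians, and the detection threshold are illustrative assumptions rather than details taken from the paper.

```python
import math


def gaussian_heatmap(height, width, keypoints, sigma=2.0):
    """Render one heatmap channel with an unnormalized Gaussian at each (x, y) keypoint.

    The same sigma is used for every keypoint, as in the standard heatmap.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for kx, ky in keypoints:
        for y in range(height):
            for x in range(width):
                dist_sq = (x - kx) ** 2 + (y - ky) ** 2
                value = math.exp(-dist_sq / (2.0 * sigma ** 2))
                # Assumption: overlapping Gaussians are merged with max,
                # so every annotated keypoint keeps a peak value of 1.0.
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap


def local_maxima(heatmap, threshold=0.5):
    """Return (x, y) positions above threshold that exceed all 8 neighbours."""
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = heatmap[y][x]
            if value < threshold:
                continue
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if all(value > n for n in neighbours):
                peaks.append((x, y))
    return peaks
```

Calling `local_maxima(gaussian_heatmap(16, 16, [(4, 5), (11, 9)]))` recovers the two annotated positions. Practical systems typically vectorize these loops with an array library; plain lists are used here only to keep the sketch dependency-free.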
Motivated by the above observation, we propose a novel algorithm, namely the scale-aware heatmap generator, to tackle the scale variation of keypoints in heatmap generation. We derive the relative scale proportions of all keypoint types from [8] and compute an adaptive Gaussian-kernel variance from these relative scales. Specifically, we set a unified variance and scale it proportionally to obtain a suitable variance for each keypoint type, thereby incorporating scale information into the heatmaps.
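The proportional scaling described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact procedure: it assumes that the relative scales are the per-keypoint constants shipped with the COCO keypoint evaluation code, and that the unified (base) value is scaled by each keypoint's constant divided by the mean constant. The function name and the default `base_sigma` are hypothetical.

```python
# COCO keypoint order and the per-keypoint constants used by the COCO
# keypoint evaluation (larger value = spatially larger keypoint).
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]
COCO_OKS_SIGMAS = [
    0.026, 0.025, 0.025, 0.035, 0.035,
    0.079, 0.079, 0.072, 0.072,
    0.062, 0.062, 0.107, 0.107,
    0.087, 0.087, 0.089, 0.089,
]


def scale_aware_sigmas(base_sigma=2.0, relative_scales=COCO_OKS_SIGMAS):
    """Scale a unified Gaussian sigma per keypoint type.

    Assumption: each sigma is proportional to the keypoint's relative scale,
    normalized so that the mean sigma over all types equals base_sigma.
    """
    mean_scale = sum(relative_scales) / len(relative_scales)
    return [base_sigma * scale / mean_scale for scale in relative_scales]


# Map keypoint names to their adaptive sigma for inspection.
sigma_by_name = dict(zip(COCO_KEYPOINT_NAMES, scale_aware_sigmas()))
```

Under these assumptions, the hip receives a wider kernel than the eye, which matches the eye and hip contrast drawn in the text.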
In addition, extensive experimental results show that differences in keypoint scale may also affect localization accuracy (shown in Fig. 2). There is a detection imbalance problem: keypoints with smaller scales are harder to detect. To deal with this problem, we adopt the design idea of focal loss [9] and build a weight-redistributed loss that automatically adjusts the contribution of different keypoints during training.
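The exact form of the weight-redistributed loss is not given in this excerpt. The sketch below only illustrates the general focal-style idea of giving harder keypoint types a larger share of the total loss; the function name, the `gamma` exponent, and the normalization are hypothetical choices and should not be read as the paper's formulation.

```python
def redistributed_loss(per_keypoint_losses, gamma=1.0):
    """Combine per-keypoint-type losses so that harder types contribute more.

    Hypothetical scheme: each weight is proportional to loss ** gamma and the
    weights are normalized to average 1. With gamma = 0 this reduces to the
    plain mean of the losses.
    """
    count = len(per_keypoint_losses)
    raw = [loss ** gamma for loss in per_keypoint_losses]
    total = sum(raw)
    if total == 0.0:
        # All losses are zero: fall back to uniform weights.
        return 0.0, [1.0] * count
    weights = [count * value / total for value in raw]
    weighted = sum(w * l for w, l in zip(weights, per_keypoint_losses)) / count
    return weighted, weights
```

In a training framework, weights of this kind would normally be treated as constants during backpropagation, so that they rescale the gradients of each keypoint type without adding gradient terms of their own.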
We use HRNet [10] as the backbone and follow [11] to build a bottom-up pose estimation system. Evaluated on the COCO val2017 dataset without multi-scale testing, our method achieves 67.5% AP, outperforming the baseline by 2.5% AP. On the COCO test-dev dataset, our model achieves 69.4% AP, which is comparable to state-of-the-art methods. We also perform several ablation experiments to show the effectiveness of each component of our approach.
The main contributions can be summarized as follows:
- We propose a scale-aware heatmap generator (SAHG) to tackle the scale variation of keypoints within the construction of heatmaps. Different from the standard heatmap that adopts a unified kernel for all types of keypoints, our SAHG will generate a customized heatmap for each type of keypoints according to their relative scales;
- To alleviate the detection imbalance problem and facilitate the detection of hard keypoints, we design a weight-redistributed loss function, which regulates the contribution between different keypoints automatically and thus improves the estimation accuracy.
Section snippets
Related work
Extensive efforts have been made to handle the scale variation of human bodies in pose estimation. These efforts fall into two categories: one modifies the structure of the convolutional neural network along the lines of an image pyramid, and the other incorporates the impact of scale into the construction of loss functions.
Feature pyramid. FPN [16] exploits feature pyramids to extract multi-scale features, designing a pyramidal hierarchy network followed by a number of works [17],
Proposed methods
According to [5], [10], [22], [24], heatmaps are widely used as coordinate representations to learn a robust model. Since heatmaps provide rich spatial information during training and reflect the keypoint location probability at detection time, the quality of the heatmaps has a strong effect on the performance of human pose estimation.
However, how to generate more suitable heatmaps has rarely been considered systematically by existing works. To address this problem, we propose SAHG and the whole
Implementation details
We use [11] as our baseline to construct a bottom-up multi-person pose estimation system. Our method is trained and evaluated on the MS-COCO 2017 dataset [8], which consists of a train set (57K images), a val set (5K images), and a test-dev set (20K images).
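Evaluation metric. The average precision and average recall based on Object Keypoint Similarity (OKS) are employed as evaluation metrics. $\mathrm{OKS} = \dfrac{\sum_i \exp\left(-d_i^2 / (2 s^2 k_i^2)\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$, where $d_i$ stands for the Euclidean distance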
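For reference, the OKS between one predicted pose and one ground-truth pose can be computed as in the sketch below, which follows the standard COCO definition: a per-keypoint Gaussian similarity of the squared Euclidean distance, normalized by the object area and a per-keypoint constant, averaged over labeled keypoints. The function and parameter names are illustrative assumptions.

```python
import math


def object_keypoint_similarity(pred, gt, visibility, area, k):
    """Compute OKS for one person.

    pred, gt:    lists of (x, y) keypoint coordinates
    visibility:  list of visibility flags (0 means the keypoint is unlabeled)
    area:        object scale squared (s^2), e.g. the segment area
    k:           per-keypoint constants controlling the falloff
    """
    total = 0.0
    labeled = 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visibility, k):
        if v <= 0:
            continue  # unlabeled keypoints do not count
        dist_sq = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-dist_sq / (2.0 * area * ki ** 2))
        labeled += 1
    return total / labeled if labeled else 0.0
```

A perfect prediction gives an OKS of 1.0, and the value decays toward 0 as the predicted keypoints move away from the ground truth. AP and AR are then obtained by thresholding OKS in the same way that IoU is thresholded in object detection.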
Conclusions
In this paper, we have proposed SAHG to tackle the scale variation of keypoints in the construction of heatmaps. The generator produces more precise heatmaps for each type of keypoint based on their relative scales, and it can be readily embedded in the data preprocessing stage of human pose estimation. We have also proposed a weight-redistributed loss function that rebalances the contribution of different keypoints in order to further increase the detection accuracy of hard keypoints. Extensive
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61871437 and in part by the Natural Science Foundation of Hubei Province of China under Grant 2019CFA022. The authors would also like to thank the editors and anonymous reviewers for their insightful comments, which greatly improved the quality of this paper.
References (28)
- et al. The representation and matching of pictorial structures. IEEE Trans. Comput. (1973)
- et al. Efficient matching of pictorial structures. CVPR (2000)
- et al. DeepPose: Human pose estimation via deep neural networks. CVPR (2014)
- et al. Joint training of a convolutional network and a graphical model for human pose estimation. NIPS (2014)
- et al. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. CVPR (2020)
- et al. Learning delicate local representations for multi-person pose estimation. arXiv preprint arXiv:2003.04030 (2020)
- et al. Distribution-aware coordinate representation for human pose estimation. CVPR (2020)
- et al. Microsoft COCO: Common objects in context. ECCV (2014)
- et al. Focal loss for dense object detection. ICCV (2017)
- et al. Deep high-resolution representation learning for human pose estimation. CVPR (2019)
- Bottom-up human pose estimation by ranking heatmap-guided adaptive keypoint estimates. arXiv preprint arXiv:2006.15480
- 2D human pose estimation: New benchmark and state of the art analysis. CVPR
- Multi-scale structure-aware network for human pose estimation. ECCV
- Adversarial semantic data augmentation for human pose estimation. ECCV