Pattern Recognition Letters

Volume 154, February 2022, Pages 1-6

Scale-aware heatmap representation for human pose estimation

https://doi.org/10.1016/j.patrec.2021.12.018

Highlights

  • We attempt to tackle the scale variation of keypoints within heatmap generation.

  • We generate customized heatmaps for each type of keypoint according to its scale.

  • We propose a weight-redistributed loss to deal with the detection imbalance problem.

  • Our method outperforms the baseline and is comparable with state-of-the-art methods.

Abstract

The performance of multi-person pose estimation is seriously affected by scale variation. Extensive works have been devoted to reducing this effect by modifying the convolutional network structure or the loss function, but little attention has been paid to the problem in the construction of heatmaps. In this paper, we focus on the scale variation of keypoints within heatmap generation and propose a novel method called the scale-aware heatmap generator, which constructs a customized heatmap for each type of keypoint based on its relative scale. In addition, we design a weight-redistributed loss function to facilitate the detection of keypoints that are hard to identify. Our approach outperforms the baseline by nearly 2.5% in average precision and performs on par with the state-of-the-art result in bottom-up pose estimation with multi-scale testing (69.4% AP) on the COCO test-dev dataset.

Introduction

Human pose estimation, which aims to detect and locate all human anatomical keypoints in given images or video sequences, is one of the core tasks in computer vision. It has attracted considerable research interest due to its significance in various applications, such as human action recognition and pedestrian tracking. Traditional methods [1], [2] are mostly based on graphical models or pictorial structures, but they are fragile in complicated situations (e.g., occlusion). Thanks to the rapid development of deep learning, the employment of convolutional neural networks [3], [4] has brought substantial progress to this field.

Existing approaches for multi-person pose estimation can be divided into top-down and bottom-up methods. The top-down method first obtains person instances using an object detector and then performs single-person pose estimation on each instance individually. The bottom-up method first detects all keypoints in the image and then groups them into the corresponding people. The top-down method usually achieves higher accuracy because it crops and resizes each sub-image containing a single person to a fixed scale for pose estimation. However, its performance depends heavily on the quality of the object detector, and its runtime is almost linearly related to the number of people in the image. In contrast, the bottom-up method can achieve real-time inference regardless of the number of people and performs better in more complicated situations (e.g., crowded scenes).

Heatmap representation is widely used to encode the locations of keypoints. An illustration of a standard heatmap is shown in Fig. 1(b). The heatmap can be defined as a confidence map generated from a Gaussian kernel centered at each annotated keypoint, and the heat value indicates the probability that the target keypoint is present at that location. Correspondingly, the keypoint positions can be obtained by identifying the local maxima in the heatmap. Compared with prior approaches [3], the heatmap contains more spatial information near the keypoints, boosting the performance of convolutional neural networks. Therefore, most of the mainstream methods (e.g., HigherHRNet [5], RSN [6], DarkPose [7]) use heatmaps as the default representation. However, little attention has been paid to the scale variation problem within the generation of heatmaps. The standard heatmap adopts a fixed Gaussian kernel for all types of keypoints, which may cause confusion because different types of keypoints have distinct scales. As shown in Fig. 1(b), two kinds of keypoints with different sizes (such as the eye and the hip) are represented identically in the heatmap, yet their influence on adjacent keypoints is completely different. It is therefore unreasonable to use the same Gaussian kernel for every type of keypoint when generating heatmaps.
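
For reference, the following is a minimal NumPy sketch of this standard fixed-kernel construction; the function name render_gaussian_heatmap and the parameter sigma are our own illustrative choices, not code from the paper.

```python
import numpy as np

def render_gaussian_heatmap(center, heatmap_size, sigma):
    """Render one standard heatmap: an unnormalized Gaussian with fixed
    standard deviation `sigma` centered at the annotated keypoint."""
    h, w = heatmap_size
    xs = np.arange(w, dtype=np.float32)
    ys = np.arange(h, dtype=np.float32)[:, None]
    cx, cy = center
    # Heat value = confidence that the keypoint lies at this pixel.
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Example: eye and hip receive the *same* kernel in the standard scheme,
# even though their physical extents differ considerably.
eye_map = render_gaussian_heatmap((20, 20), (64, 48), sigma=2.0)
hip_map = render_gaussian_heatmap((30, 40), (64, 48), sigma=2.0)
```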

Motivated by the above observation, we propose a novel algorithm, namely the scale-aware heatmap generator, to tackle the scale variation of keypoints within heatmap generation. We count the relative scale proportions of all keypoint types according to [8] and calculate an adaptive variance of the Gaussian kernel based on these relative scales. Specifically, we set a unified base variance and compute a suitable variance for each keypoint type proportionally, thereby incorporating scale information into the heatmaps.
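
As an illustration of this idea, the sketch below builds on the render_gaussian_heatmap function from the previous sketch; the RELATIVE_SCALE values are hypothetical placeholders, since the actual per-type proportions are counted from the COCO annotations [8] and are not reproduced here.

```python
# Hypothetical relative scales for illustration only; the real proportions
# are derived from the COCO keypoint annotations [8].
RELATIVE_SCALE = {"eye": 0.5, "shoulder": 1.0, "hip": 1.4}

def scale_aware_sigma(base_sigma, keypoint_type):
    """Scale a unified base standard deviation by the keypoint type's
    relative scale, so the rendered heatmap reflects that scale."""
    return base_sigma * RELATIVE_SCALE[keypoint_type]

def render_scale_aware_heatmap(center, heatmap_size, base_sigma, keypoint_type):
    # Reuses render_gaussian_heatmap from the sketch above.
    sigma = scale_aware_sigma(base_sigma, keypoint_type)
    return render_gaussian_heatmap(center, heatmap_size, sigma)

# Small keypoints (eye) now receive a tighter kernel than large ones (hip).
eye_map = render_scale_aware_heatmap((20, 20), (64, 48), 2.0, "eye")
hip_map = render_scale_aware_heatmap((30, 40), (64, 48), 2.0, "hip")
```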

In addition, extensive experimental results show that differences in keypoint scale also affect localization accuracy (shown in Fig. 2): there is a detection imbalance problem in which keypoints with smaller scales are harder to detect. To deal with this problem, we adopt the design idea of focal loss [9] and build a weight-redistributed loss that automatically adjusts the contributions of different keypoints during training.
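
The full formulation of the loss is given in the methods section; purely as an illustration of the focal-style reweighting idea, a rough PyTorch sketch (our own simplification; the exponent gamma and the normalization choice are hypothetical, not the paper's) might look as follows.

```python
import torch

def weight_redistributed_mse(pred, target, gamma=2.0, eps=1e-6):
    """Illustrative focal-style heatmap loss (not the paper's exact form):
    per-keypoint weights grow with the current error, so keypoints that are
    harder to detect contribute more to the gradient.

    pred, target: tensors of shape (B, K, H, W).
    """
    # Per-keypoint mean squared error over the spatial dimensions.
    per_kpt_err = ((pred - target) ** 2).mean(dim=(2, 3))  # (B, K)
    # Focal-style weights: larger error -> larger weight; normalized so the
    # weights over the K keypoints sum to K (redistribution, not rescaling).
    w = (per_kpt_err + eps) ** gamma
    w = w / w.sum(dim=1, keepdim=True) * per_kpt_err.shape[1]
    # Detach the weights so they modulate, but do not receive, gradients.
    return (w.detach() * per_kpt_err).mean()
```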

We use HRNet [10] as the backbone and follow [11] to build a bottom-up pose estimation system. Evaluated on the COCO val2017 dataset without multi-scale testing, our method achieves 67.5% AP, outperforming the baseline by 2.5% AP. Our model achieves 69.4% AP on the COCO test-dev dataset, which is comparable with state-of-the-art methods. We also perform several ablation experiments to show the effectiveness of each component of our approach.

The main contributions can be summarized as follows:

  • We propose a scale-aware heatmap generator (SAHG) to tackle the scale variation of keypoints within the construction of heatmaps. Different from the standard heatmap, which adopts a unified kernel for all types of keypoints, our SAHG generates a customized heatmap for each type of keypoint according to its relative scale;

  • To alleviate the detection imbalance problem and facilitate the detection of hard keypoints, we design a weight-redistributed loss function, which regulates the contribution between different keypoints automatically and thus improves the estimation accuracy.

Section snippets

Related work

Extensive efforts have been made to handle the scale variation of human bodies in pose estimation, and they can be classified into two categories: one modifies the structure of the convolutional neural network following the image pyramid structure, and the other incorporates the impact of scale into the construction of loss functions.

Feature pyramid. FPN [16] exploits feature pyramids to extract multi-scale features, designing a pyramidal hierarchy network followed by a number of works [17],

Proposed methods

According to [5], [10], [22], [24], heatmaps are widely leveraged as coordinate representations to learn a robust model. Since heatmaps provide rich spatial information during training and reflect the keypoint location probability for detection, well-designed heatmaps have a great effect on the performance of human pose estimation.

However, how to generate more suitable heatmaps has rarely been considered systematically by existing works. To address this problem, we propose SAHG and the whole

Implementation details

We use [11] as our baseline to construct a bottom-up multi-person pose estimation system. Our method is trained and evaluated on the MS-COCO 2017 dataset [8], which consists of a train set (57K images), a val set (5K images), and a test-dev set (20K images).

Evaluation metric. The average precision and average recall based on Object Keypoint Similarity (OKS) are employed as evaluation metrics: $\mathrm{OKS} = \frac{\sum_i \exp\left(-d_i^2 / 2s^2 k_i^2\right)\,\delta(v_i>0)}{\sum_i \delta(v_i>0)}$, where $d_i$ stands for the Euclidean distance between the detected keypoint and its corresponding ground truth, $v_i$ is the visibility flag of the ground-truth keypoint, $s$ is the object scale, and $k_i$ is a per-keypoint constant that controls falloff.
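
For reference, a small NumPy sketch of this metric follows; the variable names and the use of the object's segment area for $s^2$ follow the standard COCO convention rather than code from the paper.

```python
import numpy as np

def compute_oks(pred, gt, vis, k, area):
    """Object Keypoint Similarity between one predicted and one ground-truth
    pose, following the OKS definition in the text.

    pred, gt: (K, 2) arrays of keypoint coordinates.
    vis:      (K,) visibility flags of the ground-truth keypoints.
    k:        (K,) per-keypoint constants controlling falloff.
    area:     object segment area, i.e., s^2 in the formula.
    """
    d2 = ((pred - gt) ** 2).sum(axis=1)   # squared Euclidean distances d_i^2
    labeled = vis > 0                     # delta(v_i > 0)
    e = np.exp(-d2 / (2.0 * area * k ** 2))
    # Average the similarity over labeled keypoints only.
    return e[labeled].sum() / max(labeled.sum(), 1)
```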

Conclusions

In this paper, we have proposed a SAHG to tackle the scale variation of keypoints in the construction of heatmaps. The generator can produce more precise heatmaps for each type of keypoints based on their relative scales, and it can be readily embedded in data preprocessing for human pose estimation. We also proposed a weight-redistributed loss function to optimize the contribution between different keypoints, in order to further increase the detection accuracy of hard keypoints. Extensive

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61871437 and in part by the Natural Science Foundation of Hubei Province of China under Grant 2019CFA022. The authors would also like to thank the editors and anonymous reviewers for their insightful comments, which greatly improved the quality of this paper.

References (28)

  • M.A. Fischler et al.

    The representation and matching of pictorial structures

    IEEE Trans. Comput.

    (1973)
  • P.F. Felzenszwalb et al.

    Efficient matching of pictorial structures

    CVPR

    (2000)
  • A. Toshev et al.

    DeepPose: Human pose estimation via deep neural networks

    CVPR

    (2014)
  • J.J. Tompson et al.

    Joint training of a convolutional network and a graphical model for human pose estimation

    NIPS

    (2014)
  • B. Cheng et al.

    HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation

    CVPR

    (2020)
  • Y. Cai et al.

    Learning delicate local representations for multi-person pose estimation

    arXiv preprint arXiv:2003.04030

    (2020)
  • F. Zhang et al.

    Distribution-aware coordinate representation for human pose estimation

    CVPR

    (2020)
  • T.-Y. Lin et al.

    Microsoft COCO: Common objects in context

    ECCV

    (2014)
  • T.-Y. Lin et al.

    Focal loss for dense object detection

    ICCV

    (2017)
  • K. Sun et al.

    Deep high-resolution representation learning for human pose estimation

    CVPR

    (2019)
  • K. Sun et al.

    Bottom-up human pose estimation by ranking heatmap-guided adaptive keypoint estimates

    arXiv preprint arXiv:2006.15480

    (2020)
  • M. Andriluka et al.

    2D human pose estimation: New benchmark and state of the art analysis

    CVPR

    (2014)
  • L. Ke et al.

    Multi-scale structure-aware network for human pose estimation

    ECCV

    (2018)
  • Y. Bin et al.

    Adversarial semantic data augmentation for human pose estimation

    ECCV

    (2020)