Abstract
Human pose estimation plays a crucial role in computer vision, such as understanding body language and tracking behavior. We observed that neural networks trained to generate heatmaps of human joints often produce blurred outputs that lacks a well-defined Gaussian structure. Despite this, existing methods largely prioritize network architecture innovations, neglecting heatmap generation itself. In light of the discovered importance, we propose a novel approach that incorporates a visual center module and a heatmap enhancer to improve existing human pose estimation methods. First, we extract features from the backbone network (any model based on convolutional neural networks) at different depths. The visual center module is then used to capture the global long-range dependencies and the cross-channel message of these features, which facilitates the optimal generation of the heatmap. Finally, the heatmap value distribution is adjusted using the heatmap enhancer. The heatmap enhancer can handle multiple peaks around the maximum activation through a Gaussian filter, allowing the heatmap to achieve accurate localization of the body’s joint points. We combine our module with current mainstream human pose estimation methods for experiments. The experimental results show that the proposed method has achieved good results on the two benchmark datasets of MSCOCO and MPII.









Similar content being viewed by others
Data availibility
The datasets generated during and/or analysed during the current study are available in the COCO and MPII repository, https://cocodataset.org/http://human-pose.mpi-inf.mpg.de/.
References
Chen, H., Feng, R., Wu, S., Xu, H., Zhou, F., Liu, Z.: 2d human pose estimation: a survey. Multimed. Syst. 29(5), 3115–3138 (2023). https://doi.org/10.1007/s00530-022-01019-0
Chen, Y., Tian, Y., He, M.: Monocular human pose estimation: a survey of deep learning-based methods. Comput. Vis. Image Underst. 192, 102897 (2020). https://doi.org/10.1016/j.cviu.2019.102897
Dubey, S., Dixit, M.: A comprehensive survey on human pose estimation approaches. Multimed. Syst. 29(1), 167–195 (2023). https://doi.org/10.1007/s00530-022-00980-0
Andriluka, M., Roth, S., Schiele, B.: Pictorial structures revisited: people detection and articulated pose estimation. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1014–1021. IEEE (2009). https://doi.org/10.1109/cvpr.2009.5206754
Johnson, S., Everingham, M.: Clustered pose and nonlinear appearance models for human pose estimation. In: Bmvc, vol. 2, p. 5. Aberystwyth, UK (2010). https://doi.org/10.5244/c.24.12
Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E., et al.: Deep learning for computer vision: a brief review. Comput. Intell. Neurosci. (2018). https://doi.org/10.1155/2018/7068349
Jain, A., Tompson, J., Andriluka, M., Taylor, G.W., Bregler, C.: Learning human pose estimation features with convolutional networks (2013). https://doi.org/10.48550/arXiv.1312.7302. arXiv preprint arXiv:1312.7302
Luvizon, D.C., Tabia, H., Picard, D.: Human pose regression by combining indirect part detection and contextual information. Comput. Graph. 85, 15–22 (2019). https://doi.org/10.1016/j.cag.2019.09.002
Sun, X., Shang, J., Liang, S., Wei, Y.: Compositional human pose regression. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2602–2611 (2017). https://doi.org/10.1016/j.cviu.2018.10.006
Yang, W., Li, S., Ouyang, W., Li, H., Wang, X.: Learning feature pyramids for human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1281–1290 (2017). https://doi.org/10.1109/ICCV.2017.144
Wei, S.-E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4732 (2016). https://doi.org/10.1109/CVPR.2016.511
Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7103–7112 (2018). https://doi.org/10.1109/CVPR.2018.00742
Xue, N., Wu, T., Xia, G.-S., Zhang, L.: Learning local-global contextual adaptation for multi-person pose estimation. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13055–13064 (2022). https://doi.org/10.1109/CVPR52688.2022.01272
Diller, C., Funkhouser, T., Dai, A.: Forecasting characteristic 3d poses of human actions. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15893–15902 (2022). https://doi.org/10.1109/CVPR52688.2022.01545
Zhao, Q., Zheng, C., Liu, M., Wang, P., Chen, C.: Poseformerv2: exploring frequency domain for efficient and robust 3d human pose estimation. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8877–8886 (2023). https://doi.org/10.1109/CVPR52729.2023.00857
Tompson, J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. In: NIPS’14, pp. 1799–1807. MIT Press, Cambridge (2014). https://doi.org/10.5555/2968826.2969027
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Computer Vision—ECCV 2016, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_29
Ke, L., Chang, M.-C., Qi, H., Lyu, S.: Multi-scale structure-aware network for human pose estimation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision—ECCV 2018, pp. 731–746. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01216-8_44
Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5686–5696 (2019). https://doi.org/10.1109/CVPR.2019.00584
Zhang, F., Zhu, X., Ye, M.: Fast human pose estimation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3512–3521 (2019). https://doi.org/10.1109/CVPR.2019.00363
Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., Sheikh, Y.: Openpose: realtime multi-person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 43(1), 172–186 (2021). https://doi.org/10.1109/TPAMI.2019.2929257
Cheng, B., Xiao, B., Wang, J., Shi, H., Huang, T.S., Zhang, L.: Higherhrnet: scale-aware representation learning for bottom-up human pose estimation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5385–5394 (2020). https://doi.org/10.1109/CVPR42600.2020.00543
Chen, C.-H., Ramanan, D.: 3d human pose estimation = 2d pose estimation + matching. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5759–5767 (2017). https://doi.org/10.1109/CVPR.2017.610
Ma, X., Su, J., Wang, C., Zhu, W., Wang, Y.: 3d human mesh estimation from virtual markers. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 534–543 (2023). https://doi.org/10.1109/CVPR52729.2023.00059
Wang, Z., Nie, X., Qu, X., Chen, Y., Liu, S.: Distribution-aware single-stage models for multi-person 3d pose estimation. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13086–13095 (2022). https://doi.org/10.1109/CVPR52688.2022.01275
Tolstikhin, I.O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., Lucic, M., Dosovitskiy, A.: Mlp-mixer: an all-mlp architecture for vision. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 24261–24272. Curran Associates, Inc. (2021)
Ding, X., Xia, C., Zhang, X., Chu, X., Han, J., Ding, G.: Repmlp: re-parameterizing convolutions into fully-connected layers for image recognition (2021). arXiv preprint arXiv:2105.01883
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10809–10819 (2022). https://doi.org/10.1109/CVPR52688.2022.01055
Sun, X., Xiao, B., Wei, F., Liang, S., Wei, Y.: Integral human pose regression. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision—ECCV 2018, pp. 536–553. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_33
Mao, W., Ge, Y., Shen, C., Tian, Z., Wang, X., Wang, Z.: Tfpose: direct human pose estimation with transformers (2021). arXiv preprint arXiv:2103.15320
Toshev, A., Szegedy, C.: Deeppose: human pose estimation via deep neural networks. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660 (2014). https://doi.org/10.1109/CVPR.2014.214
Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E.: Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 42(8), 2011–2023 (2020). https://doi.org/10.1109/TPAMI.2019.2913372
Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323 (2011) (JMLR Workshop and Conference Proceedings). https://doi.org/10.1109/IWAENC.2016.7602891
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
Larsson, G., Maire, M., Shakhnarovich, G.: Fractalnet: ultra-deep neural networks without residuals (2016). arXiv preprint arXiv:1605.07648
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision—ECCV 2014, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2d human pose estimation: New benchmark and state of the art analysis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014). arXiv preprint arXiv:1412.6980
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision—ECCV 2018, pp. 472–487. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_29
Li, S., Wang, Z., Liu, Z., Tan, C., Lin, H., Wu, D., Chen, Z., Zheng, J., Li, S.Z.: Efficient multi-order gated aggregation network (2022). https://doi.org/10.48550/arXiv.2211.03295. arXiv preprint arXiv:2211.03295
Zhang, J., Chen, Z., Tao, D.: Towards high performance human keypoint detection. Int. J. Comput. Vis. 129(9), 2639–2662 (2021). https://doi.org/10.1007/s11263-021-01482-8
Xie, Z., Geng, Z., Hu, J., Zhang, Z., Hu, H., Cao, Y.: Revealing the dark secrets of masked image modeling. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14475–14485 (2023). https://doi.org/10.1109/CVPR52729.2023.01391
Geng, Z., Wang, C., Wei, Y., Liu, Z., Li, H., Hu, H.: Human pose as compositional tokens. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 660–671 (2023). https://doi.org/10.1109/CVPR52729.2023.00071
Liu, H., Liu, F., Fan, X., Huang, D.: Polarized self-attention: towards high-quality pixel-wise mapping. Neurocomputing 506, 158–167 (2022)
Li, K., Wang, S., Zhang, X., Xu, Y., Xu, W., Tu, Z.: Pose recognition with cascade transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1944–1953 (2021). https://doi.org/10.1109/CVPR46437.2021.00198
Li, Y., Zhang, S., Wang, Z., Yang, S., Yang, W., Xia, S.-T., Zhou, E.: Tokenpose: learning keypoint tokens for human pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11313–11322 (2021). https://doi.org/10.1109/ICCV48922.2021.01112
Mao, W., Ge, Y., Shen, C., Tian, Z., Wang, X., Wang, Z.: Tfpose: direct human pose estimation with transformers (2021). https://doi.org/10.48550/arXiv.2103.15320. arXiv preprint arXiv:2103.15320
Zhang, F., Zhu, X., Dai, H., Ye, M., Zhu, C.: Distribution-aware coordinate representation for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7093–7102 (2020). https://doi.org/10.1109/CVPR42600.2020.00712
Acknowledgements
The research was supported by the National Natural Science Foundation of China under Grant 62267007, Gansu Provincial Department of Education Higher Education Industry Support Plan Project under Grant 2022CYZC-16.
Author information
Authors and Affiliations
Contributions
Hengrui Zhang wrote the main manuscript text. Jia Liu is responsible for proofreading. Qi Yongfeng is responsible for assisting.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no Conflict of interest.
Additional information
Communicated by Chenggang Yan.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Qi, Y., Zhang, H. & Liu, J. More accurate heatmap generation method for human pose estimation. Multimedia Systems 30, 180 (2024). https://doi.org/10.1007/s00530-024-01390-0
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s00530-024-01390-0