EfficientHRNet

Efficient and scalable high-resolution networks for real-time multi-person 2D human pose estimation

  • Special Issue Paper
Journal of Real-Time Image Processing

Abstract

Demand for lightweight multi-person pose estimation is growing across many emerging smart IoT applications. However, existing algorithms tend to have large model sizes and intense computational requirements, making them ill-suited to real-time applications and to deployment on resource-constrained hardware. Lightweight, real-time approaches are exceedingly rare and come at the cost of inferior accuracy. In this paper, we present EfficientHRNet, a family of lightweight multi-person human pose estimators that run in real time on resource-constrained devices. By unifying recent advances in model scaling with high-resolution feature representations, EfficientHRNet creates highly accurate models while reducing computation enough to achieve real-time performance. The largest model comes within 4.4% accuracy of the current state-of-the-art while having 1/3 the model size and 1/6 the computation, achieving 23 FPS on Nvidia Jetson Xavier. Compared to the top real-time approach, EfficientHRNet increases accuracy by 22% while achieving similar FPS with 1/3 the power. At every level, EfficientHRNet proves more computationally efficient than other bottom-up 2D human pose estimation approaches while achieving highly competitive accuracy.

Notes

  1. The source code of EfficientHRNet has been provided here: https://github.com/TeCSAR-UNCC/EfficientHRNet.

  2. Bottom-up implementation reported in [13].

  3. http://cocodataset.org/#keypoints-eval.

References

  1. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: a deep convolutional encoder–decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017)

  2. Bulat, A., Tzimiropoulos, G.: Human pose estimation via convolutional part heatmap regression. CoRR arXiv:1609.01743 (2016)

  3. Bulat, A., Tzimiropoulos, G.: Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. CoRR arXiv:1703.00862 (2017)

  4. Cao, Z., Hidalgo, G., Simon, T., Wei, S., Sheikh, Y.: Openpose: realtime multi-person 2d pose estimation using part affinity fields. CoRR arXiv:1812.08008 (2018)

  5. Cao, Z., Simon, T., Wei, S., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. CoRR arXiv:1611.08050 (2016)

  6. Chen, K., Gabriel, P., Alasfour, A., Gong, C., Doyle, W.K., Devinsky, O., Friedman, D., Dugan, P., Melloni, L., Thesen, T., Gonda, D., Sattar, S., Wang, S., Gilja, V.: Patient-specific pose estimation in clinical environments. IEEE J. Transl. Eng. Health Med. 6, 1–11 (2018). https://doi.org/10.1109/JTEHM.2018.2875464

  7. Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR arXiv:1606.00915 (2016)

  8. Chen, L., Yang, Y., Wang, J., Xu, W., Yuille, A.L.: Attention to scale: Scale-aware semantic image segmentation. CoRR arXiv:1511.03339 (2015)

  9. Chen, L., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. CoRR arXiv:1802.02611 (2018)

  10. Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. CoRR arXiv:1711.07319 (2017)

  11. Cheng, B., Wei, Y., Shi, H., Feris, R.S., Xiong, J., Huang, T.S.: Decoupled classification refinement: Hard false positive suppression for object detection. CoRR arXiv:1810.04002 (2018)

  12. Cheng, B., Wei, Y., Shi, H., Feris, R.S., Xiong, J., Huang, T.S.: Revisiting RCNN: on awakening the classification power of faster RCNN. CoRR arXiv:1803.06799 (2018)

  13. Cheng, B., Xiao, B., Wang, J., Shi, H., Huang, T.S., Zhang, L.: HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5385–5394. https://doi.org/10.1109/CVPR42600.2020.00543

  14. Dantone, M., Gall, J., Leistner, C., Van Gool, L.: Human pose estimation using body parts dependent joint regressors. In: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3041–3048. https://doi.org/10.1109/CVPR.2013.391

  15. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. https://doi.org/10.1109/CVPR.2009.5206848

  16. Ditty, M., Karandikar, A., Reed, D.: Nvidia xavier soc (2018)

  17. Fang, H., Xie, S., Tai, Y., Lu, C.: RMPE: regional multi-person pose estimation. In: Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2353–2362. https://doi.org/10.1109/ICCV.2017.256

  18. Fang, Z., López, A.M.: Intention recognition of pedestrians and cyclists by 2d pose estimation. IEEE Trans. Intell. Transport. Syst. 21(11), 4773–4783 (2020). https://doi.org/10.1109/TITS.2019.2946642

  19. Ge, R., Kakade, S.M., Kidambi, R., Netrapalli, P.: The step decay schedule: A near optimal, geometrically decaying learning rate procedure. CoRR arXiv:1904.12838 (2019)

  20. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Y.W. Teh, M. Titterington (eds.) Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 9, pp. 249–256. PMLR, Chia Laguna Resort, Sardinia, Italy (2010). http://proceedings.mlr.press/v9/glorot10a.html

  21. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR arXiv:1512.03385 (2015)

  22. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: efficient convolutional neural networks for mobile vision applications (2017)

  23. Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q.: Multi-scale dense convolutional networks for efficient prediction. CoRR arXiv:1703.09844 (2017)

  24. Huang, S., Gong, M., Tao, D.: A coarse-fine network for keypoint localization. In: Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3047–3056. https://doi.org/10.1109/ICCV.2017.329

  25. Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., Schiele, B.: Deepercut: A deeper, stronger, and faster multi-person pose estimation model. CoRR arXiv:1605.03170 (2016)

  26. Iqbal, U., Gall, J.: Multi-person pose estimation with local joint-to-person associations. CoRR arXiv:1608.08526 (2016)

  27. Jain, A., Tompson, J., Andriluka, M., Taylor, G.W., Bregler, C.: Learning human pose estimation features with convolutional networks. In: Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings

  28. Jetson xavier nx developer kit (2020). https://developer.nvidia.com/embedded/jetson-xavier-nx-devkit. Accessed 8 Nov 2020

  29. John: trt_pose. https://github.com/NVIDIA-AI-IOT/trt_pose. Accessed 9 Nov 2020

  30. Ke, L., Chang, M., Qi, H., Lyu, S.: Multi-scale structure-aware network for human pose estimation. CoRR arXiv:1803.09894 (2018)

  31. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR arXiv:1412.6980 (2015)

  32. Kocabas, M., Karagoz, S., Akbas, E.: Multiposenet: Fast multi-person pose estimation using pose residual network. CoRR arXiv:1807.04067 (2018)

  33. Kreiss, S., Bertoni, L., Alahi, A.: Pifpaf: Composite fields for human pose estimation. CoRR arXiv:1903.06593 (2019)

  34. Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep. (2009). http://www.cs.toronto.edu/~kriz/cifar.html

  35. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems: Volume 1, NIPS’12, pp. 1097–1105. Curran Associates Inc., Red Hook, NY, USA (2012)

  36. Lin, T., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. CoRR arXiv:1612.03144 (2016)

  37. Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. Lecture Notes in Computer Science, vol 8693. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  38. Neff, C., Mendieta, M., Mohan, S., Baharani, M., Rogers, S., Tabkhi, H.: Revamp2t: real-time edge video analytics for multicamera privacy-aware pedestrian tracking. IEEE Internet Things J. 7(4), 2591–2602 (2020). https://doi.org/10.1109/JIOT.2019.2954804

  39. Newell, A., Deng, J.: Associative embedding: End-to-end learning for joint detection and grouping. CoRR arXiv:1611.05424 (2016)

  40. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. CoRR arXiv:1603.06937 (2016)

  41. Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. CoRR arXiv:1505.04366 (2015)

  42. Openvino toolkit. https://software.intel.com/en-us/openvino-toolkit. Accessed 8 Nov 2020

  43. Osokin, D.: Real-time 2d multi-person pose estimation on CPU: lightweight openpose. CoRR arXiv:1811.12004 (2018)

  44. Papandreou, G., Zhu, T., Chen, L., Gidaris, S., Tompson, J., Murphy, K.: Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. CoRR arXiv:1803.08225 (2018)

  45. Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., Murphy, K.P.: Towards accurate multi-person pose estimation in the wild. CoRR arXiv:1701.01779 (2017)

  46. Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P.V., Schiele, B.: Deepcut: Joint subset partition and labeling for multi person pose estimation. CoRR arXiv:1511.06645 (2015)

  47. Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., Dollár, P.: Designing network design spaces. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10425–10433. https://doi.org/10.1109/CVPR42600.2020.01044

  48. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. CoRR arXiv:1506.01497 (2015)

  49. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. CoRR arXiv:1505.04597 (2015)

  50. Ruder, S.: An overview of gradient descent optimization algorithms. CoRR arXiv:1609.04747 (2016)

  51. Saharan, A.: Creating a human pose estimation application with nvidia deepstream (2020). https://developer.nvidia.com/blog/creating-a-human-pose-estimation-application-with-deepstream-sdk/. Accessed 8 Nov 2020

  52. Saxena, S., Verbeek, J.: Convolutional neural fabrics. CoRR arXiv:1606.02492 (2016)

  53. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014)

  54. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5686–5696. https://doi.org/10.1109/CVPR.2019.00584

  55. Sun, K., Zhao, Y., Jiang, B., Cheng, T., Xiao, B., Liu, D., Mu, Y., Wang, X., Liu, W., Wang, J.: High-resolution representations for labeling pixels and regions. CoRR arXiv:1904.04514 (2019)

  56. Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neural networks. CoRR arXiv:1905.11946 (2019)

  57. Tan, M., Pang, R., Le, Q.V.: EfficientDet: scalable and efficient object detection. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10778–10787. https://doi.org/10.1109/CVPR42600.2020.01079

  58. Toshev, A., Szegedy, C.: Deeppose: Human pose estimation via deep neural networks. CoRR arXiv:1312.4659 (2013)

  59. Wang, J., et al.: Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. https://doi.org/10.1109/TPAMI.2020.2983686

  60. Wei, S., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. CoRR arXiv:1602.00134 (2016)

  61. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. CoRR arXiv:1807.10221 (2018)

  62. Yang, L., Qin, Y., Zhang, X.: Lightweight densely connected residual network for human pose estimation. J Real-Time Image Proc 18, 825–837 (2021)

  63. Yang, W., Li, S., Ouyang, W., Li, H., Wang, X.: Learning feature pyramids for human pose estimation. CoRR arXiv:1708.01101 (2017)

  64. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. CVPR 2011, 1385–1392 (2011)

  65. Zhang, Z., Zhang, X., Peng, C., Cheng, D., Sun, J.: Exfuse: enhancing feature fusion for semantic segmentation. CoRR arXiv:1804.03821 (2018)

  66. Zhong, F., Li, M., Zhang, K., Hu, J., Liu, L.: Dspnet: a low computational-cost network for human pose estimation. Neurocomputing 423, 327–335 (2021)

  67. Zhou, Y., Hu, X., Zhang, B.: Interlinked convolutional neural networks for face parsing. CoRR arXiv:1806.02479 (2018)

  68. Zhu, H., Qiao, Y., Xu, G., Deng, L., Yu, Y.F.: Dspnet: a lightweight dilated convolution neural networks for spectral deconvolution with self-paced learning. IEEE Trans. Ind. Inform. 16(12), 7392–7401 (2020). https://doi.org/10.1109/TII.2019.2960837

Author information

Corresponding author

Correspondence to Christopher Neff.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research is supported by the National Science Foundation (NSF) under Award no. 1831795.

About this article

Cite this article

Neff, C., Sheth, A., Furgurson, S. et al. EfficientHRNet. J Real-Time Image Proc 18, 1037–1049 (2021). https://doi.org/10.1007/s11554-021-01132-9

  • DOI: https://doi.org/10.1007/s11554-021-01132-9
