EfficientHRNet

Neff, Christopher; Sheth, Aneri; Furgurson, Steven; Middleton, John; Tabkhi, Hamed

doi:10.1007/s11554-021-01132-9

Efficient and scalable high-resolution networks for real-time multi-person 2D human pose estimation

Special Issue Paper
Published: 01 June 2021

Volume 18, pages 1037–1049, (2021)

Journal of Real-Time Image Processing

Christopher Neff ORCID: orcid.org/0000-0003-0850-3449¹,
Aneri Sheth¹,
Steven Furgurson¹,
John Middleton¹ &
Hamed Tabkhi¹

Abstract

There is an increasing demand for lightweight multi-person pose estimation for many emerging smart IoT applications. However, the existing algorithms tend to have large model sizes and intense computational requirements, making them ill-suited for real-time applications and deployment on resource-constrained hardware. Lightweight and real-time approaches are exceedingly rare and come at the cost of inferior accuracy. In this paper, we present EfficientHRNet, a family of lightweight multi-person human pose estimators that are able to perform in real-time on resource-constrained devices. By unifying recent advances in model scaling with high-resolution feature representations, EfficientHRNet creates highly accurate models while reducing computation enough to achieve real-time performance. The largest model is able to come within 4.4% accuracy of the current state-of-the-art, while having 1/3 the model size and 1/6 the computation, achieving 23 FPS on Nvidia Jetson Xavier. Compared to the top real-time approach, EfficientHRNet increases accuracy by 22% while achieving similar FPS with 1/3 the power. At every level, EfficientHRNet proves to be more computationally efficient than other bottom-up 2D human pose estimation approaches, while achieving highly competitive accuracy.

Notes

The source code of EfficientHRNet has been provided here: https://github.com/TeCSAR-UNCC/EfficientHRNet.
Bottom-up implementation reported in [13].
http://cocodataset.org/#keypoints-eval.

References

Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: a deep convolutional encoder–decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017)
Bulat, A., Tzimiropoulos, G.: Human pose estimation via convolutional part heatmap regression. CoRR arXiv:1609.01743 (2016)
Bulat, A., Tzimiropoulos, G.: Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. CoRR arXiv:1703.00862 (2017)
Cao, Z., Hidalgo, G., Simon, T., Wei, S., Sheikh, Y.: Openpose: realtime multi-person 2d pose estimation using part affinity fields. CoRR arXiv:1812.08008 (2018)
Cao, Z., Simon, T., Wei, S., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. CoRR arXiv:1611.08050 (2016)
Chen, K., Gabriel, P., Alasfour, A., Gong, C., Doyle, W.K., Devinsky, O., Friedman, D., Dugan, P., Melloni, L., Thesen, T., Gonda, D., Sattar, S., Wang, S., Gilja, V.: Patient-specific pose estimation in clinical environments. IEEE J. Transl. Eng. Health Med. 6, 1–11 (2018). https://doi.org/10.1109/JTEHM.2018.2875464
Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR arXiv:1606.00915 (2016)
Chen, L., Yang, Y., Wang, J., Xu, W., Yuille, A.L.: Attention to scale: Scale-aware semantic image segmentation. CoRR arXiv:1511.03339 (2015)
Chen, L., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. CoRR arXiv:1802.02611 (2018)
Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. CoRR arXiv:1711.07319 (2017)
Cheng, B., Wei, Y., Shi, H., Feris, R.S., Xiong, J., Huang, T.S.: Decoupled classification refinement: Hard false positive suppression for object detection. CoRR arXiv:1810.04002 (2018)
Cheng, B., Wei, Y., Shi, H., Feris, R.S., Xiong, J., Huang, T.S.: Revisiting RCNN: on awakening the classification power of faster RCNN. CoRR arXiv:1803.06799 (2018)
Cheng, B., Xiao, B., Wang, J., Shi, H., Huang, T.S., Zhang, L.: HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5385–5394. https://doi.org/10.1109/CVPR42600.2020.00543
Dantone, M., Gall, J., Leistner, C., Van Gool, L.: Human pose estimation using body parts dependent joint regressors. In: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3041–3048. https://doi.org/10.1109/CVPR.2013.391
Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. https://doi.org/10.1109/CVPR.2009.5206848
Ditty, M., Karandikar, A., Reed, D.: Nvidia xavier soc (2018)
Fang, H., Xie, S., Tai, Y., Lu, C.: RMPE: regional multi-person pose estimation. In: Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2353–2362. https://doi.org/10.1109/ICCV.2017.256
Fang, Z., López, A.M.: Intention recognition of pedestrians and cyclists by 2d pose estimation. IEEE Trans. Intell. Transport. Syst. 21(11), 4773–4783 (2020). https://doi.org/10.1109/TITS.2019.2946642
Ge, R., Kakade, S.M., Kidambi, R., Netrapalli, P.: The step decay schedule: A near optimal, geometrically decaying learning rate procedure. CoRR arXiv:1904.12838 (2019)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Y.W. Teh, M. Titterington (eds.) Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 9, pp. 249–256. PMLR, Chia Laguna Resort, Sardinia, Italy (2010). http://proceedings.mlr.press/v9/glorot10a.html
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR arXiv:1512.03385 (2015)
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: efficient convolutional neural networks for mobile vision applications (2017)
Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q.: Multi-scale dense convolutional networks for efficient prediction. CoRR arXiv:1703.09844 (2017)
Huang, S., Gong, M., Tao, D.: A coarse-fine network for keypoint localization. In: Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3047–3056. https://doi.org/10.1109/ICCV.2017.329
Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., Schiele, B.: Deepercut: A deeper, stronger, and faster multi-person pose estimation model. CoRR arXiv:1605.03170 (2016)
Iqbal, U., Gall, J.: Multi-person pose estimation with local joint-to-person associations. CoRR arXiv:1608.08526 (2016)
Jain, A., Tompson, J., Andriluka, M., Taylor, G. W., Bregler, C.: Learning human pose estimation features with convolutional networks. In: Proceedings of the 2nd international conference on learning representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings
Jetson xavier nx developer kit (2020). https://developer.nvidia.com/embedded/jetson-xavier-nx-devkit. Accessed 8 Nov 2020
John: trt_pose. https://github.com/NVIDIA-AI-IOT/trt_pose. Accessed 9 Nov 2020
Ke, L., Chang, M., Qi, H., Lyu, S.: Multi-scale structure-aware network for human pose estimation. CoRR arXiv:1803.09894 (2018)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR arXiv:1412.6980 (2015)
Kocabas, M., Karagoz, S., Akbas, E.: Multiposenet: Fast multi-person pose estimation using pose residual network. CoRR arXiv:1807.04067 (2018)
Kreiss, S., Bertoni, L., Alahi, A.: Pifpaf: Composite fields for human pose estimation. CoRR arXiv:1903.06593 (2019)
Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep. (2009). http://www.cs.toronto.edu/~kriz/cifar.html
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems: Volume 1, NIPS’12, pp. 1097–1105. Curran Associates Inc., Red Hook, NY, USA (2012)
Lin, T., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. CoRR arXiv:1612.03144 (2016)
Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. Lecture Notes in Computer Science, vol. 8693. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Neff, C., Mendieta, M., Mohan, S., Baharani, M., Rogers, S., Tabkhi, H.: Revamp2t: real-time edge video analytics for multicamera privacy-aware pedestrian tracking. IEEE Internet Things J. 7(4), 2591–2602 (2020). https://doi.org/10.1109/JIOT.2019.2954804
Newell, A., Deng, J.: Associative embedding: End-to-end learning for joint detection and grouping. CoRR arXiv:1611.05424 (2016)
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. CoRR arXiv:1603.06937 (2016)
Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. CoRR arXiv:1505.04366 (2015)
Openvino toolkit. https://software.intel.com/en-us/openvino-toolkit. Accessed 8 Nov 2020.
Osokin, D.: Real-time 2d multi-person pose estimation on CPU: lightweight openpose. CoRR arXiv:1811.12004 (2018)
Papandreou, G., Zhu, T., Chen, L., Gidaris, S., Tompson, J., Murphy, K.: Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. CoRR arXiv:1803.08225 (2018)
Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., Murphy, K.P.: Towards accurate multi-person pose estimation in the wild. CoRR arXiv:1701.01779 (2017)
Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P.V., Schiele, B.: Deepcut: Joint subset partition and labeling for multi person pose estimation. CoRR arXiv:1511.06645 (2015)
Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., Dollár, P.: Designing network design spaces. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10425–10433. https://doi.org/10.1109/CVPR42600.2020.01044
Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. CoRR arXiv:1506.01497 (2015)
Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. CoRR arXiv:1505.04597 (2015)
Ruder, S.: An overview of gradient descent optimization algorithms. CoRR arXiv:1609.04747 (2016)
Saharan, A.: Creating a human pose estimation application with nvidia deepstream (2020). https://developer.nvidia.com/blog/creating-a-human-pose-estimation-application-with-deepstream-sdk/. Accessed 8 Nov 2020
Saxena, S., Verbeek, J.: Convolutional neural fabrics. CoRR arXiv:1606.02492 (2016)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014)
Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5686–5696. https://doi.org/10.1109/CVPR.2019.00584
Sun, K., Zhao, Y., Jiang, B., Cheng, T., Xiao, B., Liu, D., Mu, Y., Wang, X., Liu, W., Wang, J.: High-resolution representations for labeling pixels and regions. CoRR arXiv:1904.04514 (2019)
Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neural networks. CoRR arXiv:1905.11946 (2019)
Tan, M., Pang, R., Le, Q.V.: EfficientDet: scalable and efficient object detection. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10778–10787. https://doi.org/10.1109/CVPR42600.2020.01079
Toshev, A., Szegedy, C.: Deeppose: Human pose estimation via deep neural networks. CoRR arXiv:1312.4659 (2013)
Wang, J., et al.: Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. https://doi.org/10.1109/TPAMI.2020.2983686
Wei, S., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. CoRR arXiv:1602.00134 (2016)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. CoRR arXiv:1807.10221 (2018)
Yang, L., Qin, Y., Zhang, X.: Lightweight densely connected residual network for human pose estimation. J Real-Time Image Proc 18, 825–837 (2021)
Yang, W., Li, S., Ouyang, W., Li, H., Wang, X.: Learning feature pyramids for human pose estimation. CoRR arXiv:1708.01101 (2017)
Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: CVPR 2011, pp. 1385–1392 (2011)
Zhang, Z., Zhang, X., Peng, C., Cheng, D., Sun, J.: Exfuse: enhancing feature fusion for semantic segmentation. CoRR arXiv:1804.03821 (2018)
Zhong, F., Li, M., Zhang, K., Hu, J., Liu, L.: Dspnet: a low computational-cost network for human pose estimation. Neurocomputing 423, 327–335 (2021)
Zhou, Y., Hu, X., Zhang, B.: Interlinked convolutional neural networks for face parsing. CoRR arXiv:1806.02479 (2018)
Zhu, H., Qiao, Y., Xu, G., Deng, L., Yu, Y.F.: Dspnet: a lightweight dilated convolution neural networks for spectral deconvolution with self-paced learning. IEEE Trans. Ind. Inform. 16(12), 7392–7401 (2020). https://doi.org/10.1109/TII.2019.2960837

Author information

Authors and Affiliations

University of North Carolina at Charlotte, Charlotte, NC, USA
Christopher Neff, Aneri Sheth, Steven Furgurson, John Middleton & Hamed Tabkhi

Corresponding author

Correspondence to Christopher Neff.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research is supported by the National Science Foundation (NSF) under Award no. 1831795.

About this article

Cite this article

Neff, C., Sheth, A., Furgurson, S. et al. EfficientHRNet. J Real-Time Image Proc 18, 1037–1049 (2021). https://doi.org/10.1007/s11554-021-01132-9

Received: 06 January 2021
Accepted: 13 May 2021
Published: 01 June 2021
Issue Date: August 2021
DOI: https://doi.org/10.1007/s11554-021-01132-9
