
MTPose: Human Pose Estimation with High-Resolution Multi-scale Transformers

Abstract

HRNet (High-Resolution Network), as reported by Sun et al. (in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2019), has been the state-of-the-art human pose estimation method, benefiting from its parallel high-resolution network structure. However, HRNet is still a typical CNN (Convolutional Neural Network) architecture built on local convolution operations. Recently, Transformers have been successfully applied to many computer vision tasks. The core mechanism in Transformers is self-attention, which learns global, long-range dependencies among different parts of the input. In this paper, we propose a human pose estimation framework built upon high-resolution multi-scale Transformers, termed MTPose. We combine the advantages of high resolution and Transformers to improve performance. Specifically, we design a sub-network, MTNet (Multi-scale Transformers-based high-resolution Network), which consists of two parallel branches. One is a high-resolution branch with local convolution operations, named the local branch. The other is the global branch, which uses multi-scale Transformer encoders to learn long-range dependencies among the whole-body keypoints. At the end of the network, the two branches are fused to predict the final keypoint heatmaps. Experiments on two benchmark datasets, the MSCOCO keypoint detection dataset and the MPII human pose dataset, demonstrate that our method significantly improves upon state-of-the-art human pose estimation methods. Code will be available at: https://github.com/fudiGeng/MTPose.
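
To make the two-branch design described in the abstract concrete, the following is a minimal PyTorch sketch of one way such an architecture could be wired up: a convolutional local branch kept at a single high resolution, a global branch that tokenizes the features at several scales and passes them through a Transformer encoder, and an additive fusion followed by a 1x1 convolution that predicts K keypoint heatmaps. The module names, channel widths, token scales, and fusion rule are illustrative assumptions, not the authors' MTNet/MTPose implementation (which is the code referenced at the GitHub link above).

```python
# Minimal sketch (not the authors' implementation) of a two-branch pose network
# in the spirit of MTPose: a high-resolution convolutional "local" branch plus a
# multi-scale Transformer "global" branch, fused to predict keypoint heatmaps.
# Channel widths, token scales, and the fusion rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalBranch(nn.Module):
    """Tokenize a feature map at several scales and run a Transformer encoder."""

    def __init__(self, channels, scales=(2, 4, 8), depth=4, heads=8):
        super().__init__()
        self.scales = scales
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens, grids = [], []
        for s in self.scales:
            # Pool to a coarser grid, then flatten to (B, N_s, C) tokens.
            feat = F.adaptive_avg_pool2d(x, (h // s, w // s))
            grids.append(feat.shape[-2:])
            tokens.append(feat.flatten(2).transpose(1, 2))
        out = self.encoder(torch.cat(tokens, dim=1))  # joint attention over all scales
        # Keep the finest-scale tokens and restore their spatial layout.
        n0 = grids[0][0] * grids[0][1]
        return out[:, :n0].transpose(1, 2).reshape(b, c, *grids[0])


class MTPoseSketch(nn.Module):
    """Local conv branch + global Transformer branch -> K keypoint heatmaps."""

    def __init__(self, in_ch=3, width=64, num_keypoints=17):
        super().__init__()
        self.stem = nn.Sequential(                 # 1/4-resolution feature map
            nn.Conv2d(in_ch, width, 3, stride=4, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True))
        self.local_branch = nn.Sequential(         # local convolution operations
            nn.Conv2d(width, width, 3, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True))
        self.global_branch = GlobalBranch(width)   # long-range dependencies
        self.head = nn.Conv2d(width, num_keypoints, 1)

    def forward(self, x):
        feat = self.stem(x)
        glob = F.interpolate(self.global_branch(feat), size=feat.shape[-2:],
                             mode="bilinear", align_corners=False)
        fused = self.local_branch(feat) + glob     # integrate the two branches
        return self.head(fused)                    # (B, K, H/4, W/4) heatmaps


if __name__ == "__main__":
    model = MTPoseSketch()
    heatmaps = model(torch.randn(1, 3, 256, 192))  # MSCOCO-style input size
    print(heatmaps.shape)                          # torch.Size([1, 17, 64, 48])
```

Running joint attention over tokens pooled at several scales is only one simple way to realize "multi-scale" Transformer encoding; the paper's MTNet may organize its encoders and fusion differently.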


References

  1. Si C, Chen W, Wang W, Wang L, Tan T (2019) An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 1227–1236. https://doi.org/10.1109/CVPR.2019.00132

  2. Yang C, Xu Y, Shi J, Dai B, Zhou B (2020) Temporal pyramid network for action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 591–600. https://doi.org/10.1109/CVPR42600.2020.00067

  3. Gao J, Zheng WS, Pan JH, Gao C, Wang Y, Zeng W, Lai J (2020) An asymmetric modeling for action assessment. In: European conference on computer vision (ECCV), pp. 222–238. https://doi.org/10.1007/978-3-030-58577-8_14

  4. Pan JH, Gao J, Zheng WS (2019) Action assessment by joint relation graphs. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 6331–6340. https://doi.org/10.1109/ICCV.2019.00643

  5. Snower M, Kadav A, Lai F, Graf HP (2020) 15 Keypoints is all you need. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 6738–6748. https://doi.org/10.1109/CVPR42600.2020.00677

  6. Ning G, Pei J, Huang H (2020) LightTrack: a generic framework for online top-down human pose tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 1034–1035. https://doi.org/10.1109/CVPRW50498.2020.00525

  7. Wang M, Tighe J, Modolo D (2020) Combining detection and tracking for human pose estimation in videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 11088–11096. https://doi.org/10.1109/CVPR42600.2020.01110

  8. Rafi U, Doering A, Leibe B, Gall J (2020) Self-supervised keypoint correspondences for multi-person pose estimation and tracking in videos. In: European conference on computer vision (ECCV). https://doi.org/10.1007/978-3-030-58565-5_3

  9. Kwon OH, Tanke J, Gall J (2020) Recursive Bayesian filtering for multiple human pose tracking from multiple cameras. In: Proceedings of the asian conference on computer vision (ACCV). https://doi.org/10.1007/978-3-030-69532-3_27

  10. Kocabas M, Athanasiou N, Black MJ (2020) VIBE: video inference for human body pose and shape estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5253–5263. https://doi.org/10.1109/CVPR42600.2020.00530

  11. Chen H, Guo P, Li P, Lee GH, Chirikjian G (2020) Multi-person 3D pose estimation in crowded scenes based on multi-view geometry. In: European conference on computer vision (ECCV), pp 541–557. https://doi.org/10.1007/978-3-030-58580-8_32

  12. Kolotouros N, Pavlakos G, Black MJ, Daniilidis K (2019) Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp. 2252–2261. https://doi.org/10.1109/ICCV.2019.00234

  13. Qiu H, Wang C, Wang J, Wang N, Zeng W (2019) Cross view fusion for 3D human pose estimation. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp. 4342–4351. https://doi.org/10.1109/ICCV.2019.00444

  14. Zhou X, Huang Q, Sun X, Xue X, Wei Y (2017) Towards 3D human pose estimation in the wild: a weakly-supervised approach. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 398–407. https://doi.org/10.1109/ICCV.2017.51

  15. Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. In: European conference on computer vision (ECCV). https://doi.org/10.1007/978-3-319-46484-8_29

  16. Xiao B, Wu H, Wei Y (2018) Simple baselines for human pose estimation and tracking. In: Proceedings of the European conference on computer vision (ECCV), pp 466–481. https://doi.org/10.1007/978-3-030-01231-1_29

  17. Sun K, Xiao B, Liu D, Wang J (2019) Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5693–5703. https://doi.org/10.1109/CVPR.2019.00584

  18. Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, Liu D, Mu Y, Tan M, Wang X, Liu W, Xiao B (2020) Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2020.2983686

  19. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An image is worth 16x16 words: transformers for image recognition at scale. In: International conference on learning representations (ICLR)

  20. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision (ECCV)

  21. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: European conference on computer vision (ECCV), pp 740–755. https://doi.org/10.1007/978-3-319-10602-1_48

  22. Andriluka M, Pishchulin L, Gehler P, Schiele B (2014) 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 3686–3693. https://doi.org/10.1109/CVPR.2014.471

  23. Toshev A, Szegedy C (2014) DeepPose: human pose estimation via deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 1653–1660. https://doi.org/10.1109/CVPR.2014.214

  24. Wei SE, Ramakrishna V, Kanade T, Sheikh Y (2016) Convolutional pose machines. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 4724–4732. https://doi.org/10.1109/CVPR.2016.511

  25. Cao Z, Simon T, Wei SE, Sheikh Y (2017) Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). https://doi.org/10.1109/CVPR.2017.143

  26. Newell A, Huang Z, Deng J (2017) Associative embedding: end-to-end learning for joint detection and grouping. Adv Neural Inf Process Syst

  27. Kreiss S, Bertoni L, Alahi A (2019) PifPaf: composite fields for human pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 11977–11986. https://doi.org/10.1109/CVPR.2019.01225

  28. Cheng B, Xiao B, Wang J, Shi H, Huang TS, Zhang L (2020) HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5386–5395. https://doi.org/10.1109/CVPR42600.2020.00543

  29. Luo Z, Wang Z, Huang Y, Wang L, Tan T, Zhou E (2021) Rethinking the heatmap regression for bottom-up human pose estimation. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR)

  30. Geng Z, Sun K, Xiao B, Zhang Z, Wang J (2021) Bottom-up human pose estimation via disentangled keypoint regression. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR)

  31. Mao W, Tian Z, Wang X, Shen C (2021) FCPose: fully convolutional multi-person pose estimation with dynamic instance-aware convolutions. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR)

  32. Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 91–99. https://doi.org/10.1109/TPAMI.2016.2577031

  33. Yu C, Xiao B, Gao C, Yuan L, Zhang L, Sang N, Wang J (2021) Lite-HRNet: a lightweight high-resolution network. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR), pp 10440–10450

  34. Zhang X, Zhou X, Lin M, Sun J (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proc. IEEE conference on computer vision and pattern recognition (CVPR), pp 6848–6856. https://doi.org/10.1109/CVPR.2018.00716

  35. He K, Zhang X, Ren S, Sun J (2016) Deep Residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 770–778. https://doi.org/10.1109/CVPR.2016.90

  36. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008

  37. Zhu X, Su W, Lu L, Li B, Wang X, Dai J (2021) Deformable DETR: deformable transformers for end-to-end object detection. In: International conference on learning representations (ICLR)

  38. Wang Y, Xu Z, Wang X, Shen C, Cheng B, Shen H, Xia H (2021) End-to-end video instance segmentation with transformers. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR), pp 8741–8750

  39. Huang L, Tan J, Liu J, Yuan J (2020) Hand-transformer: non-autoregressive structured modeling for 3D hand pose estimation. In: Proceedings European conference on computer vision (ECCV). Springer, pp 17–33

  40. Wang N, Zhou W, Wang J, Li H (2021) Transformer meets tracker: exploiting temporal context for robust visual tracking. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR), pp 1571–1580

  41. Meinhardt T, Kirillov A, Leal-Taixe L, Feichtenhofer C (2021) TrackFormer: multi-object tracking with transformers. arXiv preprint arXiv:2101.02702

  42. Chen H, Wang Y, Guo T, Xu C, Deng Y, Liu Z, Ma S, Xu C, Xu C, Gao W (2021) Pre-trained image processing transformer. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR), pp 12299–12310

  43. Dai Z, Liu H, Le Q, Tan M (2021) CoAtNet: marrying convolution and attention for all data sizes. arXiv preprint arXiv:2106.04803

  44. Bello I, Zoph B, Vaswani A, Shlens J, Le QV (2019) Attention augmented convolutional networks. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 3286–3295. https://doi.org/10.1109/ICCV.2019.00338

  45. Srinivas A, Lin TY, Parmar N, Shlens J, Abbeel P, Vaswani A (2021) Bottleneck transformers for visual recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 16519–16529

  46. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030

  47. Wu H, Xiao B, Codella N, Liu M, Dai X, Yuan L, Zhang L (2021) CvT: introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808

  48. Yuan L, Chen Y, Wang T, Yu W, Shi Y, Jiang ZH, Tay FE, Feng J, Yan S (2021) Tokens-to-Token ViT: training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986

  49. Wang W, Xie E, Li X, Fan DP, Song K, Liang D, Lu T, Luo P, Shao L (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122

  50. Mao W, Ge Y, Shen C, Tian Z, Wang X, Wang Z (2021) TFPose: direct human pose estimation with transformers. arXiv preprint arXiv:2103.15320

  51. Chen Y, Wang Z, Peng Y, Zhang Z, Yu G, Sun J (2018) Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 7103–7112. https://doi.org/10.1109/CVPR.2018.00742

  52. Papandreou G, Zhu T, Chen LC, Gidaris S, Tompson J, Murphy K (2018) PersonLab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In: Proceedings of the European conference on computer vision (ECCV), pp 269–286

  53. Kocabas M, Karagoz S, Akbas E (2018) MultiPoseNet: fast multi-person pose estimation using pose residual network. In: Proceedings of the European conference on computer vision (ECCV), pp 417–433

  54. He K, Gkioxari G, Dollár P, Girshick R (2017) Mask R-CNN. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 2961–2969. https://doi.org/10.1109/TPAMI.2018.2844175

  55. Papandreou G, Zhu T, Kanazawa N, Toshev A, Tompson J, Bregler C, Murphy K (2017) Towards accurate multi-person pose estimation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 4903–4911. https://doi.org/10.1109/CVPR.2017.395

  56. Fang HS, Xie S, Tai YW, Lu C (2017) RMPE: regional multi-person pose estimation. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 2334–2343. https://doi.org/10.1109/ICCV.2017.256

  57. Gamra MB, Akhloufi MA (2021) A review of deep learning techniques for 2D and 3D human pose estimation. Image Vis Comput 114:104282. https://doi.org/10.1016/j.imavis.2021.104282

  58. Nibali A, Millward J, He Z, Morgan S (2021) ASPset: an outdoor sports pose video dataset with 3D keypoint annotations. Image Vis Comput 111:104196. https://doi.org/10.1016/j.imavis.2021.104196

  59. Zhang W, Wang X, You W, Chen J, Dai P, Zhang P (2019) RESLS: region and edge synergetic level set framework for image segmentation. IEEE Trans Image Process 29:57–71. https://doi.org/10.1109/TIP.2019.2928134

  60. Xiao Y (2014) Blurred trace infrared image segmentation based on template approach and immune factor. Infrared Phys Technol 67:116–120. https://doi.org/10.1016/j.infrared.2014.07.002

  61. Xiao Y, Zijie Z (2020) Infrared image extraction algorithm based on adaptive growth immune field. Neural Process Lett 51:2575–2587. https://doi.org/10.1007/s11063-020-10218-7

  62. Zhu H, Zhang Q, Wang Q, Li H (2017) 4D light field superpixel and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 6384–6392. https://doi.org/10.1109/TIP.2019.2927330

  63. Yu X, Zhou Z, Gao Q, Li D, Ríha K (2018) Infrared image segmentation using growing immune field and clone threshold. Infrared Phys Technol 88:184–193. https://doi.org/10.1016/j.infrared.2017.11.029

  64. Zhou Z, Zhang B, Yu X (2021) Infrared handprint classification using deep convolution neural network. Neural Process Lett. https://doi.org/10.1007/s11063-021-10429-6

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61771299.

Author information

Corresponding author

Correspondence to Xiangyang Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, R., Geng, F. & Wang, X. MTPose: Human Pose Estimation with High-Resolution Multi-scale Transformers. Neural Process Lett 54, 3941–3964 (2022). https://doi.org/10.1007/s11063-022-10794-w
