
MSRT: multi-scale representation transformer for regression-based human pose estimation

  • Theoretical Advances
  • Published:
Pattern Analysis and Applications

Abstract

In this paper, we address the human pose estimation problem with a focus on learning discriminative pose features. Recent pose estimation methods concentrate on extracting high-level features but ignore low-level details, which reduces prediction accuracy. To mitigate this issue, we propose an end-to-end method called the multi-scale representation transformer network (MSRT). Our network consists of two key components: a feature aggregation module (FAM) and Transformers. The FAM splits and stacks feature maps of different scales and then fuses them to achieve multi-scale representation learning, compensating for the detail information missing from high-level features. Furthermore, we utilize Transformers to model long-range interactions among feature maps and to capture implicit body-structure information, which allows the proposed network to refine the locations of terminal and occluded joints. Compared with existing regression-based methods, MSRT achieves superior results on the COCO 2017 and MPII datasets.
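To make the described pipeline concrete, below is a minimal PyTorch sketch of the two components named in the abstract. All module names, channel sizes, the upsample-and-concatenate fusion, and the mean-pooled regression head are illustrative assumptions; the abstract does not specify the authors' exact design.

```python
# Hypothetical sketch of a FAM-style fusion block and a Transformer
# regression head; details are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAggregationModule(nn.Module):
    """Resize multi-scale feature maps to a shared resolution, stack
    them along the channel axis, and fuse them with a 1x1 convolution."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.fuse = nn.Conv2d(sum(in_channels), out_channels, kernel_size=1)

    def forward(self, features):
        # features: list of [B, C_i, H_i, W_i] tensors at different scales
        target = features[0].shape[-2:]  # upsample everything to the finest scale
        resized = [F.interpolate(f, size=target, mode="bilinear",
                                 align_corners=False) for f in features]
        return self.fuse(torch.cat(resized, dim=1))  # [B, out_channels, H, W]


class TransformerRegressionHead(nn.Module):
    """Flatten the fused map into tokens, model long-range interactions
    with a Transformer encoder, and regress 2D joint coordinates."""

    def __init__(self, dim=256, num_joints=17, depth=4, heads=8):
        super().__init__()
        self.num_joints = num_joints
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.regress = nn.Linear(dim, 2 * num_joints)  # (x, y) per joint

    def forward(self, fused):
        tokens = fused.flatten(2).transpose(1, 2)  # [B, H*W, dim]
        tokens = self.encoder(tokens)              # long-range interactions
        pooled = tokens.mean(dim=1)                # global pose descriptor
        return self.regress(pooled).view(-1, self.num_joints, 2)


# Example: fuse three backbone scales, then regress 17 COCO joints.
feats = [torch.randn(1, 256, 64, 48),
         torch.randn(1, 512, 32, 24),
         torch.randn(1, 1024, 16, 12)]
fam = FeatureAggregationModule([256, 512, 1024], out_channels=256)
head = TransformerRegressionHead(dim=256, num_joints=17)
print(head(fam(feats)).shape)  # torch.Size([1, 17, 2])
```

In this reading, the FAM supplies the low-level spatial detail that a purely high-level feature map lacks, while the encoder's global attention is what would let the head reason about occluded or terminal joints from the rest of the body.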


Data Availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.


Author information

Corresponding author

Correspondence to Qingxuan Shi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Shan, B., Shi, Q. & Yang, F. MSRT: multi-scale representation transformer for regression-based human pose estimation. Pattern Anal Applic 26, 591–603 (2023). https://doi.org/10.1007/s10044-023-01130-6
