MSRT: multi-scale representation transformer for regression-based human pose estimation

Shan, Beiguang; Shi, Qingxuan; Yang, Fang

doi:10.1007/s10044-023-01130-6

MSRT: multi-scale representation transformer for regression-based human pose estimation

Theoretical Advances
Published: 01 February 2023

Volume 26, pages 591–603, (2023)
Cite this article

Pattern Analysis and Applications Aims and scope Submit manuscript

Beiguang Shan¹,
Qingxuan Shi^1,2,3 &
Fang Yang^1,2,3

474 Accesses
3 Citations
1 Altmetric
Explore all metrics

Abstract

In this paper, we are interested in the human pose estimation problem with a focus on leveraging discriminative pose features. Recent pose estimation works concentrate on extracting high-level features but ignore the low-level details, thus reducing the prediction accuracy. To mitigate the above issues, we propose an end-to-end method called multi-scale representation transformer network (MSRT). Our network consists of two key components: feature aggregation module (FAM) and transformers. The FAM splits and stacks feature maps of different scales, then fuses them to achieve multi-scale representation learning. This module makes up for the lack of detailed information in the high-level features. Furthermore, we utilize Transformers to identify long-range interactions among feature maps, and capture implicit body structure information, which allows the proposed network to refine the locations of terminal and occluded joints. Compared with existing regression-based methods, MSRT achieves superior results on the COCO2017 and MPII datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 3

Visual attention network

Article Open access 28 July 2023

Multi-scale Dilated Attention Graph Convolutional Network for Skeleton-Based Action Recognition

Yoga pose classification: a CNN and MediaPipe inspired deep learning approach for real-world application

Article 03 June 2022

Data Availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

Geng Z, Sun K, Xiao B, Zhang Z, Wang J (2021) Bottom-up human pose estimation via disentangled keypoint regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14676–14686
Su C, Li J, Zhang S, Xing J, Gao W, Tian Q (2017) Pose-driven deep convolutional model for person re-identification. In: Proceedings of the IEEE international conference on computer vision, pp. 3960–3969
Farrajota M, Rodrigues JM, du Buf JH (2019) Human action recognition in videos with articulated pose information by deep networks. Pattern Anal Appl 22(4):1307–1318
Article MathSciNet Google Scholar
Xiao B, Wu H, Wei Y (2018) Simple baselines for human pose estimation and tracking. In: Proceedings of the European conference on computer vision (ECCV), pp. 466–481
Sun K, Xiao B, Liu D, Wang J (2019) Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5693–5703
Sun X, Xiao B, Wei F, Liang S, Wei Y (2018) Integral human pose regression. In: Proceedings of the European conference on computer vision (ECCV), pp. 529–545
Wei F, Sun X, Li H, Wang J, Lin S (2020) Point-set anchors for object detection, instance segmentation and pose estimation. In: European conference on computer vision, pp. 527–544
Fang H.-S, Xie S, Tai Y.-W, Lu C (2017) Rmpe: regional multi-person pose estimation. In: Proceedings of the IEEE international conference on computer vision, pp. 2334–2343
Li J, Wang C, Zhu H, Mao Y, Fang H-S, Lu C (2019) Crowdpose: efficient crowded scenes pose estimation and a new benchmark. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10863–10872
Hidalgo G, Raaj Y, Idrees H, Xiang D, Joo H, Simon T, Sheikh Y (2019) Single-network whole-body pose estimation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 6982–6991
Shi Q, Di H, Lu Y, Lv F, Tian X (2017) Video pose estimation with global motion cues. Neurocomputing 219:269–279
Article Google Scholar
Zhou T, Wang W, Liu S, Yang Y, Van Gool L (2021) Differentiable multi-granularity human representation learning for instance-aware human semantic parsing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1622–1631
Zhou L, Chen Y, Gao Y, Wang J, Lu H (2020) Occlusion-aware Siamese network for human pose estimation. In: European conference on computer vision, pp. 396–412
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30:5998–6008
Google Scholar
Sun X, Shang J, Liang S, Wei Y (2017) Compositional human pose regression. In: Proceedings of the IEEE international conference on computer vision, pp. 2602–2611
Li K, Wang S, Zhang X, Xu Y, Xu W, Tu Z (2021) Pose recognition with cascade transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1944–1953
Papandreou G, Zhu T, Kanazawa N, Toshev A, Tompson J, Bregler C, Murphy K(2017) Towards accurate multi-person pose estimation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4903–4911
Su K, Yu D, Xu Z, Geng X, Wang C (2019) Multi-person pose estimation with enhanced channel-wise and spatial information. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5674–5682
Li W, Wang Z, Yin B, Peng Q, Du Y, Xiao T, Yu G, Lu H, Wei Y, Sun J (2019) Rethinking on multi-stage networks for human pose estimation. arXiv preprint arXiv:1901.00148
Wang J, Long X, Gao Y, Ding E, Wen S (2020) Graph-PCNN: two stage human pose estimation with graph pose refinement. In: European conference on computer vision, pp. 492–508
Toshev A, Szegedy C (2014) Human pose estimation via deep neural networks. CVPR.(Columbus, Ohio, 2014), pp. 1653–1660
Carreira J, Agrawal P, Fragkiadaki K, Malik J (2016) Human pose estimation with iterative error feedback. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4733–4742
Tian Z, Chen H, Shen C (2019) Directpose: direct end-to-end multi-person pose estimation. arXiv preprint arXiv:1911.07451
Zhou X, Wang D, Krähenbühl P (2019) Objects as points. arXiv preprint arXiv:1904.07850
Nie X, Feng J, Zhang J, Yan S (2019) Single-stage multi-person pose machines. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 6951–6960
Li J, Bian S, Zeng A, Wang C, Pang B, Liu W, Lu C (2021) Human pose regression with residual log-likelihood estimation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 11025–11034
Mao W, Ge Y, Shen C, Tian Z, Wang X, Wang Z, Hengel A.V.D (2022) Poseur: direct human pose regression with transformers. arXiv preprint arXiv:2201.07412
Wang W, Song H, Zhao S, Shen J, Zhao S, Hoi S.C, Ling H (2019) Learning unsupervised video object segmentation through visual attention. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3064–3074
Zhou T, Li J, Wang S, Tao R, Shen J (2020) Matnet: motion-attentive transition network for zero-shot video object segmentation. IEEE Trans Image Process 29:8326–8338
Article MATH Google Scholar
Wang W, Zhao S, Shen J, Hoi S.C, Borji A (2019) Salient object detection with pyramid attention and salient edges. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1448–1457
Fan D.-P, Wang W, Cheng M.-M, Shen J (2019) Shifting more attention to video salient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8554–8564
Wang W, Shen J (2017) Deep visual attention prediction. IEEE Trans Image Process 27(5):2368–2378
Article MathSciNet Google Scholar
Wang W, Shen J (2017) Deep cropping via attention box prediction and aesthetics assessment. In: Proceedings of the IEEE international conference on computer vision, pp. 2186–2194
Zhu X, Su W, Lu L, Li B, Wang X, Dai J (2020) Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159
Yang S, Quan Z, Nie M, Yang W (2020) Transpose: towards explainable human pose estimation by transformer. arXiv preprint arXiv:2012.14214
Khan S, Naseer M, Hayat M, Zamir S.W, Khan F.S, Shah M (2021) Transformers in vision: a survey. arXiv preprint arXiv:2101.01169
Zheng C, Zhu S, Mendieta M, Yang T, Chen C, Ding Z (2021) 3d human pose estimation with spatial and temporal transformers. arXiv preprint arXiv:2103.10455
Han K, Wang Y, Chen H, Chen X, Guo J, Liu Z, Tang Y, Xiao A, Xu C, Xu Y, et al. (2020) A survey on visual transformer. arXiv preprint arXiv:2012.12556
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision, pp. 213–229
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030
Li Y, Zhang S, Wang Z, Yang S, Yang W, Xia S.-T, Zhou E (2021) Tokenpose: learning keypoint tokens for human pose estimation. arXiv preprint arXiv:2104.03516
Mao W, Ge Y, Shen C, Tian Z, Wang X, Wang Z (2021) Tfpose: direct human pose estimation with transformers. arXiv preprint arXiv:2103.15320
Yang Y, Ramanan D (2011) Articulated pose estimation with flexible mixtures-of-parts. In: CVPR 2011, pp. 1385–1392. IEEE
Chen X, Yuille AL (2015) Parsing occluded people by flexible compositions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3945–3954
Fu L, Zhang J, Huang K (2016) ORGM: occlusion relational graphical model for human pose estimation. IEEE Trans Image Process 26(2):927–941
Article MathSciNet MATH Google Scholar
Islam M.A, Jia S, Bruce N.D (2020) How much position information do convolutional neural networks encode? arXiv preprint arXiv:2001.08248
Wu K, Peng H, Chen M, Fu J, Chao H (2021) Rethinking and improving relative position encoding for vision transformer. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 10033–10041
Lin T.-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C.L (2014) Microsoft coco: common objects in context. In: European conference on computer vision, pp. 740–755
Andriluka M, Pishchulin L, Gehler P, Schiele B (2014) 2d human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3686–3693
Chen Y, Wang Z, Peng Y, Zhang Z, Yu G, Sun J (2018) Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7103–7112
Li Z, Ye J, Song M, Huang Y, Pan Z (2021) Online knowledge distillation for efficient pose estimation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 11740–11750
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778
Tang W, Yu P, Wu Y (2018) Deeply learned compositional models for human pose estimation. In: Proceedings of the European conference on computer vision (ECCV), pp. 190–206
Nibali A, He Z, Morganc S, Prendergast L (2018) Numerical coordinate regression with convolutional neural networks. arXiv preprint arXiv:1801.07372

Download references

Author information

Authors and Affiliations

School of Cyber Security and Computer, Hebei University, Baoding, 071000, China
Beiguang Shan, Qingxuan Shi & Fang Yang
Hebei Machine Vision Engineering Research Center, Hebei University, Baoding, 071000, China
Qingxuan Shi & Fang Yang
Institute of Intelligent Image and Document Information Processing, Hebei University, Baoding, 071000, China
Qingxuan Shi & Fang Yang

Authors

Beiguang Shan
View author publications
You can also search for this author in PubMed Google Scholar
Qingxuan Shi
View author publications
You can also search for this author in PubMed Google Scholar
Fang Yang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Qingxuan Shi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Shan, B., Shi, Q. & Yang, F. MSRT: multi-scale representation transformer for regression-based human pose estimation. Pattern Anal Applic 26, 591–603 (2023). https://doi.org/10.1007/s10044-023-01130-6

Download citation

Received: 17 April 2022
Accepted: 12 January 2023
Published: 01 February 2023
Issue Date: May 2023
DOI: https://doi.org/10.1007/s10044-023-01130-6

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

MSRT: multi-scale representation transformer for regression-based human pose estimation

Abstract

Access this article

Similar content being viewed by others

Visual attention network

Multi-scale Dilated Attention Graph Convolutional Network for Skeleton-Based Action Recognition

Yoga pose classification: a CNN and MediaPipe inspired deep learning approach for real-world application

Data Availability

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

MSRT: multi-scale representation transformer for regression-based human pose estimation

Abstract

Access this article

Similar content being viewed by others

Visual attention network

Multi-scale Dilated Attention Graph Convolutional Network for Skeleton-Based Action Recognition

Yoga pose classification: a CNN and MediaPipe inspired deep learning approach for real-world application

Data Availability

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation