
MTPose: Human Pose Estimation with High-Resolution Multi-scale Transformers

Abstract

HRNet (High-Resolution Network), as reported by Sun et al. (in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2019), has been the state-of-the-art human pose estimation method, benefiting from its parallel high-resolution network structure. However, HRNet is still a typical CNN (Convolutional Neural Network) architecture built on local convolution operations. Recently, Transformers have been successfully applied to many computer vision tasks. The core mechanism in Transformers is self-attention, which learns global, long-range dependencies among different parts of the input. In this paper, we propose a human pose estimation framework built upon high-resolution multi-scale Transformers, termed MTPose. We combine the advantages of high resolution and Transformers to improve performance. Specifically, we design a sub-network, MTNet (Multi-scale Transformers-based high-resolution Network), which consists of two parallel branches. One is a high-resolution branch with local convolution operations, named the local branch. The other is the global branch, which uses multi-scale Transformer encoders to learn long-range dependencies among the whole-body keypoints. At the end of the network, the two branches are fused to predict the final keypoint heatmaps. Experiments on two benchmark datasets, the MSCOCO keypoint detection dataset and the MPII human pose dataset, demonstrate that our method significantly improves upon state-of-the-art human pose estimation methods. Code will be available at: https://github.com/fudiGeng/MTPose.
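
To make the two-branch design described in the abstract concrete, the following is a minimal PyTorch sketch of one way such an architecture could be wired up: a convolutional local branch kept at a single high resolution, a global branch that tokenizes the features at several scales and passes them through a Transformer encoder, and an additive fusion followed by a 1x1 convolution that predicts K keypoint heatmaps. The module names, channel widths, token scales, and fusion rule are illustrative assumptions, not the authors' MTNet/MTPose implementation (which is the code referenced at the GitHub link above).

```python
# Minimal sketch (not the authors' implementation) of a two-branch pose network
# in the spirit of MTPose: a high-resolution convolutional "local" branch plus a
# multi-scale Transformer "global" branch, fused to predict keypoint heatmaps.
# Channel widths, token scales, and the fusion rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalBranch(nn.Module):
    """Tokenize a feature map at several scales and run a Transformer encoder."""

    def __init__(self, channels, scales=(2, 4, 8), depth=4, heads=8):
        super().__init__()
        self.scales = scales
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens, grids = [], []
        for s in self.scales:
            # Pool to a coarser grid, then flatten to (B, N_s, C) tokens.
            feat = F.adaptive_avg_pool2d(x, (h // s, w // s))
            grids.append(feat.shape[-2:])
            tokens.append(feat.flatten(2).transpose(1, 2))
        out = self.encoder(torch.cat(tokens, dim=1))  # joint attention over all scales
        # Keep the finest-scale tokens and restore their spatial layout.
        n0 = grids[0][0] * grids[0][1]
        return out[:, :n0].transpose(1, 2).reshape(b, c, *grids[0])


class MTPoseSketch(nn.Module):
    """Local conv branch + global Transformer branch -> K keypoint heatmaps."""

    def __init__(self, in_ch=3, width=64, num_keypoints=17):
        super().__init__()
        self.stem = nn.Sequential(                 # 1/4-resolution feature map
            nn.Conv2d(in_ch, width, 3, stride=4, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True))
        self.local_branch = nn.Sequential(         # local convolution operations
            nn.Conv2d(width, width, 3, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True))
        self.global_branch = GlobalBranch(width)   # long-range dependencies
        self.head = nn.Conv2d(width, num_keypoints, 1)

    def forward(self, x):
        feat = self.stem(x)
        glob = F.interpolate(self.global_branch(feat), size=feat.shape[-2:],
                             mode="bilinear", align_corners=False)
        fused = self.local_branch(feat) + glob     # integrate the two branches
        return self.head(fused)                    # (B, K, H/4, W/4) heatmaps


if __name__ == "__main__":
    model = MTPoseSketch()
    heatmaps = model(torch.randn(1, 3, 256, 192))  # MSCOCO-style input size
    print(heatmaps.shape)                          # torch.Size([1, 17, 64, 48])
```

Running joint attention over tokens pooled at several scales is only one simple way to realize "multi-scale" Transformer encoding; the paper's MTNet may organize its encoders and fusion differently.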


References

  1. Si C, Chen W, Wang W, Wang L, Tan T (2019) An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 1227–1236. https://doi.org/10.1109/CVPR.2019.00132

  2. Yang C, Xu Y, Shi J, Dai B, Zhou B (2020) Temporal pyramid network for action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 591–600. https://doi.org/10.1109/CVPR42600.2020.00067

  3. Gao J, Zheng WS, Pan JH, Gao C, Wang Y, Zeng W, Lai J (2020) An asymmetric modeling for action assessment. In: European conference on computer vision (ECCV), pp. 222–238. https://doi.org/10.1007/978-3-030-58577-8_14

  4. Pan JH, Gao J, Zheng WS (2019) Action assessment by joint relation graphs. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 6331–6340. https://doi.org/10.1109/ICCV.2019.00643

  5. Snower M, Kadav A, Lai F, Graf HP (2020) 15 Keypoints is all you need. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 6738–6748. https://doi.org/10.1109/CVPR42600.2020.00677

  6. Ning G, Pei J, Huang H (2020) LightTrack: a generic framework for online top-down human pose tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 1034–1035. https://doi.org/10.1109/CVPRW50498.2020.00525

  7. Wang M, Tighe J, Modolo D (2020) Combining detection and tracking for human pose estimation in videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 11088–11096. https://doi.org/10.1109/CVPR42600.2020.01110

  8. Rafi U, Doering A, Leibe B, Gall J (2020) Self-supervised keypoint correspondences for multi-person pose estimation and tracking in videos. In: European conference on computer vision (ECCV). https://doi.org/10.1007/978-3-030-58565-5_3

  9. Kwon OH, Tanke J, Gall J (2020) Recursive Bayesian filtering for multiple human pose tracking from multiple cameras. In: Proceedings of the asian conference on computer vision (ACCV). https://doi.org/10.1007/978-3-030-69532-3_27

  10. Kocabas M, Athanasiou N, Black MJ (2020) VIBE: video inference for human body pose and shape estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5253–5263. https://doi.org/10.1109/CVPR42600.2020.00530

  11. Chen H, Guo P, Li P, Lee GH, Chirikjian G (2020) Multi-person 3D pose estimation in crowded scenes based on multi-view geometry. In: European conference on computer vision (ECCV), pp 541–557. https://doi.org/10.1007/978-3-030-58580-8_32

  12. Kolotouros N, Pavlakos G, Black MJ, Daniilidis K (2019) Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp. 2252–2261. https://doi.org/10.1109/ICCV.2019.00234

  13. Qiu H, Wang C, Wang J, Wang N, Zeng W (2019) Cross view fusion for 3D human pose estimation. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp. 4342–4351. https://doi.org/10.1109/ICCV.2019.00444

  14. Zhou X, Huang Q, Sun X, Xue X, Wei Y (2017) Towards 3D human pose estimation in the wild: a weakly-supervised approach. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 398–407. https://doi.org/10.1109/ICCV.2017.51

  15. Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. In: European conference on computer vision (ECCV). https://doi.org/10.1007/978-3-319-46484-8_29

  16. Xiao B, Wu H, Wei Y (2018) Simple baselines for human pose estimation and tracking. In: Proceedings of the European conference on computer vision (ECCV), pp 466–481. https://doi.org/10.1007/978-3-030-01231-1_29

  17. Sun K, Xiao B, Liu D, Wang J (2019) Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5693–5703. https://doi.org/10.1109/CVPR.2019.00584

  18. Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, Liu D, Mu Y, Tan M, Wang X, Liu W, Xiao B (2020) Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2020.2983686

  19. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An image is worth 16x16 words: transformers for image recognition at scale. In: International conference on learning representations (ICLR)

  20. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision (ECCV)

  21. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: European conference on computer vision (ECCV), pp 740–755. https://doi.org/10.1007/978-3-319-10602-1_48

  22. Andriluka M, Pishchulin L, Gehler P, Schiele B (2014) 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 3686–3693. https://doi.org/10.1109/CVPR.2014.471

  23. Toshev A, Szegedy C (2014) DeepPose: human pose estimation via deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 1653–1660. https://doi.org/10.1109/CVPR.2014.214

  24. Wei SE, Ramakrishna V, Kanade T, Sheikh Y (2016) Convolutional pose machines. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 4724–4732. https://doi.org/10.1109/CVPR.2016.511

  25. Cao Z, Simon T, Wei SE, Sheikh Y (2017) Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). https://doi.org/10.1109/CVPR.2017.143

  26. Newell A, Huang Z, Deng J (2017) Associative embedding: end-to-end learning for joint detection and grouping. Adv Neural Inf Process Syst

  27. Kreiss S, Bertoni L, Alahi A (2019) PifPaf: composite fields for human pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 11977–11986. https://doi.org/10.1109/CVPR.2019.01225

  28. Cheng B, Xiao B, Wang J, Shi H, Huang TS, Zhang L (2020) HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5386–5395. https://doi.org/10.1109/CVPR42600.2020.00543

  29. Luo Z, Wang Z, Huang Y, Wang L, Tan T, Zhou E (2021) Rethinking the heatmap regression for bottom-up human pose estimation. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR)

  30. Geng Z, Sun K, Xiao B, Zhang Z, Wang J (2021) Bottom-up human pose estimation via disentangled keypoint regression. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR)

  31. Mao W, Tian Z, Wang X, Shen C (2021) FCPose: fully convolutional multi-person pose estimation with dynamic instance-aware convolutions. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR)

  32. Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 91–99. https://doi.org/10.1109/TPAMI.2016.2577031

  33. Yu C, Xiao B, Gao C, Yuan L, Zhang L, Sang N, Wang J (2021) Lite-HRNet: a lightweight high-resolution network. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR), pp 10440–10450

  34. Zhang X, Zhou X, Lin M, Sun J (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proc. IEEE conference on computer vision and pattern recognition (CVPR), pp 6848–6856. https://doi.org/10.1109/CVPR.2018.00716

  35. He K, Zhang X, Ren S, Sun J (2016) Deep Residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 770–778. https://doi.org/10.1109/CVPR.2016.90

  36. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008

  37. Zhu X, Su W, Lu L, Li B, Wang X, Dai J (2021) Deformable DETR: deformable transformers for end-to-end object detection. In: International conference on learning representations (ICLR)

  38. Wang Y, Xu Z, Wang X, Shen C, Cheng B, Shen H, Xia H (2021) End-to-end video instance segmentation with transformers. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR), pp 8741–8750

  39. Huang L, Tan J, Liu J, Yuan J (2020) Hand-transformer: non-autoregressive structured modeling for 3D hand pose estimation. In: Proceedings European conference on computer vision (ECCV). Springer, pp 17–33

  40. Wang N, Zhou W, Wang J, Li H (2021) Transformer meets tracker: exploiting temporal context for robust visual tracking. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR), pp 1571–1580

  41. Meinhardt T, Kirillov A, Leal-Taixe L, Feichtenhofer C (2021) TrackFormer: multi-object tracking with transformers. arXiv preprint arXiv:2101.02702

  42. Chen H, Wang Y, Guo T, Xu C, Deng Y, Liu Z, Ma S, Xu C, Xu C, Gao W (2021) Pre-trained image processing transformer. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR), pp 12299–12310

  43. Dai Z, Liu H, Le Q, Tan M (2021) CoAtNet: marrying convolution and attention for all data sizes. arXiv preprint arXiv:2106.04803

  44. Bello I, Zoph B, Vaswani A, Shlens J, Le QV (2019) Attention augmented convolutional networks. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 3286–3295. https://doi.org/10.1109/ICCV.2019.00338

  45. Srinivas A, Lin TY, Parmar N, Shlens J, Abbeel P, Vaswani A (2021) Bottleneck transformers for visual recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 16519–16529

  46. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030

  47. Wu H, Xiao B, Codella N, Liu M, Dai X, Yuan L, Zhang L (2021) CvT: introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808

  48. Yuan L, Chen Y, Wang T, Yu W, Shi Y, Jiang ZH, Tay FE, Feng J, Yan S (2021) Tokens-to-Token ViT: training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986

  49. Wang W, Xie E, Li X, Fan DP, Song K, Liang D, Lu T, Luo P, Shao L (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122

  50. Mao W, Ge Y, Shen C, Tian Z, Wang X, Wang Z (2021) TFPose: direct human pose estimation with transformers. arXiv preprint arXiv:2103.15320

  51. Chen Y, Wang Z, Peng Y, Zhang Z, Yu G, Sun J (2018) Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 7103–7112. https://doi.org/10.1109/CVPR.2018.00742

  52. Papandreou G, Zhu T, Chen LC, Gidaris S, Tompson J, Murphy K (2018) PersonLab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In: Proceedings of the European conference on computer vision (ECCV), pp 269–286

  53. Kocabas M, Karagoz S, Akbas E (2018) MultiPoseNet: fast multi-person pose estimation using pose residual network. In: Proceedings of the European conference on computer vision (ECCV), pp 417–433

  54. He K, Gkioxari G, Dollár P, Girshick R (2017) Mask R-CNN. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 2961–2969. https://doi.org/10.1109/TPAMI.2018.2844175

  55. Papandreou G, Zhu T, Kanazawa N, Toshev A, Tompson J, Bregler C, Murphy K (2017) Towards accurate multi-person pose estimation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 4903–4911. https://doi.org/10.1109/CVPR.2017.395

  56. Fang HS, Xie S, Tai YW, Lu C (2017) RMPE: regional multi-person pose estimation. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 2334–2343. https://doi.org/10.1109/ICCV.2017.256

  57. Gamra MB, Akhloufi MA (2021) A review of deep learning techniques for 2D and 3D human pose estimation. Image Vis Comput 114:104282. https://doi.org/10.1016/j.imavis.2021.104282

  58. Nibali A, Millward J, He Z, Morgan S (2021) ASPset: an outdoor sports pose video dataset with 3D keypoint annotations. Image Vis Comput 111:104196. https://doi.org/10.1016/j.imavis.2021.104196

  59. Zhang W, Wang X, You W, Chen J, Dai P, Zhang P (2019) RESLS: region and edge synergetic level set framework for image segmentation. IEEE Trans Image Process 29:57–71. https://doi.org/10.1109/TIP.2019.2928134

  60. Xiao Y (2014) Blurred trace infrared image segmentation based on template approach and immune factor. Infrared Phys Technol 67:116–120. https://doi.org/10.1016/j.infrared.2014.07.002

  61. Xiao Y, Zijie Z (2020) Infrared image extraction algorithm based on adaptive growth immune field. Neural Process Lett 51:2575–2587. https://doi.org/10.1007/s11063-020-10218-7

  62. Zhu H, Zhang Q, Wang Q, Li H (2017) 4D light field superpixel and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 6384–6392. https://doi.org/10.1109/TIP.2019.2927330

  63. Yu X, Zhou Z, Gao Q, Li D, Ríha K (2018) Infrared image segmentation using growing immune field and clone threshold. Infrared Phys Technol 88:184–193. https://doi.org/10.1016/j.infrared.2017.11.029

  64. Zhou Z, Zhang B, Yu X (2021) Infrared handprint classification using deep convolution neural network. Neural Process Lett. https://doi.org/10.1007/s11063-021-10429-6

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61771299.

Author information

Corresponding author

Correspondence to Xiangyang Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, R., Geng, F. & Wang, X. MTPose: Human Pose Estimation with High-Resolution Multi-scale Transformers. Neural Process Lett 54, 3941–3964 (2022). https://doi.org/10.1007/s11063-022-10794-w
