Abstract
HRNet (High-Resolution Network), reported by Sun et al. (in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2019), has been the state-of-the-art human pose estimation method, benefiting from its parallel high-resolution network structure. However, HRNet is still a typical CNN (Convolutional Neural Network) architecture built on local convolution operations. Recently, Transformers have been successfully applied in many areas of computer vision. Their core mechanism is self-attention, which can learn global, long-range dependencies among different parts. In this paper, we propose a human pose estimation framework built upon High-Resolution Multi-scale Transformers, termed MTPose, which combines the advantages of high-resolution representations and Transformers to improve performance. Specifically, we design a sub-network, MTNet (Multi-scale Transformers-based high-resolution Network), which consists of two parallel branches. The local branch maintains high resolution using local convolutional operations. The global branch uses multi-scale Transformer encoders to learn long-range dependencies among the whole-body keypoints. At the end of the network, the two branches are integrated to predict the final keypoint heatmaps. Experiments on two benchmark datasets, the MSCOCO keypoint detection dataset and the MPII human pose dataset, demonstrate that our method significantly improves on state-of-the-art human pose estimation methods. Code will be available at: https://github.com/fudiGeng/MTPose.
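The contrast between a local operation and self-attention can be illustrated with a small sketch. This is not the MTNet implementation: the 1D token sequence, the identity query/key/value projections, the 3-tap averaging filter standing in for a convolution, and the additive fusion are all assumptions chosen to keep the example short. It only shows that a local branch ignores distant positions, while a self-attention branch lets every position draw on every other.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections); a real encoder learns these projections.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Each output is a weighted sum over ALL positions (global context).
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def local_branch(tokens):
    """Toy local operation: average each token with its immediate neighbours.

    Assumption: a 3-tap box filter stands in for a learned convolution.
    """
    d = len(tokens[0])
    out = []
    for i in range(len(tokens)):
        window = tokens[max(0, i - 1): i + 2]
        out.append([sum(t[j] for t in window) / len(window) for j in range(d)])
    return out


def fuse(local_feats, global_feats):
    """Hypothetical fusion by element-wise addition of the two branches."""
    return [[a + b for a, b in zip(l, g)] for l, g in zip(local_feats, global_feats)]


# Four 2-dimensional tokens; positions 0 and 3 are far apart.
tokens = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
fused = fuse(local_branch(tokens), self_attention(tokens))
```

Changing the last token leaves the local output at position 0 untouched but changes the attention output there, which is the long-range dependency the abstract refers to.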
References
Si C, Chen W, Wang W, Wang L, Tan T (2019) An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 1227–1236. https://doi.org/10.1109/CVPR.2019.00132
Yang C, Xu Y, Shi J, Dai B, Zhou B (2020) Temporal pyramid network for action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 591–600. https://doi.org/10.1109/CVPR42600.2020.00067
Gao J, Zheng WS, Pan JH, Gao C, Wang Y, Zeng W, Lai J (2020) An asymmetric modeling for action assessment. In: European conference on computer vision (ECCV), pp 222–238. https://doi.org/10.1007/978-3-030-58577-8_14
Pan JH, Gao J, Zheng WS (2019) Action assessment by joint relation graphs. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 6331–6340. https://doi.org/10.1109/ICCV.2019.00643
Snower M, Kadav A, Lai F, Graf HP (2020) 15 Keypoints is all you need. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 6738–6748. https://doi.org/10.1109/CVPR42600.2020.00677
Ning G, Pei J, Huang H (2020) LightTrack: a generic framework for online top-down human pose tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 1034–1035. https://doi.org/10.1109/CVPRW50498.2020.00525
Wang M, Tighe J, Modolo D (2020) Combining detection and tracking for human pose estimation in videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 11088–11096. https://doi.org/10.1109/CVPR42600.2020.01110
Rafi U, Doering A, Leibe B, Gall J (2020) Self-supervised keypoint correspondences for multi-person pose estimation and tracking in videos. In: European conference on computer vision (ECCV). https://doi.org/10.1007/978-3-030-58565-5_3
Kwon OH, Tanke J, Gall J (2020) Recursive Bayesian filtering for multiple human pose tracking from multiple cameras. In: Proceedings of the Asian conference on computer vision (ACCV). https://doi.org/10.1007/978-3-030-69532-3_27
Kocabas M, Athanasiou N, Black MJ (2020) VIBE: video inference for human body pose and shape estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5253–5263. https://doi.org/10.1109/CVPR42600.2020.00530
Chen H, Guo P, Li P, Lee GH, Chirikjian G (2020) Multi-person 3D pose estimation in crowded scenes based on multi-view geometry. In: European conference on computer vision (ECCV), pp 541–557. https://doi.org/10.1007/978-3-030-58580-8_32
Kolotouros N, Pavlakos G, Black MJ, Daniilidis K (2019) Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 2252–2261. https://doi.org/10.1109/ICCV.2019.00234
Qiu H, Wang C, Wang J, Wang N, Zeng W (2019) Cross view fusion for 3D human pose estimation. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 4342–4351. https://doi.org/10.1109/ICCV.2019.00444
Zhou X, Huang Q, Sun X, Xue X, Wei Y (2017) Towards 3D human pose estimation in the wild: a weakly-supervised approach. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 398–407. https://doi.org/10.1109/ICCV.2017.51
Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. In: European conference on computer vision (ECCV). https://doi.org/10.1007/978-3-319-46484-8_29
Xiao B, Wu H, Wei Y (2018) Simple baselines for human pose estimation and tracking. In: Proceedings of the European conference on computer vision (ECCV), pp 466–481. https://doi.org/10.1007/978-3-030-01231-1_29
Sun K, Xiao B, Liu D, Wang J (2019) Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5693–5703. https://doi.org/10.1109/CVPR.2019.00584
Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, Liu D, Mu Y, Tan M, Wang X, Liu W, Xiao B (2020) Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2020.2983686
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An image is worth 16x16 words: transformers for image recognition at scale. In: International conference on learning representations (ICLR)
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision (ECCV)
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: European conference on computer vision (ECCV), pp 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
Andriluka M, Pishchulin L, Gehler P, Schiele B (2014) 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 3686–3693. https://doi.org/10.1109/CVPR.2014.471
Toshev A, Szegedy C (2014) DeepPose: human pose estimation via deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 1653–1660. https://doi.org/10.1109/CVPR.2014.214
Wei SE, Ramakrishna V, Kanade T, Sheikh Y (2016) Convolutional pose machines. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 4724–4732. https://doi.org/10.1109/CVPR.2016.511
Cao Z, Simon T, Wei SE, Sheikh Y (2017) Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). https://doi.org/10.1109/CVPR.2017.143
Newell A, Huang Z, Deng J (2017) Associative embedding: end-to-end learning for joint detection and grouping. Adv Neural Inf Process Syst
Kreiss S, Bertoni L, Alahi A (2019) PifPaf: composite fields for human pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 11977–11986. https://doi.org/10.1109/CVPR.2019.01225
Cheng B, Xiao B, Wang J, Shi H, Huang TS, Zhang L (2020) HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5386–5395. https://doi.org/10.1109/CVPR42600.2020.00543
Luo Z, Wang Z, Huang Y, Wang L, Tan T, Zhou E (2021) Rethinking the heatmap regression for bottom-up human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)
Geng Z, Sun K, Xiao B, Zhang Z, Wang J (2021) Bottom-up human pose estimation via disentangled keypoint regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)
Mao W, Tian Z, Wang X, Shen C (2021) FCPose: fully convolutional multi-person pose estimation with dynamic instance-aware convolutions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 91–99. https://doi.org/10.1109/TPAMI.2016.2577031
Yu C, Xiao B, Gao C, Yuan L, Zhang L, Sang N, Wang J (2021) Lite-HRNet: a lightweight high-resolution network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 10440–10450
Zhang X, Zhou X, Lin M, Sun J (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 6848–6856. https://doi.org/10.1109/CVPR.2018.00716
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 770–778. https://doi.org/10.1109/CVPR.2016.90
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
Zhu X, Su W, Lu L, Li B, Wang X, Dai J (2021) Deformable DETR: deformable transformers for end-to-end object detection. In: International conference on learning representations (ICLR)
Wang Y, Xu Z, Wang X, Shen C, Cheng B, Shen H, Xia H (2021) End-to-end video instance segmentation with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 8741–8750
Huang L, Tan J, Liu J, Yuan J (2020) Hand-transformer: non-autoregressive structured modeling for 3D hand pose estimation. In: European conference on computer vision (ECCV). Springer, pp 17–33
Wang N, Zhou W, Wang J, Li H (2021) Transformer meets tracker: exploiting temporal context for robust visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 1571–1580
Meinhardt T, Kirillov A, Leal-Taixe L, Feichtenhofer C (2021) TrackFormer: multi-object tracking with transformers. arXiv preprint arXiv:2101.02702
Chen H, Wang Y, Guo T, Xu C, Deng Y, Liu Z, Ma S, Xu C, Xu C, Gao W (2021) Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 12299–12310
Dai Z, Liu H, Le Q, Tan M (2021) CoAtNet: marrying convolution and attention for all data sizes. arXiv preprint arXiv:2106.04803
Bello I, Zoph B, Vaswani A, Shlens J, Le QV (2019) Attention augmented convolutional networks. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 3286–3295. https://doi.org/10.1109/ICCV.2019.00338
Srinivas A, Lin TY, Parmar N, Shlens J, Abbeel P, Vaswani A (2021) Bottleneck transformers for visual recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 16519–16529
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030
Wu H, Xiao B, Codella N, Liu M, Dai X, Yuan L, Zhang L (2021) CvT: introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808
Yuan L, Chen Y, Wang T, Yu W, Shi Y, Jiang ZH, Tay FE, Feng J, Yan S (2021) Tokens-to-Token ViT: training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986
Wang W, Xie E, Li X, Fan DP, Song K, Liang D, Lu T, Luo P, Shao L (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122
Mao W, Ge Y, Shen C, Tian Z, Wang X, Wang Z (2021) TFPose: direct human pose estimation with transformers. arXiv preprint arXiv:2103.15320
Chen Y, Wang Z, Peng Y, Zhang Z, Yu G, Sun J (2018) Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 7103–7112. https://doi.org/10.1109/CVPR.2018.00742
Papandreou G, Zhu T, Chen LC, Gidaris S, Tompson J, Murphy K (2018) PersonLab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In: Proceedings of the European conference on computer vision (ECCV), pp 269–286
Kocabas M, Karagoz S, Akbas E (2018) MultiPoseNet: fast multi-person pose estimation using pose residual network. In: Proceedings of the European conference on computer vision (ECCV), pp 417–433
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask R-CNN. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 2961–2969. https://doi.org/10.1109/TPAMI.2018.2844175
Papandreou G, Zhu T, Kanazawa N, Toshev A, Tompson J, Bregler C, Murphy K (2017) Towards accurate multi-person pose estimation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 4903–4911. https://doi.org/10.1109/CVPR.2017.395
Fang HS, Xie S, Tai YW, Lu C (2017) RMPE: regional multi-person pose estimation. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 2334–2343. https://doi.org/10.1109/ICCV.2017.256
Gamra MB, Akhloufi MA (2021) A review of deep learning techniques for 2D and 3D human pose estimation. Image Vis Comput 114:104282. https://doi.org/10.1016/j.imavis.2021.104282
Nibali A, Millward J, He Z, Morgan S (2021) ASPset: an outdoor sports pose video dataset with 3D keypoint annotations. Image Vis Comput 111:104196. https://doi.org/10.1016/j.imavis.2021.104196
Zhang W, Wang X, You W, Chen J, Dai P, Zhang P (2019) RESLS: region and edge synergetic level set framework for image segmentation. IEEE Trans Image Process 29:57–71. https://doi.org/10.1109/TIP.2019.2928134
Xiao Y (2014) Blurred trace infrared image segmentation based on template approach and immune factor. Infrared Phys Technol 67:116–120. https://doi.org/10.1016/j.infrared.2014.07.002
Xiao Y, Zijie Z (2020) Infrared image extraction algorithm based on adaptive growth immune field. Neural Process Lett 51:2575–2587. https://doi.org/10.1007/s11063-020-10218-7
Zhu H, Zhang Q, Wang Q, Li H (2017) 4D light field superpixel and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 6384–6392. https://doi.org/10.1109/TIP.2019.2927330
Yu X, Zhou Z, Gao Q, Li D, Ríha K (2018) Infrared image segmentation using growing immune field and clone threshold. Infrared Phys Technol 88:184–193. https://doi.org/10.1016/j.infrared.2017.11.029
Zhou Z, Zhang B, Yu X (2021) Infrared handprint classification using deep convolution neural network. Neural Process Lett. https://doi.org/10.1007/s11063-021-10429-6
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61771299.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Cite this article
Wang, R., Geng, F. & Wang, X. MTPose: Human Pose Estimation with High-Resolution Multi-scale Transformers. Neural Process Lett 54, 3941–3964 (2022). https://doi.org/10.1007/s11063-022-10794-w
DOI: https://doi.org/10.1007/s11063-022-10794-w