Abstract
Multi-person pose estimation has become an increasingly popular topic with the advancement of computer vision and human-machine interaction tasks, and progress in this field can deepen the understanding of human poses and activities. Mainstream multi-person pose estimation methods fall into two categories: top-down and bottom-up. Top-down methods achieve better accuracy by reducing the problem to single-person pose estimation, but in exchange this strategy substantially increases time complexity. Bottom-up methods locate all keypoints in the image directly, which is potentially more efficient and can run in real time. However, most current bottom-up methods treat keypoint detection and keypoint grouping as two independent steps, which hinders both the overall performance and the computational efficiency of these algorithms. To address this issue, we propose an end-to-end bottom-up framework for multi-person pose estimation. With HRNet as the backbone, we add a deconvolution module to obtain high-resolution feature maps in the keypoint proposal stage. In the grouping stage, a graph neural network is integrated into the backbone so that the whole framework can be trained end-to-end. Taking the keypoint candidates as nodes, two discriminators supervise the grouping process. Finally, a graph-based pose optimization algorithm refines the results. Experiments on the COCO and CrowdPose datasets show that our method achieves better accuracy while also greatly reducing computation time.
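The abstract does not say how the graph neural network and its two discriminators score pairs of keypoint candidates, so the following is only a minimal sketch of the general idea of graph-based grouping, not the authors' method. It assumes each keypoint candidate is a node carrying a joint type, and that some model supplies a pairwise "same person?" score. Here that score is faked with a hypothetical 1-D embedding tag; `Candidate`, `tag`, `edge_score`, `group_candidates` and `threshold` are all illustrative names. Grouping is done greedily, highest-scoring edges first, under the constraint that one person holds at most one candidate per joint type.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Candidate:
    """One keypoint candidate, i.e. one graph node (illustrative structure)."""
    joint: int   # joint type index, e.g. 0 = nose, 1 = neck (hypothetical labels)
    x: float
    y: float
    tag: float   # hypothetical 1-D embedding, used only to fake edge scores


def edge_score(a: Candidate, b: Candidate) -> float:
    """Toy stand-in for a learned 'same person?' score in (0, 1].

    Candidates with similar tags score close to 1; this is NOT the
    paper's discriminator, just a placeholder for one.
    """
    return 1.0 / (1.0 + abs(a.tag - b.tag))


def group_candidates(cands: List[Candidate], threshold: float = 0.5) -> List[List[int]]:
    """Greedily merge candidates into people, highest-scoring edges first.

    Constraint: a person never contains two candidates of the same joint type.
    Returns groups of candidate indices, sorted for deterministic output.
    """
    parent = list(range(len(cands)))
    # Joint types present in the group whose union-find root is the key.
    joints: Dict[int, Set[int]] = {i: {c.joint} for i, c in enumerate(cands)}

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Build edges only between different joint types that pass the threshold.
    edges: List[Tuple[float, int, int]] = []
    for i in range(len(cands)):
        for j in range(i + 1, len(cands)):
            if cands[i].joint != cands[j].joint:
                s = edge_score(cands[i], cands[j])
                if s >= threshold:
                    edges.append((s, i, j))
    edges.sort(reverse=True)

    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj or joints[ri] & joints[rj]:
            continue  # already grouped, or the merge would duplicate a joint type
        parent[rj] = ri
        joints[ri] |= joints.pop(rj)

    groups: Dict[int, List[int]] = {}
    for i in range(len(cands)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Two people, each with a nose (joint 0) and a neck (joint 1).
    cands = [
        Candidate(0, 10.0, 10.0, 0.0),
        Candidate(1, 10.0, 30.0, 0.1),
        Candidate(0, 50.0, 10.0, 5.0),
        Candidate(1, 50.0, 30.0, 5.2),
    ]
    print(group_candidates(cands))  # [[0, 1], [2, 3]]
```

Greedy union-find merging is a simple baseline for this kind of constrained grouping. In a learned pipeline such as the one the abstract describes, the hand-made `edge_score` would be replaced by network predictions, and a separate refinement step could then adjust the grouped poses.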






Ethics declarations
Conflict of interest
There are no conflicts of interest in this work.
Cite this article
Zeng, Q., Hu, Y., Li, D. et al. Multi-person pose estimation based on graph grouping optimization. Multimed Tools Appl 82, 7039–7053 (2023). https://doi.org/10.1007/s11042-022-13445-3