Abstract
Human pose estimation, especially multi-person pose estimation, is vital for understanding abnormal human behavior. In this paper, we develop a fractal hourglass model to automatically regress human body joints, and propose a layered double-way inference algorithm to calculate the affinity between neighboring skeleton joints. First, we replace the original hourglass residual unit and describe the heatmap regression process for candidate skeleton joint locations. Then, we determine the specific body joint locations and optimize the regression results. Next, the double-way conditional probabilities between adjacent joints are defined as the pairwise joint affinity, which is applied to match adjacent human body parts. In addition, we adopt a spatial distance constraint to refine the joint matching results. Finally, we connect the best-matching joint pairs and iterate the process until all candidate joints are assigned to individuals. Extensive experiments on the MPII multi-person subset and the COCO 2016 keypoints challenge show the effectiveness of our method, which outperforms the second-best method (Associative Embedding) by 0.45 and 1.20%, respectively.
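The matching stage summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `match_adjacent_joints`, the use of a product to combine the forward and backward conditional scores, and the pixel threshold `max_dist` are all assumptions made for illustration, since the abstract does not give the exact definitions. The sketch only shows the general idea of scoring candidate pairs in both directions, discarding pairs that violate a spatial distance constraint, and greedily connecting the best remaining pairs.

```python
import math


def match_adjacent_joints(cands_a, cands_b, p_b_given_a, p_a_given_b, max_dist):
    """Greedily pair candidates of two adjacent joint types (illustrative only).

    cands_a, cands_b: lists of (x, y) candidate locations.
    p_b_given_a[i][j]: assumed forward score that b_j belongs with a_i.
    p_a_given_b[j][i]: assumed backward score that a_i belongs with b_j.
    max_dist: hypothetical spatial distance threshold, in pixels.
    """
    scored = []
    for i, a in enumerate(cands_a):
        for j, b in enumerate(cands_b):
            # Spatial distance constraint: skip implausibly distant pairs.
            if math.hypot(a[0] - b[0], a[1] - b[1]) > max_dist:
                continue
            # Assumption: combine the two directions by multiplication.
            affinity = p_b_given_a[i][j] * p_a_given_b[j][i]
            scored.append((affinity, i, j))

    # Connect the best-scoring pairs first; each candidate is used once.
    scored.sort(key=lambda t: t[0], reverse=True)
    used_a, used_b, pairs = set(), set(), []
    for affinity, i, j in scored:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        pairs.append((i, j, affinity))
    return pairs


# Toy data: two people, candidates listed in a different order per joint type.
necks = [(10.0, 10.0), (50.0, 12.0)]
shoulders = [(52.0, 20.0), (12.0, 18.0)]
p_fwd = [[0.1, 0.9], [0.8, 0.2]]  # indexed [neck][shoulder]
p_bwd = [[0.2, 0.7], [0.9, 0.1]]  # indexed [shoulder][neck]

pairs = match_adjacent_joints(necks, shoulders, p_fwd, p_bwd, max_dist=30.0)
for i, j, score in pairs:
    print(f"neck {i} <-> shoulder {j} (affinity {score:.2f})")
```

In this toy example each neck candidate is paired with the nearby shoulder candidate, and the cross-person pairs are rejected by the distance threshold before any affinity is computed. Repeating the same step along each limb of the skeleton would assign all candidates to individuals, under the stated assumptions.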











Acknowledgements
We would like to thank the authors of the MPII human pose dataset and the team members of the COCO 2016 Keypoint Challenge. We also thank the members of our laboratory for their assistance.
Funding
This work was supported by grants from the National Natural Science Foundation of China (Grant No. 61605048), the Talent Project of Huaqiao University (Grant No. 14BS215), and the Quanzhou Scientific and Technological Planning Projects of Fujian, China (Grant No. 2015Z120).
Cite this article
Luo, Y., Xu, Z., Liu, P. et al. Combining fractal hourglass network and skeleton joints pairwise affinity for multi-person pose estimation. Multimed Tools Appl 78, 7341–7363 (2019). https://doi.org/10.1007/s11042-018-6502-7
DOI: https://doi.org/10.1007/s11042-018-6502-7