Abstract
This paper presents a method to capture human pose from individual real-world RGB images using a deep learning technique. Existing deep learning approaches to human pose estimation are formulated as detection or regression problems and operate in a part-based manner. As a new perspective, we introduce a classification scheme for this problem, which reasons about the pose holistically. To the best of our knowledge, this is the first work on holistic human pose classification, a task made feasible by the power of convolutional neural networks in feature learning. After training a convolutional neural network to classify the input image into one of a set of KeyPoses, the final pose is computed as a linear combination of several KeyPoses. In this holistic classification approach, the vast, high-degree-of-freedom human pose space is divided into a finite number of subspaces, and the convolutional neural network shows promising results in learning the features of each subspace. Empirical results (PCP and PCK rates) demonstrate that the proposed scheme can understand human pose (i.e., predict a valid, true and coarse pose) in real-world unconstrained images with challenges such as severe occlusion, high articulation, low quality and cluttered background. Furthermore, the proposed method removes the need to define a complex model (such as an appearance model or pairwise joint relations). We also demonstrate a potential application of the proposed method in semantic image retrieval based on human pose.
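The final step described above, forming a pose as a linear combination of several KeyPoses, can be illustrated with a minimal sketch. The abstract does not state how the combination weights are chosen, so the code below assumes they are the renormalized classifier scores of the top-k classes. The function name `combine_keyposes`, the parameter `top_k`, and the array layouts are hypothetical and introduced only for illustration.

```python
import numpy as np

def combine_keyposes(class_scores, keyposes, top_k=3):
    """Blend the top_k KeyPoses into one pose.

    class_scores: (C,) non-negative scores, e.g. softmax outputs of a CNN.
    keyposes:     (C, J, 2) array, one 2D joint layout per KeyPose class.
    Returns a (J, 2) pose.

    Assumption: weights are the top_k scores renormalized to sum to 1;
    the weighting rule is not specified in the abstract.
    """
    class_scores = np.asarray(class_scores, dtype=float)
    keyposes = np.asarray(keyposes, dtype=float)
    # Indices of the k highest-scoring KeyPose classes.
    top = np.argsort(class_scores)[::-1][:top_k]
    # Renormalize so the selected weights sum to 1.
    weights = class_scores[top] / class_scores[top].sum()
    # Weighted sum over the selected KeyPoses: (k,) x (k, J, 2) -> (J, 2).
    return np.tensordot(weights, keyposes[top], axes=1)
```

With `top_k=1` this reduces to returning the single predicted KeyPose, while larger values interpolate between neighbouring subspaces of the pose space.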
Cite this article
Shamsafar, F., Ebrahimnezhad, H. Understanding holistic human pose using class-specific convolutional neural network. Multimed Tools Appl 77, 23193–23225 (2018). https://doi.org/10.1007/s11042-018-5617-1
DOI: https://doi.org/10.1007/s11042-018-5617-1