
Hand pose estimation with multi-scale network


Abstract

Hand pose estimation plays an important role in human-computer interaction. Because it is a high-dimensional nonlinear regression problem, the accuracy achieved by existing hand pose estimation methods is still unsatisfactory. With the development of deep neural networks, more and more work has adopted deep-learning-based approaches. We propose a multi-scale convolutional neural network that operates on a single depth image of the hand. The network is end-to-end and directly regresses the three-dimensional coordinates of the hand joints, and the multi-scale structure improves both the convergence speed and the accuracy of the output. In addition, an output function for the output layer, called Stair Rectified Linear Units, is used to limit the output values. Our experiments show that optimization with momentum is not well suited to hand pose estimation, because the regression task is unstable. Finally, the proposed method achieves state-of-the-art performance on the NYU Hand Pose Dataset.
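The abstract only states that the output layer uses a function called Stair Rectified Linear Units to restrict the range of the predicted joint coordinates; its exact form is defined in the full paper. As a rough illustration of an output activation that clamps predictions to a fixed interval, here is a minimal sketch, assuming a simple clamp to [-1, 1]; the bounds, the name `stair_relu`, and the use of NumPy are illustrative assumptions, not the authors' definition:

```python
import numpy as np

def stair_relu(u, lower=-1.0, upper=1.0):
    """Illustrative output activation: passes values through inside
    [lower, upper] and clamps them at the boundaries.
    The paper's Stair ReLU may use a different (stair-like) shape."""
    return np.clip(u, lower, upper)

# Example: raw regression outputs for 14 joints * 3 coordinates
raw = np.random.randn(14 * 3) * 2.0
bounded = stair_relu(raw)          # every entry now lies in [-1, 1]
assert bounded.min() >= -1.0 and bounded.max() <= 1.0
```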



Acknowledgements

This work was supported by the National Key Technology R&D Program of China (No. 2015BAF01B00) and the National Key R&D Program of China (No. 2017YFD0400405).

Author information

Corresponding author

Correspondence to Bo Wu.

Appendix

For a fully connected layer $l$ with pre-activation $u^{l}$, define the sensitivity $\delta^{l}$ as the partial derivative of the cost function $L$ with respect to $u^{l}$:

$$ \delta^{l}=\frac{\partial L}{\partial u^{l}},\qquad u^{l}=W^{l}x^{l-1}+b^{l} $$
(A.1)

For the bias $b^{l}$, since $\partial u^{l}/\partial b^{l}=1$, the chain rule gives:

$$ \frac{\partial L}{\partial b^{l}}=\frac{\partial L}{\partial u^{l}}\frac{\partial u^{l}}{\partial b^{l}}=\delta^{l} $$
(A.2)

The partial derivative of the cost function $L$ with respect to the weight $W^{l}$ is:

$$ \frac{\partial L}{\partial W^{l}}=\frac{\partial L}{\partial u^{l}}\frac{\partial u^{l}}{\partial W^{l}}=\delta^{l}(x^{l-1})^{T} $$
(A.3)

The sensitivity differs from layer to layer; it can be computed recursively by back-propagating from layer $l+1$ to layer $l$:

$$\begin{array}{@{}rcl@{}} \delta^{l}&=&\frac{\partial L}{\partial u^{l}}=\frac{\partial L}{\partial u^{l + 1}}\frac{\partial u^{l + 1}}{\partial u^{l}}\\ &=&\delta^{l + 1}\frac{\partial(W^{l + 1}x^{l}+b^{l+1})}{\partial u^{l}}\\ &=&\delta^{l + 1}\frac{\partial(W^{l + 1}f(u^{l})+b^{l+1})}{\partial u^{l}}\\ &=&(W^{l + 1})^{T}\delta^{l + 1}\cdot f^{\prime}(u^{l}) \end{array} $$
(A.4)
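To make the derivation concrete, the following minimal NumPy sketch back-propagates the sensitivities through a small two-layer network using (A.2)-(A.4) and verifies the bias gradient against a finite-difference estimate. The layer sizes, the tanh activation, and the squared-error cost are assumptions chosen only for this check; they are not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.tanh                                   # assumed activation f(u)
df = lambda u: 1.0 - np.tanh(u) ** 2          # its derivative f'(u)

# Two fully connected layers: x0 -> u1 -> x1 -> u2 (linear output)
x0 = rng.standard_normal((4, 1))
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal((5, 1))
W2, b2 = rng.standard_normal((3, 5)), rng.standard_normal((3, 1))
y = rng.standard_normal((3, 1))               # regression target

def forward(b1_):
    u1 = W1 @ x0 + b1_                        # u^l = W^l x^{l-1} + b^l  (A.1)
    x1 = f(u1)
    u2 = W2 @ x1 + b2
    return u1, x1, u2

u1, x1, u2 = forward(b1)
L = 0.5 * np.sum((u2 - y) ** 2)               # assumed squared-error cost

# Sensitivities: delta^2 at the output, delta^1 via (A.4)
delta2 = u2 - y                               # dL/du^2 for the squared error
delta1 = (W2.T @ delta2) * df(u1)             # (A.4)

grad_b1 = delta1                              # (A.2): dL/db^1 = delta^1
grad_W1 = delta1 @ x0.T                       # (A.3): dL/dW^1 = delta^1 (x^0)^T

# Finite-difference check of one bias component
eps, i = 1e-6, 2
b1p = b1.copy(); b1p[i] += eps
Lp = 0.5 * np.sum((forward(b1p)[2] - y) ** 2)
print(grad_b1[i, 0], (Lp - L) / eps)          # the two numbers should agree closely
```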

About this article


Cite this article

Hu, Z., Hu, Y., Wu, B. et al. Hand pose estimation with multi-scale network. Appl Intell 48, 2501–2515 (2018). https://doi.org/10.1007/s10489-017-1092-z

