A robust multi-scale deep learning approach for unconstrained hand detection aided by skin segmentation

  • Original article
The Visual Computer

Abstract

Robust detection of hands at different scales, especially small hands, remains a challenge in computer vision. In this work, we design a multi-scale deep learning algorithm to detect hands in unconstrained scenarios as well as in frames from driving videos. Our models improve detection accuracy on several widely used benchmark datasets. We show that a set of shallow parallel Faster-RCNNs can achieve higher accuracy than a single deep Faster-RCNN, since deeper layers lose fine features because of their larger strides. We achieve average precision of 77.1%, 86.53%, 91.43%, and 74.43% on the Oxford hand, VIVA, CVRR, and ICD datasets, respectively. Furthermore, the proposed approach can detect hands as small as \(15\times 15\) pixels, which previous works could not. Our analysis shows that context modules (human and skin) improve detection by reducing false positives. To this end, we propose several segmentation approaches based on dilated convolution and adversarial learning that isolate skin regions faster and more accurately. The skin detection accuracies of the proposed algorithm on the IBTD, Pratheepan, Uchile, and HRG datasets are 94.52%, 96.49%, 90.74%, and 98.86%, respectively.
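The two architectural arguments in the abstract (larger strides lose fine features, and dilated convolution widens context) can be illustrated with simple arithmetic. The sketch below is not the authors' code: the stride values and dilation rates are assumptions chosen for illustration, and the helper names `feature_footprint` and `dilated_receptive_field` are hypothetical. It shows how many feature-map cells a 15-pixel hand covers at a given cumulative stride, and how the receptive field of stacked stride-1 convolutions grows with dilation.

```python
def feature_footprint(object_px: int, stride: int) -> float:
    """Side length, in feature-map cells, covered by an object of object_px pixels
    when the feature map has the given cumulative stride."""
    return object_px / stride


def dilated_receptive_field(dilations, kernel_size: int = 3) -> int:
    """Receptive field (in input pixels) of stacked stride-1 convolutions,
    one layer per entry in `dilations`."""
    rf = 1
    for d in dilations:
        # Each layer extends the field by (kernel_size - 1) * dilation pixels.
        rf += (kernel_size - 1) * d
    return rf


if __name__ == "__main__":
    # Assumed strides, typical of convolutional backbones (not taken from the paper).
    for stride in (4, 8, 16, 32):
        cells = feature_footprint(15, stride)
        print(f"stride {stride:2d}: a 15 px hand spans {cells:.2f} cells per side")

    # Illustrative dilation rates (not taken from the paper).
    print("three plain 3x3 layers:", dilated_receptive_field([1, 1, 1]))
    print("dilations 1, 2, 4:     ", dilated_receptive_field([1, 2, 4]))
```

Under these assumptions, a 15-pixel hand covers less than one cell per side once the stride reaches 16, which is consistent with the abstract's claim that shallower branches retain the detail needed for small hands. Dilation enlarges the receptive field without adding stride, so spatial resolution is preserved.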


Author information

Corresponding author

Correspondence to Kankana Roy.

About this article

Cite this article

Roy, K., Sahay, R.R. A robust multi-scale deep learning approach for unconstrained hand detection aided by skin segmentation. Vis Comput 38, 2801–2825 (2022). https://doi.org/10.1007/s00371-021-02157-8

  • DOI: https://doi.org/10.1007/s00371-021-02157-8
