A robust multi-scale deep learning approach for unconstrained hand detection aided by skin segmentation

Roy, Kankana; Sahay, Rajiv Ranjan

doi:10.1007/s00371-021-02157-8

Original article
Published: 18 May 2021

Volume 38, pages 2801–2825, (2022)

The Visual Computer

Abstract

Robust detection of hands at different scales, especially small hands, remains a challenge in computer vision. In this work, we design a multi-scale deep learning algorithm to detect hands in unconstrained scenarios as well as in frames from driving videos. Our models improve detection accuracy on several widely used benchmark datasets. We show that a set of shallow parallel Faster-RCNNs can achieve higher accuracy than a single deep Faster-RCNN, since deeper layers lose fine features because of their larger strides. We achieve 77.1%, 86.53%, 91.43%, and 74.43% average precision on the Oxford hand, VIVA, CVRR, and ICD datasets, respectively. Furthermore, the proposed approach can detect hands as small as $15\times 15$ pixels, which was not possible with previous methods. Our analysis shows that different context modules (human and skin) improve detection by reducing false positives. For this purpose, we propose several segmentation approaches based on dilated convolution and adversarial learning that isolate skin regions faster and more accurately. The skin detection accuracies obtained with the proposed algorithm on the IBTD, Pratheepan, Uchile, and HRG datasets are 94.52%, 96.49%, 90.74%, and 98.86%, respectively.
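The stride argument in the abstract can be illustrated with a little arithmetic. The sketch below is not the authors' implementation. It assumes hypothetical feature-map strides of 4, 8, and 16 (typical of VGG-style backbones, not stated in the abstract) and a stack of 3x3 stride-1 convolutions with assumed dilation rates. It shows how few feature-map cells a 15-pixel hand covers at each stride, and how dilation enlarges the receptive field without any further downsampling.

```python
from typing import Sequence


def feature_cells(object_px: int, stride: int) -> float:
    """Approximate side length, in feature-map cells, of an object
    that is object_px pixels wide after downsampling by `stride`."""
    return object_px / stride


def dilated_receptive_field(kernel: int, dilations: Sequence[int]) -> int:
    """Receptive field (in input pixels) of a stack of stride-1
    convolutions with the given kernel size and dilation rates."""
    rf = 1
    for d in dilations:
        # Each layer widens the receptive field by (kernel - 1) * dilation.
        rf += (kernel - 1) * d
    return rf


if __name__ == "__main__":
    # Strides 4, 8, 16 are illustrative assumptions, not values from the paper.
    for stride in (4, 8, 16):
        print(f"stride {stride}: {feature_cells(15, stride)} cells")

    # Three plain 3x3 layers versus three dilated 3x3 layers (rates assumed).
    print(dilated_receptive_field(3, [1, 1, 1]))
    print(dilated_receptive_field(3, [1, 2, 4]))
```

Under these assumptions, a 15-pixel hand spans less than one cell at stride 16, which is consistent with the abstract's point that deeper layers lose fine features. The dilated stack reaches a 15-pixel receptive field with the same number of layers as the plain stack, which only reaches 7.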

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, West Bengal, India
Kankana Roy
Department of Electrical Engineering, Indian Institute of Technology, Kharagpur, West Bengal, India
Rajiv Ranjan Sahay

Corresponding author

Correspondence to Kankana Roy.

About this article

Cite this article

Roy, K., Sahay, R.R. A robust multi-scale deep learning approach for unconstrained hand detection aided by skin segmentation. Vis Comput 38, 2801–2825 (2022). https://doi.org/10.1007/s00371-021-02157-8

Accepted: 27 April 2021
Published: 18 May 2021
Issue Date: August 2022
DOI: https://doi.org/10.1007/s00371-021-02157-8
