
Comprehensive features with randomized decision forests for hand segmentation from color images in uncontrolled indoor scenarios


Abstract

Hand segmentation is an integral part of many computer vision applications, especially gesture recognition. Training a classifier to label pixels as hand or background using skin color as a feature is one of the most popular methods for this purpose. This approach has been restricted to simple hand segmentation scenarios, since color alone provides very limited information for classification. Meanwhile, there has been a rise of segmentation methods that use deep learning networks to exploit multiple layers of complex features learned from image data. Yet a deep neural network requires a large database for training and, owing to its computational complexity, a powerful machine for operation. In this work, the development of comprehensive features, and the optimized use of these features with a randomized decision forest (RDF) classifier, is investigated for the task of hand segmentation in uncontrolled indoor environments. Newly designed image features and new implementations are provided, together with evaluations of their hand segmentation performance. In total, seven image features that extract pixel- or neighborhood-related properties from color images are proposed and evaluated, individually as well as in combination. The behaviour of the feature and RDF parameters is also evaluated, and optimum parameters for the scenario under consideration are identified. Additionally, a new dataset containing hand images in uncontrolled indoor scenarios was created during this work. It was observed that a combination of features extracting color, texture, neighborhood histogram and neighborhood probability information outperforms existing methods for hand segmentation in restricted as well as unrestricted indoor environments, using just a small training dataset. The computations required for the proposed features and the RDF classifier are light; hence the segmentation algorithm is suited to embedded devices with limited power, memory and computational capacity.
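
As a rough illustration of the pipeline the abstract describes, the sketch below trains a forest to label pixels as hand or background. It is a minimal stand-in under stated assumptions: plain HSV color replaces the paper's seven features, and scikit-learn's RandomForestClassifier replaces the paper's RDF implementation; all names are illustrative.

```python
import numpy as np
import cv2
from sklearn.ensemble import RandomForestClassifier

def pixel_features(bgr_image):
    # Per-pixel feature vectors; plain HSV color stands in here for the
    # paper's richer seven-feature set.
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    return hsv.reshape(-1, 3).astype(np.float32)

def train_segmenter(train_img, train_mask, n_trees=20, max_depth=15):
    # train_img: H x W x 3 BGR image; train_mask: H x W mask (1 = hand).
    X = pixel_features(train_img)
    y = train_mask.reshape(-1)
    clf = RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth)
    clf.fit(X, y)
    return clf

def segment(clf, test_img):
    # Classify every pixel and reshape predictions back to image form.
    labels = clf.predict(pixel_features(test_img))
    return labels.reshape(test_img.shape[:2]).astype(np.uint8)
```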




Author information


Correspondence to Manu Martin.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Features

A.1 Comparison of the new implementation to estimate texture using Gabor filters

Figure 20 provides a comparison of the filter response using the new implementation and the original method for one of the selected scales and orientations. The error in this case was 0.02% in the mean-square sense.

Fig. 20: Gabor filter magnitude response comparison
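
The comparison in Fig. 20 can be reproduced in spirit with the sketch below, which contrasts a directly computed Gabor magnitude response with a faster approximation that filters a downsampled image and upsamples the result. The paper's exact fast scheme is not reproduced; only the scale-down idea and the relative mean-square error come from the text, and the filename and filter settings are assumptions.

```python
import numpy as np
import cv2

def gabor_magnitude(gray, ksize, sigma, theta, lambd):
    # Magnitude of the response to a single Gabor kernel.
    kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, 0.5)
    return np.abs(cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kernel))

gray = cv2.imread("hand.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input image

# Direct response at a coarse scale (large wavelength).
direct = gabor_magnitude(gray, ksize=31, sigma=8.0, theta=0.0, lambd=16.0)

# Faster variant: filter a half-resolution image with a half-size kernel,
# then upsample the response back to the original resolution.
small = cv2.resize(gray, None, fx=0.5, fy=0.5, interpolation=cv2.INTER_AREA)
approx = gabor_magnitude(small, ksize=15, sigma=4.0, theta=0.0, lambd=8.0)
approx = cv2.resize(approx, (gray.shape[1], gray.shape[0]))

# Relative mean squared error between the two responses (cf. the 0.02% above).
rel_mse = 100.0 * np.mean((direct - approx) ** 2) / np.mean(direct ** 2)
print(f"relative MSE: {rel_mse:.4f}%")
```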

Appendix B: Optimization of feature parameters

B.1 Gabor texture feature

As explained in Section 3.2, there are two parameters associated with the Gabor texture feature: the number of scales (nScale) and the number of orientations (nOrient). Figure 21 shows how nScale affects the segmentation performance and the feature extraction time. The maximum possible value for nScale is 5, limited by the image dimensions and the scale-down operation used for the faster implementation. It can be observed from the figure that precision increased with the number of scales for both scenarios. On the other hand, the behaviour of recall differed between the databases. The time required for feature estimation showed an almost linear relationship with the number of scales. nScale = 3 gave good performance in both scenarios.

Fig. 21: Effects of nScale on segmentation (nOrient = 8)

The effect of nOrient on the segmentation output is shown in Fig. 22. Both scenarios displayed convergence of the evaluation measures when 8 orientations were used. The time for feature extraction varied linearly with the number of orientations because of the corresponding increase in filtering operations required at each scale.

Fig. 22: Effects of nOrient on segmentation (nScale = 3)
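
A sketch of how such a Gabor texture bank might be assembled is given below; the loop structure makes the roughly linear cost in nScale and nOrient visible. The wavelength, bandwidth, and kernel-size choices are illustrative assumptions, not the paper's settings.

```python
import numpy as np
import cv2

def gabor_features(gray, n_scale=3, n_orient=8):
    # One response plane per (scale, orientation) pair; cost grows roughly
    # linearly in both n_scale and n_orient, as observed above.
    gray = gray.astype(np.float32)
    responses = []
    for s in range(n_scale):
        lambd = 4.0 * (2 ** s)      # wavelength doubling per scale (assumed)
        sigma = 0.56 * lambd        # common bandwidth heuristic
        ksize = int(6 * sigma) | 1  # odd kernel size covering the envelope
        for o in range(n_orient):
            theta = o * np.pi / n_orient
            k = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, 0.5)
            responses.append(np.abs(cv2.filter2D(gray, cv2.CV_32F, k)))
    return np.stack(responses, axis=-1)  # H x W x (n_scale * n_orient)
```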

B.2 Laws texture feature

The number of scales (nScale) and the filter width (fWidth) are the two parameters of the Laws texture approach. The influence of nScale on the segmentation output when fWidth = 3 and fWidth = 5 is shown in Figs. 23 and 24 respectively. The effects of the number of scales on segmentation performance were observed to be very similar to those of the Gabor texture method, with optimum results achieved for an nScale value of 3. The segmentation outputs with 5×5 filters were slightly better than those with 3×3 filters.

Fig. 23: Effects of nScale parameter of Laws approach (fWidth = 3)

Fig. 24: Effects of nScale parameter of Laws approach (fWidth = 5)
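
For reference, the sketch below computes classic Laws texture energy features for fWidth = 5 using the standard L5/E5/S5/R5/W5 vectors; the 3×3 case would use the analogous L3/E3/S3 set. How the paper combines scales is not shown here, so the multi-scale aspect is omitted.

```python
import numpy as np
import cv2

# Standard 1-D Laws vectors for fWidth = 5.
LAWS_5 = {
    "L5": np.array([1, 4, 6, 4, 1], np.float32),    # level
    "E5": np.array([-1, -2, 0, 2, 1], np.float32),  # edge
    "S5": np.array([-1, 0, 2, 0, -1], np.float32),  # spot
    "R5": np.array([1, -4, 6, -4, 1], np.float32),  # ripple
    "W5": np.array([-1, 2, 0, -2, 1], np.float32),  # wave
}

def laws_energy(gray, window=15):
    gray = gray.astype(np.float32)
    feats = []
    for a in LAWS_5.values():
        for b in LAWS_5.values():
            mask = np.outer(a, b)  # 5 x 5 separable Laws mask
            resp = cv2.filter2D(gray, cv2.CV_32F, mask)
            # Texture energy: local average of the absolute response.
            feats.append(cv2.boxFilter(np.abs(resp), cv2.CV_32F, (window, window)))
    return np.stack(feats, axis=-1)  # H x W x 25
```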

B.3 Neighborhood difference feature

The effects of the three parameters neiCount, neiSpace and neiOrient on the segmentation performance when the HSV color space was used are shown in Figs. 25, 26 and 27 respectively. Increasing the number of neighbors improved all three evaluation measures for both scenarios, at the expense of the time required for feature estimation, which increased linearly. In the case of neiSpace, all the evaluation measures converged to a maximum at the value 5; the time was unaffected by this parameter. Lastly, all the evaluation measures showed convergence for a neiOrient value of 8. The time for feature estimation showed an almost linear relationship with the number of neighbor orientations.

Fig. 25: Effects of neiCount (neiSpace = 1 & neiOrient = 8)

Fig. 26: Effects of neiSpace (neiCount = 25 & neiOrient = 8)

Fig. 27: Effects of neiOrient (neiCount = 25 & neiSpace = 5)
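
A hypothetical reading of the neighborhood difference feature, consistent with the parameter roles described above (neiOrient evenly spaced directions, neighbors at multiples of neiSpace, about neiCount samples in total), is sketched below. The paper's exact sampling pattern and color handling may differ.

```python
import numpy as np

def neighborhood_difference(channel, nei_count=25, nei_space=5, nei_orient=8):
    # Differences between each pixel and neighbors sampled along nei_orient
    # evenly spaced directions at multiples of nei_space pixels.
    channel = channel.astype(np.float32)
    feats = []
    steps = max(1, nei_count // nei_orient)  # assumed split of neiCount
    for o in range(nei_orient):
        angle = 2.0 * np.pi * o / nei_orient
        for s in range(1, steps + 1):
            dy = int(round(np.sin(angle) * s * nei_space))
            dx = int(round(np.cos(angle) * s * nei_space))
            # np.roll wraps at the borders; a real implementation would pad.
            shifted = np.roll(np.roll(channel, dy, axis=0), dx, axis=1)
            feats.append(channel - shifted)
    return np.stack(feats, axis=-1)
```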

B.4 Neighborhood histogram feature

The effects of the nBins and hWidth parameters of the neighborhood histogram feature on segmentation performance using the H channel are shown in Figs. 28 and 29 respectively. In the case of nBins, all evaluation measures converged to a maximum for both scenarios. The time requirement varied linearly with the number of bins. On the other hand, the hWidth parameter showed a decrease in recall and f-score beyond a threshold value. Due to the custom implementation used, the time was unaffected by this parameter.

Fig. 28: Effects of nBins (hWidth = 51)

Fig. 29: Effects of hWidth (nBins = 30)
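
One plausible reading of the "custom implementation" that makes the run time independent of hWidth is an integral histogram, sketched below: after a one-off cumulative sum per bin, each window histogram needs only four lookups per bin regardless of the window width. This is an assumption about the implementation, not a description of it.

```python
import numpy as np

def neighborhood_histogram(h_channel, n_bins=30, h_width=51):
    # Assumes an OpenCV-style H channel with values in [0, 180).
    h, w = h_channel.shape
    bins = np.minimum(h_channel.astype(np.int64) * n_bins // 180, n_bins - 1)
    # Integral histogram: one cumulative-sum plane per bin (built once).
    integral = np.zeros((h + 1, w + 1, n_bins), np.float32)
    for b in range(n_bins):
        integral[1:, 1:, b] = np.cumsum(np.cumsum(bins == b, axis=0), axis=1)
    r = h_width // 2
    y0 = np.clip(np.arange(h) - r, 0, h)
    y1 = np.clip(np.arange(h) + r + 1, 0, h)
    x0 = np.clip(np.arange(w) - r, 0, w)
    x1 = np.clip(np.arange(w) + r + 1, 0, w)
    # Each window histogram costs four lookups per bin, independent of h_width.
    hist = (integral[y1][:, x1] - integral[y0][:, x1]
            - integral[y1][:, x0] + integral[y0][:, x0])
    return hist / hist.sum(axis=-1, keepdims=True)  # normalized, H x W x nBins
```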

B.5 Neighborhood probability feature

The effects of the different parameters of the neighborhood probability feature on the segmentation performance are given in Figs. 30, 31 and 32. The behaviour was found to be similar to that of the neighborhood difference feature.

Fig. 30: Effects of neiCount (neiSpace = 10 & neiOrient = 8)

Fig. 31: Effects of neiSpace (neiCount = 4 & neiOrient = 8)

Fig. 32: Effects of neiOrient (neiCount = 4 & neiSpace = 10)

Appendix C: Evaluation of other feature combinations

Figure 33 shows the evaluation results for the feature combinations that are not listed in Section 4.4. The patterns visible in these results are similar to those identified earlier.

Fig. 33: F-scores of other feature combinations


About this article


Cite this article

Martin, M., Nguyen, T., Yousefi, S. et al. Comprehensive features with randomized decision forests for hand segmentation from color images in uncontrolled indoor scenarios. Multimed Tools Appl 78, 20987–21020 (2019). https://doi.org/10.1007/s11042-019-7445-3
