Real-time one-shot learning gesture recognition based on lightweight 3D Inception-ResNet with separable convolutions

  • Theoretical advances
Pattern Analysis and Applications

Abstract

Gesture recognition is a popular research field in computer vision, and the application of deep neural networks has greatly improved its performance. However, general deep learning methods have a large number of parameters, which prevents their practical application on resource-limited devices. Meanwhile, collecting a large number of training samples is usually time-consuming and difficult. To this end, we propose a lightweight 3D Inception-ResNet to extract discriminative features for real-time one-shot learning gesture recognition, which aims to recognize gestures successfully given only one training sample for each new class. For efficient extraction of gesture features, we first extend the original 2D Inception-ResNet to a 3D version and then apply two kinds of separable convolutions, together with some other design strategies, to reduce the number of parameters and the computational complexity, so that feature extraction runs in real time even on a CPU. Moreover, the consumption of storage space is also greatly reduced. To obtain robust one-shot learning recognition, we employ an evolution mechanism that updates the root sample with the innovation of new samples to improve the performance of the nearest neighbor classifier. We also propose an update strategy for the dynamic threshold to deal with the problem of threshold selection in real-world applications. To improve the robustness of recognition, we conduct artificial data synthesis to augment our collected dataset. A series of experiments conducted on public datasets and our collected dataset demonstrates the effectiveness of our approach to one-shot learning gesture recognition.
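
The abstract does not name the two kinds of separable convolution. As a rough illustration of why separable convolutions shrink a 3D network, the sketch below assumes two variants that are common in lightweight video models: a depthwise separable convolution (per-channel 3D filter followed by a 1x1x1 pointwise convolution) and a spatio-temporally factorized convolution (a 1 x k x k spatial filter followed by a k x 1 x 1 temporal filter). The function names and the channel sizes are hypothetical and are not taken from the paper; biases are ignored.

```python
def conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weight count of a standard 3D convolution (bias omitted)."""
    return c_in * c_out * k_t * k_h * k_w


def depthwise_separable_params(c_in, c_out, k_t, k_h, k_w):
    """Assumed variant 1: one k_t x k_h x k_w filter per input channel,
    followed by a 1x1x1 pointwise convolution that mixes channels."""
    depthwise = c_in * k_t * k_h * k_w
    pointwise = c_in * c_out
    return depthwise + pointwise


def spatiotemporal_separable_params(c_in, c_out, k_t, k_h, k_w, c_mid=None):
    """Assumed variant 2: a 1 x k_h x k_w spatial convolution followed by
    a k_t x 1 x 1 temporal convolution. c_mid is the channel count between
    the two steps (defaults to c_out)."""
    c_mid = c_out if c_mid is None else c_mid
    spatial = c_in * c_mid * k_h * k_w
    temporal = c_mid * c_out * k_t
    return spatial + temporal


if __name__ == "__main__":
    # Illustrative layer size only; not a layer from the paper.
    c_in, c_out, k = 64, 64, 3
    print("standard 3D conv:       ", conv3d_params(c_in, c_out, k, k, k))
    print("depthwise separable:    ", depthwise_separable_params(c_in, c_out, k, k, k))
    print("spatiotemporal factored:", spatiotemporal_separable_params(c_in, c_out, k, k, k))
```

For this illustrative layer (64 input channels, 64 output channels, 3x3x3 kernels) the counts are 110,592 weights for the standard convolution, 5,824 for the depthwise separable form, and 49,152 for the spatio-temporally factorized form. The actual savings in the proposed network depend on its layer configuration, which the abstract does not give.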

Availability of data and material

Not applicable.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant No. 61731001) and SONY.

Author information

Corresponding author

Correspondence to Shiyin Qin.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

Code availability

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, L., Qin, S., Lu, Z. et al. Real-time one-shot learning gesture recognition based on lightweight 3D Inception-ResNet with separable convolutions. Pattern Anal Applic 24, 1173–1192 (2021). https://doi.org/10.1007/s10044-021-00965-1

  • DOI: https://doi.org/10.1007/s10044-021-00965-1
