One-shot learning gesture recognition based on joint training of 3D ResNet and memory module

Multimedia Tools and Applications

Abstract

Hand gesture recognition is a research hotspot in human-machine interaction, and it has progressed considerably with the development of deep neural networks. However, deep-learning-based recognition methods rely heavily on large-scale labeled datasets, which are very hard to build in practical applications. To achieve good performance under the strict constraint of very few samples, this paper studies one-shot learning gesture recognition and presents a joint deep training method that combines a 3D ResNet with a memory module. The scheme jointly optimizes feature extraction by the 3D ResNet and the memory module's capacity to remember rare events, using an effective optimal-decision strategy and two related performance indices. To implement one-shot learning gesture recognition, the memory module stores the features extracted by the well-trained 3D ResNet, and the classification decision is made by the nearest neighbor algorithm with a cosine similarity measure. In real-world human-machine interaction, the ability to handle negative samples plays a significant role, so a mechanism based on a cosine similarity threshold is built to classify known gestures and reject negative samples. To validate and evaluate the proposed method, a dedicated hand gesture dataset containing 3045 gesture videos was built, and a series of experimental results on this collected dataset and on public datasets demonstrate the feasibility and effectiveness of the method.
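The decision rule described in the abstract (nearest neighbor lookup with cosine similarity, plus rejection below a similarity threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `OneShotMemory`, the threshold value 0.8, the gesture labels, and the three-dimensional toy feature vectors are all assumptions made for illustration, and the 3D ResNet feature extractor is omitted entirely.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class OneShotMemory:
    """Hypothetical key-value memory: one stored feature vector per gesture label."""

    def __init__(self, threshold: float = 0.8) -> None:
        # The threshold is an illustrative value, not one taken from the paper.
        self.threshold = threshold
        self.keys: List[List[float]] = []
        self.labels: List[str] = []

    def remember(self, feature: Sequence[float], label: str) -> None:
        """Store a single example (one shot) of a gesture class."""
        self.keys.append(list(feature))
        self.labels.append(label)

    def classify(self, feature: Sequence[float]) -> Tuple[Optional[str], float]:
        """Return (label, similarity) for the nearest stored key.

        The label is None when the best similarity is below the threshold,
        i.e. the query is rejected as a negative sample.
        """
        if not self.keys:
            return None, 0.0
        sims = [cosine_similarity(feature, key) for key in self.keys]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < self.threshold:
            return None, sims[best]
        return self.labels[best], sims[best]


# Toy usage: in the paper's setting the vectors would come from the 3D ResNet.
memory = OneShotMemory(threshold=0.8)
memory.remember([1.0, 0.0, 0.0], "swipe_left")
memory.remember([0.0, 1.0, 0.0], "swipe_right")
print(memory.classify([0.9, 0.1, 0.0]))  # close to the "swipe_left" key
print(memory.classify([0.0, 0.0, 1.0]))  # orthogonal to both keys, so rejected
```

Raising the threshold makes the sketch reject more queries at the cost of discarding some genuine gestures; lowering it does the opposite. How the paper chooses its threshold is not stated in the abstract.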


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant No. 61731001) and SONY.

Author information

Corresponding author

Correspondence to Lianwei Li.

Cite this article

Li, L., Qin, S., Lu, Z. et al. One-shot learning gesture recognition based on joint training of 3D ResNet and memory module. Multimed Tools Appl 79, 6727–6757 (2020). https://doi.org/10.1007/s11042-019-08429-9

  • DOI: https://doi.org/10.1007/s11042-019-08429-9
