One-shot learning gesture recognition based on joint training of 3D ResNet and memory module

Li, Lianwei; Qin, Shiyin; Lu, Zhi; Xu, Kuanhong; Hu, Zhongying

doi:10.1007/s11042-019-08429-9

Published: 17 December 2019

Volume 79, pages 6727–6757, (2020)

Multimedia Tools and Applications

Lianwei Li¹, Shiyin Qin¹,², Zhi Lu¹, Kuanhong Xu³ & Zhongying Hu³

Abstract

Hand gesture recognition is a research hotspot in human-machine interaction, and it has progressed considerably with the development of deep neural networks. However, deep-learning-based recognition methods rely heavily on large-scale labeled datasets, which are very hard to build in practical applications. To achieve good performance under the strict constraint of few samples, this paper studies one-shot learning gesture recognition and presents a joint deep training method that combines a 3D ResNet with a memory module. The scheme jointly optimizes feature extraction by the 3D ResNet and the memory module's capacity to remember rare events, using an effective optimal-decision strategy and two relative performance indices. To implement one-shot learning gesture recognition, the memory module stores the features extracted by the well-trained 3D ResNet, and the classification decision is made by a nearest neighbor algorithm with a cosine similarity measure. Because the ability to handle negative samples plays a significant role in real-world human-machine interaction, a mechanism based on a cosine-similarity threshold is built so that an input is either classified or rejected. To validate and evaluate the proposed method, a dedicated hand gesture dataset containing 3045 gesture videos was built, and a series of experiments on this dataset and on public datasets demonstrates the feasibility and effectiveness of the method.
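The decision rule described in the abstract (nearest neighbor over stored features with cosine similarity, plus a similarity threshold for rejecting negative samples) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy three-dimensional feature vectors, the gesture labels, and the threshold value of 0.8 are all assumptions made for illustration, and the features here stand in for what a trained 3D ResNet would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def classify_or_reject(query, memory, threshold=0.8):
    """Nearest-neighbor decision with threshold-based rejection.

    memory: list of (feature_vector, label) pairs, one stored example
            per gesture class (the one-shot setting).
    threshold: hypothetical value; in practice it would be tuned.
    Returns (label, similarity); label is None when the best match
    falls below the threshold, i.e. the input is rejected.
    """
    best_label, best_sim = None, -1.0
    for key, label in memory:
        sim = cosine_similarity(query, key)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim < threshold:
        return None, best_sim
    return best_label, best_sim


# Toy memory with one stored feature per (hypothetical) gesture class.
memory = [
    ([1.0, 0.0, 0.0], "swipe_left"),
    ([0.0, 1.0, 0.0], "swipe_right"),
]

accepted = classify_or_reject([0.9, 0.1, 0.0], memory)
rejected = classify_or_reject([0.0, 0.0, 1.0], memory)
print(accepted[0], rejected[0])
```

In this sketch the first query lies close to the stored "swipe_left" feature and is accepted, while the second is orthogonal to every stored feature and is rejected as a negative sample.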


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant No. 61731001) and SONY.

Author information

Authors and Affiliations

School of Automation Science and Electrical Engineering, Beihang University, Beijing, 100191, China
Lianwei Li, Shiyin Qin & Zhi Lu
School of Electrical Engineering and Intelligentization, Dongguan University of Technology, Dongguan, 523808, Guangdong Province, China
Shiyin Qin
Artificial Intelligence Research Department, Sony China Research Laboratory, Beijing, 100028, China
Kuanhong Xu & Zhongying Hu

Corresponding author

Correspondence to Lianwei Li.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, L., Qin, S., Lu, Z. et al. One-shot learning gesture recognition based on joint training of 3D ResNet and memory module. Multimed Tools Appl 79, 6727–6757 (2020). https://doi.org/10.1007/s11042-019-08429-9

Received: 05 March 2019
Revised: 09 October 2019
Accepted: 01 November 2019
Published: 17 December 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s11042-019-08429-9
