Real-time one-shot learning gesture recognition based on lightweight 3D Inception-ResNet with separable convolutions

Li, Lianwei; Qin, Shiyin; Lu, Zhi; Zhang, Dinghao; Xu, Kuanhong; Hu, Zhongying

doi:10.1007/s10044-021-00965-1

Theoretical advances
Published: 23 April 2021

Pattern Analysis and Applications, Volume 24, pages 1173–1192 (2021)


Abstract

Gesture recognition is an active research field in computer vision, and deep neural networks have greatly improved its performance. However, typical deep learning models have a large number of parameters, which hinders practical deployment on resource-limited devices. Moreover, collecting a large number of training samples is usually time-consuming and difficult. To address both problems, we propose a lightweight 3D Inception-ResNet that extracts discriminative features for real-time one-shot learning gesture recognition, which aims to recognize gestures given only one training sample for each new class. For efficient extraction of gesture features, we first extend the original 2D Inception-ResNet to a 3D version and then apply two kinds of separable convolutions, along with other design strategies, to reduce the number of parameters and the computational complexity, so that feature extraction runs in real time even on a CPU. The storage footprint is also greatly reduced. To obtain robust one-shot recognition, we employ an evolution mechanism that updates the root sample with the innovation from new samples, improving the performance of the nearest neighbor classifier. We also propose an update strategy for a dynamic threshold to deal with the problem of threshold selection in real-world applications. To further improve robustness, we augment our collected dataset through artificial data synthesis. A series of experiments on public datasets and our collected dataset demonstrates the effectiveness of our approach to one-shot learning gesture recognition.
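
The abstract does not say which two separable convolutions are used or how the root sample is updated, so the following is only a rough sketch under stated assumptions, not the authors' implementation. It assumes the two factorizations are a depthwise separable convolution and a spatial-then-temporal split, and it assumes the root sample evolves as an exponential moving average of accepted features. The names `OneShotNearestNeighbor`, `momentum`, and `threshold` are hypothetical, and the threshold is kept fixed here because the paper's dynamic update rule is not given on this page.

```python
import math


def conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weight count of a standard 3D convolution (bias ignored)."""
    return c_in * c_out * k_t * k_h * k_w


def depthwise_separable_params(c_in, c_out, k_t, k_h, k_w):
    """One k_t x k_h x k_w filter per input channel, then a 1x1x1 pointwise conv."""
    return c_in * k_t * k_h * k_w + c_in * c_out


def spatiotemporal_separable_params(c_in, c_out, k_t, k_h, k_w):
    """1 x k_h x k_w spatial conv, then k_t x 1 x 1 temporal conv.

    Assumption: the intermediate channel count equals c_out.
    """
    return c_in * c_out * k_h * k_w + c_out * c_out * k_t


class OneShotNearestNeighbor:
    """Nearest neighbor over one root feature per class, with rejection."""

    def __init__(self, threshold, momentum=0.9):
        self.roots = {}             # label -> root feature vector
        self.threshold = threshold  # fixed here; the paper updates it dynamically
        self.momentum = momentum    # assumed moving-average weight on the old root

    def register(self, label, feature):
        """Store the single training sample of a new class as its root."""
        self.roots[label] = list(feature)

    def predict(self, feature):
        """Return (label, distance); label is None if no root is close enough."""
        best_label, best_dist = None, math.inf
        for label, root in self.roots.items():
            dist = math.dist(root, feature)
            if dist < best_dist:
                best_label, best_dist = label, dist
        if best_dist > self.threshold:
            return None, best_dist
        return best_label, best_dist

    def update(self, label, feature):
        """Move the root toward a newly accepted sample (assumed update rule)."""
        m = self.momentum
        self.roots[label] = [m * r + (1 - m) * f
                             for r, f in zip(self.roots[label], feature)]


if __name__ == "__main__":
    for fn in (conv3d_params, depthwise_separable_params,
               spatiotemporal_separable_params):
        print(fn.__name__, fn(64, 64, 3, 3, 3))

    clf = OneShotNearestNeighbor(threshold=1.0, momentum=0.5)
    clf.register("wave", [0.0, 0.0])
    clf.register("swipe", [4.0, 0.0])
    label, _ = clf.predict([0.6, 0.0])
    if label is not None:
        clf.update(label, [0.6, 0.0])
    print(label, clf.roots["wave"])
```

For a 3×3×3 kernel with 64 input and 64 output channels, the weight counts make the motivation concrete: both factorizations need far fewer weights than the standard 3D convolution, which is the kind of saving the abstract attributes to separable convolutions.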

Availability of data and material

Not applicable.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant No. 61731001) and by SONY.

Author information

Authors and Affiliations

School of Automation Science and Electrical Engineering, Beihang University, Beijing, 100191, China
Lianwei Li, Shiyin Qin, Zhi Lu & Dinghao Zhang
School of Electrical Engineering and Intelligentization, Dongguan University of Technology, Dongguan, 523808, Guangdong Province, China
Shiyin Qin
Artificial Intelligence Research Department, Sony China Research Laboratory, Beijing, 100028, China
Kuanhong Xu & Zhongying Hu

Corresponding author

Correspondence to Shiyin Qin.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

Code availability

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, L., Qin, S., Lu, Z. et al. Real-time one-shot learning gesture recognition based on lightweight 3D Inception-ResNet with separable convolutions. Pattern Anal Applic 24, 1173–1192 (2021). https://doi.org/10.1007/s10044-021-00965-1

Received: 29 September 2020
Accepted: 24 January 2021
Published: 23 April 2021
Issue Date: August 2021
DOI: https://doi.org/10.1007/s10044-021-00965-1
