
One-shot learning hand gesture recognition based on modified 3d convolutional neural networks

  • Original Paper
  • Published:
Machine Vision and Applications

Abstract

Deep neural networks have played a very important role in vision-based hand gesture recognition, but they typically require large numbers of annotated samples for training. Moreover, practical applications often present cases in which only one sample is available for a new gesture class, so that conventional recognition methods cannot achieve satisfactory classification performance. In this paper, transfer learning is employed to build an effective one-shot learning network architecture for this intractable problem: knowledge gained from deep training on a large dataset of related objects is transferred to strengthen one-shot learning hand gesture recognition (OSLHGR), rather than training a network from scratch. Following this idea, a well-designed convolutional architecture with deep layers, C3D (Tran et al. in: ICCV, pp 4489–4497, 2015), is modified to extract spatiotemporal features. Fine-tuning on a single sample of each new class then completes the one-shot learning. Classification is carried out both by a Softmax classifier and by geometric classification based on Euclidean distance. A series of experiments on two benchmark datasets, VIVA (Vision for Intelligent Vehicles and Applications) and SKIG (Sheffield Kinect Gesture), demonstrates the state-of-the-art recognition accuracy of the proposed method. In addition, a dedicated gesture dataset, BSG, was built with a SoftKinetic DS325 camera to test OSLHGR; the results verify the method's strong classification performance and real-time response speed.
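As a concrete illustration of the geometric classification mentioned in the abstract, the sketch below performs nearest-prototype matching by Euclidean distance over feature vectors. This is a minimal, hypothetical example: the toy 4-D vectors stand in for the spatiotemporal features a modified C3D network would extract, and the function name, class labels, and vector dimension are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def nearest_prototype(query_feat, prototypes):
    """Return the label of the class whose one-shot prototype feature
    lies closest to the query feature in Euclidean distance."""
    labels = list(prototypes)
    dists = [np.linalg.norm(query_feat - prototypes[c]) for c in labels]
    return labels[int(np.argmin(dists))]

# Toy 4-D "embeddings" standing in for C3D features of the single
# training sample of each new gesture class.
protos = {
    "wave":  np.array([1.0, 0.0, 0.0, 0.0]),
    "swipe": np.array([0.0, 1.0, 0.0, 0.0]),
}

print(nearest_prototype(np.array([0.9, 0.1, 0.0, 0.0]), protos))  # wave
```

Because each new class contributes only one sample, its feature vector serves directly as the class prototype; a Softmax head, by contrast, would require re-training output weights for every added class.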

References

  1. Mitra, S., Acharya, T.: Gesture recognition: a survey. IEEE Trans. Syst. Man Cybern. Part C 37(3), 311–324 (2007)

  2. Rautaray, S.S., Agrawal, A.: Vision based hand gesture recognition for human computer interaction: a survey. Artif. Intell. Rev. 43(1), 1–54 (2015)

  3. Qian, K., Niu, J., Yang, H.: Developing a gesture based remote human-robot interaction system using Kinect. Int. J. Smart Home 7(4), 203–208 (2013)

  4. Weaver, J., Starner, T., Pentland, A.: Real-time American sign language recognition using desk and wearable computer based video. IEEE Trans. Pattern Anal. Mach. Intell. 20(12), 1371–1375 (1998)

  5. Porikli, F., Brémond, F., Dockstader, S.L., Ferryman, J., Hoogs, A., Lovell, B.C., Pankanti, S., Rinner, B., Tu, P., Venetianer, P.L.: Video surveillance: past, present, and now the future. IEEE Signal Process. Mag. 30(3), 190–198 (2013)

  6. Reifinger, S., Wallhoff, F., Ablassmeier, M., Poitschke, T., Rigoll, G.: Static and dynamic hand-gesture recognition for augmented reality applications. In: Proceedings of the 12th International Conference on Human-computer Interaction: Intelligent Multimodal Interaction Environments, pp. 728–737 (2007)

  7. Molchanov, P., Gupta, S., Kim, K., Kautz, J.: Hand gesture recognition with 3d convolutional neural networks. In: CVPR, pp. 1–7 (2015)

  8. Molchanov, P., Yang, X., Gupta, S., Kim, K., Tyree, S., Kautz, J.: Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In: CVPR, pp. 4207–4215 (2016)

  9. Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 594–611 (2006)

  10. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1106–1114 (2012)

  11. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR, pp. 1–9 (2015)

  12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  13. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. PAMI 35(1), 221–231 (2013)

  14. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR, pp. 1725–1732 (2014)

  15. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: ICCV, pp. 4489–4497 (2015)

  16. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? In: NIPS, pp. 3320–3328 (2014)

  17. Guyon, I., Athitsos, V., Jangyodsuk, P., Escalante, H.J.: The chalearn gesture dataset (CGD 2011). Mach. Vis. Appl. 25(8), 1929–1951 (2014)

  18. Wu, D., Zhu, F., Shao, L.: One shot learning gesture recognition from RGBD images. In: CVPR, pp. 7–12 (2012)

  19. Fanello, S.R., Gori, I., Metta, G., Odone, F.: One-shot learning for real-time action recognition. In: Iberian Conference on Pattern Recognition and Image Analysis, pp. 31–40 (2013)

  20. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR, pp. 886–893 (2005)

  21. Wan, J., Ruan, Q., Li, W., Deng, S.: One-shot learning gesture recognition from RGB-D data using bag of features. J. Mach. Learn. Res. 14(1), 2549–2582 (2013)

  22. Wan, J., Ruan, Q.Q., Lei, W., An, G.Y., Zhao, R.Z.: 3D SMoSIFT: three-dimensional sparse motion scale invariant feature transform for activity recognition from RGB-D videos. J. Electron. Imaging 23(2), 1709–1717 (2014)

  23. Wan, J., Guo, G., Li, S.Z.: Explore efficient local features from RGB-D data for one-shot learning gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell. 38(8), 1626–1639 (2016)

  24. Yang, W., Wang, Y., Mori, G.: Human action recognition from a single clip per action. In: ICCV, pp. 482–489 (2009)

  25. Mahbub, U., Imtiaz, H., Roy, T., Rahman, M.S., Ahad, M.A.R.: A template matching approach of one-shot-learning gesture recognition. Pattern Recognit. Lett. 34(15), 1780–1788 (2012)

  26. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS, pp. 568–576 (2014)

  27. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: towards good practices for deep action recognition. In: ECCV, pp. 20–36 (2016)

  28. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML, pp. 448–456 (2015)

  29. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: CVPR, pp. 1933–1941 (2016)

  30. Duan, J., Zhou, S., Wan, J., Guo, X., Li, S.Z.: Multi-modality fusion based on consensus-voting and 3D convolution for isolated gesture recognition. arXiv preprint arXiv:1611.06689 (2016)

  31. Zhu, G., Zhang, L., Mei, L., Shao, J., Song, J., Shen, P.: Large-scale isolated gesture recognition using pyramidal 3d convolutional networks. In: ICPR, pp. 19–24 (2016)

  32. Tran, D., Ray, J., Shou, Z., Chang, S.-F., Paluri, M.: Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038 (2017)

  33. Miao, Q., Li, Y., Ouyang, W., Ma, Z., Xu, X., Shi, W., Cao, X.: Multimodal gesture recognition based on the ResC3D network. In: CVPR, pp. 3047–3055 (2017)

  34. Molchanov, P., Gupta, S., Kim, K., Pulli, K.: Multi-sensor system for driver’s hand-gesture recognition. In: IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, pp. 1–8 (2015)

  35. Zhu, G., Zhang, L., Shen, P., Song, J.: Multimodal gesture recognition using 3d convolution and convolutional lstm. IEEE Access 5, 4517–4524 (2017)

  36. Zhang, L., Zhu, G., Shen, P., Song, J.: Learning spatiotemporal features using 3DCNN and convolutional LSTM for gesture recognition. In: ICCV, pp. 3120–3128 (2017)

  37. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot image recognition. In: ICML (2015)

  38. Xu, Z., Zhu, L., Yang, Y.: Few-shot object recognition from machine-labeled web images. In: CVPR, pp. 5358–5366 (2016)

  39. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR, pp. 580–587 (2014)

  40. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 640–651 (2015)

  41. Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC (2016)

  42. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)

  43. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2010)

  44. Zhuo, L., Jiang, L., Zhu, Z., Li, J., Zhang, J., Long, H.: Vehicle classification for large scale traffic surveillance videos using convolutional neural networks. Mach. Vis. Appl. 28(7), 793–802 (2017)

  45. Lin, M., Chen, Q., Yan, S.C.: Network in network. In: International Conference on Learning Representations, abs/1312.4400 (2014). arXiv:1312.4400

  46. Ohn-Bar, E., Trivedi, M.M.: Hand gesture recognition in real-time for automotive interfaces: a multimodal vision-based approach and evaluations. IEEE Trans. Intell. Transport Syst. 15(6), 2368–2377 (2014)

  47. Oreifej, O., Liu, Z.: Hon4d: histogram of oriented 4d normals for activity recognition from depth sequences. In: CVPR, pp. 716–723 (2013)

  48. Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vis. 103(1), 60–79 (2013)

  49. Klaser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3d-gradients. In: BMVC 2008—19th British Machine Vision Conference, pp. 1–10 (2008)

  50. Hadfield, S., Bowden, R.: Hollywood 3d: recognizing actions in 3d natural scenes. In: CVPR, pp. 3398–3405 (2013)

  51. Castro, F.M., Marín-Jiménez, M.J., Guil, N.: Multimodal features fusion for gait, gender and shoes recognition. Mach. Vis. Appl. 27(8), 1213–1228 (2016)

  52. Zhang, C., Yan, J., Li, C., Hu, H., Bie, R.: End-to-end learning for image-based air quality level estimation. Mach. Vis. Appl. 29(4), 601–615 (2018)

  53. Liu, L., Shao, L.: Learning discriminative representations from RGB-D video data. In: International Joint Conference on Artificial Intelligence, pp. 1493–1500 (2013)

  54. Choi, H., Park, H.: A hierarchical structure for gesture recognition using RGB-D sensor. In: Proc. 2nd Int. Conf. Human-Agent Interact. pp. 265–268 (2014)

  55. Cirujeda, P., Binefa, X.: 4DCov: a nested covariance descriptor of spatio-temporal features for gesture recognition in depth sequences. In: Proc. 2nd Int. Conf. 3D Vis., Dec. pp. 657–664 (2014)

  56. Liu, M., Liu, H.: Depth context: a new descriptor for human activity recognition by using sole depth sequences. Neurocomputing 175, 747–758 (2016)

  57. Tung, P.T., Ngoc, L.Q.: Elliptical density shape model for hand gesture recognition. In: Proc. 5th Symp. Inf. Commun. Technol. pp. 186–191 (2014)

  58. Nishida, N., Nakayama, H.: Multimodal gesture recognition using multi-stream recurrent neural network. Image Video Technol. 9431, 682–694 (2015)

  59. Zheng, J., Feng, Z., Xu, C., Hu, J., Ge, W.: Fusing shape and spatio-temporal features for depth-based dynamic hand gesture recognition. Multimed. Tools Appl. 76(20), 20525–20544 (2017)

  60. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L1 optical flow. Pattern Recognition, pp. 214–223 (2007)

  61. Achanta, R., Hemami, S.S., Estrada, F., Susstrunk, S.: Frequency-tuned salient region detection. In: CVPR, pp. 1597–1604 (2009)

  62. Sokolova, M., Lapalme, G.: A systematic analysis of performance measures for classification tasks. Inform. Process. Manag. 45(4), 427–437 (2009)

  63. Käding, C., Rodner, E., Freytag, A., Denzler, J.: Fine-tuning deep neural networks in continuous learning scenarios. In: Interpretation and Visualization of Deep Neural Nets, pp. 588–605 (2016)

  64. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

Acknowledgements

This work was partly supported by the National Natural Science Foundation of China (Grant Nos. 61731001 and U1435220).

Author information

Corresponding author

Correspondence to Zhi Lu.


Cite this article

Lu, Z., Qin, S., Li, X. et al. One-shot learning hand gesture recognition based on modified 3d convolutional neural networks. Machine Vision and Applications 30, 1157–1180 (2019). https://doi.org/10.1007/s00138-019-01043-7
