
Real-time continuous detection and recognition of dynamic hand gestures in untrimmed sequences based on end-to-end architecture with 3D DenseNet and LSTM


Abstract

With the continuous development of deep learning theory, novel gesture recognition approaches keep emerging and their performance keeps improving. However, most research focuses on the recognition of isolated gestures, while the detection and recognition of continuous gestures are rarely studied. To this end, aiming at the real-time detection and classification of dynamic gestures in untrimmed sequences, an end-to-end architecture based on variants of 3D DenseNet and a unidirectional LSTM is proposed as an effective tool for extracting discriminative spatio-temporal features from untrimmed hand gesture sequences. Connectionist temporal classification is then used to train the network on a publicly available large-scale gesture dataset, so that the representations learned from these large gesture samples can be transferred to enhance the learning ability of the proposed network. The class-conditional probability that an incoming sequence belongs to a given gesture class is predicted and compared with a predefined threshold to automatically determine the start and end of gestures. In addition, to enhance the classification accuracy on segmented gestures, a bidirectional LSTM network is used to model the temporal information, taking both past and future frames into account. Finally, a continuous gesture dataset collected indoors for a specific application is introduced to validate the proposed method. On this challenging dataset, the 3D DenseNet-LSTM model achieves real-time early detection and classification on unsegmented gesture sequences, and the 3D DenseNet-BiLSTM not only achieves an accuracy of 92.06% on segmented gestures, but also classification accuracies of 89.8% and 99.7% on the nvGesture and SKIG public datasets, respectively. The experimental results demonstrate the advantages of the proposed method in detection and classification performance as well as in real-time response speed.
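The abstract describes detection as comparing the predicted class-conditional probability of incoming frames against a predefined threshold to decide when a gesture starts and ends. The following Python (PyTorch) sketch illustrates that idea only; it is not the authors' implementation, and the network sizes, class count, threshold value, and names such as DenseNet3DLSTM and PROB_THRESHOLD are illustrative assumptions standing in for the 3D DenseNet backbone and unidirectional LSTM described in the paper.

# Minimal sketch (not the authors' code): threshold the per-frame class
# probabilities of a toy 3D-conv + LSTM model to mark gesture start/end
# in an unsegmented stream. All sizes and the threshold are assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 10        # assumed number of gesture classes
PROB_THRESHOLD = 0.5    # assumed detection threshold

class DenseNet3DLSTM(nn.Module):
    """Toy stand-in: small 3D conv backbone followed by a unidirectional LSTM."""
    def __init__(self, num_classes=NUM_CLASSES + 1):  # +1 for a CTC-style blank
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),        # keep time axis, pool space
        )
        self.lstm = nn.LSTM(input_size=16, hidden_size=64, batch_first=True)
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):                            # clip: (B, 3, T, H, W)
        feat = self.backbone(clip)                      # (B, 16, T, 1, 1)
        feat = feat.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 16)
        out, _ = self.lstm(feat)                        # (B, T, 64)
        return self.classifier(out)                     # per-frame class logits

model = DenseNet3DLSTM().eval()
stream = torch.randn(1, 3, 32, 112, 112)                # fake 32-frame RGB stream
with torch.no_grad():
    probs = model(stream).softmax(dim=-1)[0]            # (T, num_classes)

in_gesture = False
for t, frame_probs in enumerate(probs):
    conf, cls = frame_probs[:-1].max(dim=0)             # ignore the blank class
    if not in_gesture and conf > PROB_THRESHOLD:
        in_gesture = True                               # probability crossed the threshold: gesture starts
        print(f"frame {t}: gesture {cls.item()} starts (p={conf.item():.2f})")
    elif in_gesture and conf <= PROB_THRESHOLD:
        in_gesture = False                              # probability dropped: gesture ends
        print(f"frame {t}: gesture ends")

Because the paper trains with connectionist temporal classification, the sketch assumes an extra blank (no-gesture) class; frames whose non-blank confidence stays below the threshold are treated as containing no gesture.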

Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.


Funding

This work was partly supported by the National Natural Science Foundation of China (Grant No. 61731001) and the Natural Science Foundation of Zhejiang Province (Grant No. LY21E050017).

Author information

Corresponding author

Correspondence to Zhi Lu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lu, Z., Qin, S., Lv, P. et al. Real-time continuous detection and recognition of dynamic hand gestures in untrimmed sequences based on end-to-end architecture with 3D DenseNet and LSTM. Multimed Tools Appl 83, 16275–16312 (2024). https://doi.org/10.1007/s11042-023-16130-1
