Abstract
With the continued development of deep learning, new gesture recognition approaches keep emerging and their performance keeps improving. However, most methods focus on recognizing isolated gestures, and the detection and recognition of continuous gestures are rarely studied. To address real-time detection and classification of dynamic gestures in untrimmed sequences, this paper proposes an end-to-end architecture based on variants of 3D DenseNet and a unidirectional LSTM to extract discriminative spatio-temporal features from untrimmed hand gesture sequences. The network is trained with connectionist temporal classification (CTC) on a publicly available dataset, so that what it learns from a large set of gesture samples can be transferred to strengthen its learning ability. The class-conditional probability that an incoming sequence belongs to a given gesture class is then predicted and compared with a predefined threshold to automatically determine the start and end of each gesture. In addition, to improve classification accuracy on segmented gestures, a bidirectional LSTM network models the temporal information using both past and future frames. Finally, a continuous gesture dataset collected indoors for a specific application is introduced to validate the proposed method. On this challenging dataset, the 3D DenseNet-LSTM model achieves real-time early detection and classification on unsegmented gesture sequences, and the 3D DenseNet-BiLSTM model achieves an accuracy of 92.06% on segmented gestures, as well as classification accuracies of 89.8% and 99.7% on the nvGesture and SKIG public datasets, respectively. The experimental results demonstrate the advantages of the proposed method in detection and classification performance and in real-time response speed.
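The threshold-based detection step described in the abstract can be illustrated with a minimal sketch. It assumes the network emits, for every frame, a probability per class with index 0 reserved for a "no gesture" (blank) class. The function name `segment_gestures`, the blank index, and the default threshold are hypothetical choices made for illustration; the abstract does not specify the exact decision rule used in the paper.

```python
from typing import List, Sequence, Tuple


def segment_gestures(
    probs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed value, not taken from the paper
    blank: int = 0,          # assumed index of the "no gesture" class
) -> List[Tuple[int, int, int]]:
    """Return (start_frame, end_frame, class_id) for each detected gesture.

    A gesture starts when the most probable non-blank class reaches the
    threshold, and ends when that class drops below it or another class
    takes over.
    """
    segments: List[Tuple[int, int, int]] = []
    start = None
    current = None
    for t, frame in enumerate(probs):
        # Most probable gesture class for this frame, ignoring the blank class.
        best = max(
            (c for c in range(len(frame)) if c != blank),
            key=lambda c: frame[c],
        )
        active = frame[best] >= threshold
        if active and start is None:
            # Probability crossed the threshold: a gesture begins here.
            start, current = t, best
        elif start is not None and (not active or best != current):
            # Probability fell below the threshold or the class changed:
            # the running gesture ended on the previous frame.
            segments.append((start, t - 1, current))
            start, current = (t, best) if active else (None, None)
    if start is not None:
        # The sequence ended while a gesture was still active.
        segments.append((start, len(probs) - 1, current))
    return segments


if __name__ == "__main__":
    frame_probs = [
        [0.90, 0.05, 0.05],
        [0.20, 0.70, 0.10],
        [0.10, 0.80, 0.10],
        [0.80, 0.10, 0.10],
        [0.10, 0.10, 0.80],
    ]
    print(segment_gestures(frame_probs, threshold=0.6))
```

Because each frame is processed once and only the current segment state is kept, a rule of this kind can run online on a stream of per-frame predictions, which fits the real-time setting the abstract describes.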
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Funding
This work was partly supported by the National Natural Science Foundation of China (Grant No. 61731001) and the Natural Science Foundation of Zhejiang Province (Grant No. LY21E050017).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Lu, Z., Qin, S., Lv, P. et al. Real-time continuous detection and recognition of dynamic hand gestures in untrimmed sequences based on end-to-end architecture with 3D DenseNet and LSTM. Multimed Tools Appl 83, 16275–16312 (2024). https://doi.org/10.1007/s11042-023-16130-1
DOI: https://doi.org/10.1007/s11042-023-16130-1