Real-time continuous detection and recognition of dynamic hand gestures in untrimmed sequences based on end-to-end architecture with 3D DenseNet and LSTM

Lu, Zhi; Qin, Shiyin; Lv, Pin; Sun, Liguo; Tang, Bo

doi:10.1007/s11042-023-16130-1

Real-time continuous detection and recognition of dynamic hand gestures in untrimmed sequences based on end-to-end architecture with 3D DenseNet and LSTM

Published: 14 July 2023

Volume 83, pages 16275–16312, (2024)
Cite this article

Multimedia Tools and Applications Aims and scope Submit manuscript

Zhi Lu ORCID: orcid.org/0000-0002-0440-9175¹,
Shiyin Qin^2,3,
Pin Lv¹,
Liguo Sun¹ &
…
Bo Tang⁴

265 Accesses
1 Citation
Explore all metrics

Abstract

With the continuous development of the deep learning theory, novel gesture recognition approaches have been constantly emerging, and the performance has also been continuously improved. However, most research methods focus on the recognition of isolated gestures, and the detection and recognition of continuous gestures are rarely studied. To this end, aiming at the real-time detection and classification of dynamic gestures in untrimmed sequences, a well-designed end-to-end architecture based on the variants of 3D DenseNet and unidirectional LSTM was hereby proposed as an effective tool to extract the discriminative spatio-temporal features of untrimmed hand gesture sequences. Then, connectionist temporal classification was combined to train the network on a publicly available dataset, and some effective capacities could be transferred to enhance the learning ability of the proposed network by training large gesture samples. In this way, the class-conditional probability of an incoming sequence belonging to a given gesture class was predicted and then compared with a predefined threshold to automatically determine the start and end of gestures. In addition, to enhance the classification accuracy of segmented gestures, a bidirectional LSTM network was utilized to model the temporal information, with both the past frames and the future ones taken into account. Finally, a continuous gesture dataset collected indoors for specific application was introduced to validate the proposed method. On this challenge dataset, the 3D DenseNet-LSTM model achieves real-time early detection and classification tasks on unsegmented gesture sequences, and the 3D DenseNet-BiLSTM not only achieves an accuracy of 92.06\(\%\) on segmented gestures, but also a classification accuracy of 89.8\(\%\) and 99.7\(\%\) on nvGesture and SKIG public datasets, respectively. The experimental results demonstrate the performance advantages of the detection and classification as well as the real-time response speed.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 9

HCRNN: A Novel Architecture for Fast Online Handwritten Stroke Classification

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

Article 10 June 2021

Human Activity Recognition (HAR) Using Deep Learning: Review, Methodologies, Progress and Future Research Directions

Article 12 August 2023

Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

References

Amin MG, Zhang YD, Ahmad F, Ho KD (2016) Radar signal processing for elderly fall detection: the future for in-home monitoring. IEEE Signal Process Mag 33(2):71–80
Article Google Scholar
Badrinarayanan V, Kendall A, Cipolla R (2017) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal 39:2481–2495
Article Google Scholar
Barron O, Raison M, Gaudet G, Achiche S (2020) Recurrent neural network for electromyographic gesture recognition in transhumeral amputees. Appl Soft Comput 96:1–9
Article Google Scholar
Bridle JS (1990) Probabilistic interpretation of feed forward classification network outputs, with relationships to statistical pattern recognition. Neurocomputing 68:227–236
Article MathSciNet Google Scholar
Carrara F, Elias P, Sedmidubsky J, Zezula P (2019) LSTM-based real-time action detection and prediction in human motion streams. Multimed Tools Appl 78(2):27309–27331
Article Google Scholar
Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: CVPR. pp 6299–6308
Chai X, Liu Z, Yin F, Liu Z, Chen X (2017) Two streams recurrent neural networks for large-scale continuous gesture recognition. In: ICPR. pp 31–36
Chalasani, T., Smolic, A.: Simultaneous segmentation and recognition: Towards more accurate ego gesture recognition. In: ICCV. pp 4367–4375 (2019)
Dhingra N, Kunz A (2019) Res3ATN-deep 3D residual attention network for hand gesture recognition in videos. In: 2019 International Conference on 3D Vision. pp 491–501
Duric Z, Gray WD, Heishman R, Fayin L, Rosenfeld A, Schoelles MJ, Schunn C, Wechsler H (2002) Integrating perceptual and cognitive modeling for adaptive and intelligent human-computer interaction. P IEEE 90(7):1272–1289
Article Google Scholar
Farneback G (2003) Two-frame motion estimation based on polynomial expansion. Scandinavian Conference on Image Analysis 363–370
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR. pp 580–587
Graves A, Fernández S, Gomez F, Schmidhuber J (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: ICML. pp 369–376
Hadfield S, Bowden R (2012) Supervised sequence labelling with recurrent neural networks. Stud Computat Intell 385:5–13
Article MathSciNet Google Scholar
Haghighat M, Abdel-Mottaleb M, Alhalabi W (2016) Discriminant correlation analysis: real-time feature level fusion for multimodal biometric recognition. IEEE Trans Inf Foren Sec 11:1984–1996
Article Google Scholar
Huang G, Liu Z, Van der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: CVPR. pp 2261–2269
Kingma DP, Ba JL (2015) Adam: a method for stochastic optimization. In: International Conference on Learning Representations. pp 1–15
Köpüklü O, Gunduz A, Kose N, Rigoll G (2020) Online dynamic hand gesture recognition including efficiency analysis. IEEE Trans Biom Behav Identity Sci 2(2):85–97
Article Google Scholar
Köpüklü O, Gunduz A, Kose N, Rigoll G (2019) Real-time hand gesture detection and classification using convolutional neural networks. In: 14th IEEE International Conference on Automatic Face and Gesture Recognition. pp 1–8
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Proces Syst 1106–1114
Liu Z, Chai X, Liu Z, Chen X (2017) Continuous gesture recognition with hand-oriented spatiotemporal feature. In: ICPR. pp 3056–3064
Liu L, Shao L (2013) Learning discriminative representations from RGB-D video data. In: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. pp 1493–1500
Lu Z, Qin S, Li X, Li L, Zhang D (2019) One-shot learning hand gesture recognition based on modified 3D convolutional neural networks. Mach Vision Appl 30(3):1157–1180
Article Google Scholar
Lu Z, Qin S, Li L, Zhang D, Xu K, Hu Z (2019) One-shot learning hand gesture recognition based on lightweight 3D convolutional neural networks for portable applications on mobile systems. IEEE Access 7:131732–131748
Article Google Scholar
Molchanov P, Gupta S, Kim K, Pulli K (2015) Multi-sensor system for driver’s hand gesture recognition. In: 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. pp 1–8
Molchanov P, Yang X, Gupta S, Kim K, Tyree S, Kautz J (2016) Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks. In: NIPS. pp 4207–4215
Murakami K, Taguchi H (1991) Gesture recognition using recurrent neural networks. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. pp 237–242
Narayana P, Beveridge JR, Draper BA (2018) Gesture recognition: focus on the hands. In: CVPR. pp 5235–5244
Nishida N, Nakayama H (2015) Multimodal gesture recognition using multi-stream recurrent neural network. In: Pacific-Rim Symposium on Image and Video Technology. pp 682–694
Núñez JC, Cabido R, Pantrigo JJ, Montemayor AS, Vélez JF (2018) Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition. Pattern Recogn 76:80–94
Article Google Scholar
Ohn-Bar E, Trivedi MM (2014) Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations. IEEE Trans Intell Trans 15:1–10
Google Scholar
Park E, Han X, Berg TL, Berg AC (2016) Combining multiple sources of knowledge in deep CNNs for action recognition. IEEE Winter Conf Appl Comput Vis 1–8
Ren S, He K, Girshick R, Sun J (2017) Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal 39:1137–1149
Article Google Scholar
Ronnebergerhick O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp 234–241
Ryoo MS (2011) Human activity prediction: early recognition of ongoing activities from streaming videos. In: ICCV. pp 1036–1043
Shelhamer E, Long J, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: CVPR. pp 3431–3440
Shou Z, Wang D, Chang SF (2016) Temporal action localization in untrimmed videos via multi-stage CNNs. In: CVPR. pp 1049–1058
Simonyan K, Zisserman A (2017) Two-stream convolutional networks for action recognition in videos. In: NIPS. pp 568–576
Song S, Lan C, Xing J, Zeng W, Liu J (2016) An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. pp 4263–4270
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2005) A new method of feature fusion and its application in image recognition. Pattern Recogn 38(12):2437–2448
Article Google Scholar
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: ICCV. pp 4489–4497
Tung PT, Ngoc LQ (2014) Elliptical density shape model for hand gesture recognition. In: Proceedings of the Fifth Symposium on Information and Communication Technology. pp 186–191
Twentybn Jester Dataset (2017) A hand gesture dataset. https://www.twentybn.com/datasets/jester
Wang H, Oneata D, Verbeek J, Schmid C (2016) A robust and efficient video representation for action recognition. Int J Comput Vision 119:219–238
Article MathSciNet Google Scholar
Wang Y, Yu T, Shi L, Li Z (2008) Using human body gestures as inputs for gaming via depth analysis. In: Proceedings of the IEEE International Conference on Multimedia and Expo. pp 993–996
Wu D, Pigou L, Kindermans PJ, Le N, Shao L, Dambre J, Odobez JM (2016) Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Trans Pattern Anal 38(8):1583–1597
Article Google Scholar
Yang HD, Lee SW (2013) Robust sign language recognition by combining manual and non-manual features based on conditional random field and support vector machine. Pattern Recogn Lett 34(16):2051–2056
Article Google Scholar
Yang W, Wang Y, Mori G (2009) Large-scale multimodal gesture segmentation and recognition based on convolutional neural networks. In: ICCV. pp 3138–3146
Zhang X, Li X (2016) Dynamic gesture recognition based on MEMP network. Future Internet 11:91–101
Article Google Scholar
Zhang E, Xue B, Cao F, Duan J, Lin G, Lei Y (2019) Fusion of 2D CNN and 3D DenseNet for dynamic gesture recognition. Electronics 8:1511–1525
Article Google Scholar
Zhang L, Zhu G, Shen P, Song J (2017) Learning spatiotemporal features using 3D CNN and convolutional LSTM for gesture recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp 3120–3128
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: CVPR. pp 2881–2890
Zhao Y, Xiong Y, Wang L, Wu Z, Lin D, Tang X (2017) Temporal action detection with structured segment networks. In: ICCV. pp 2933–2942
Zhu G, Zhang L, Shen P, Song J (2017) Multimodal gesture recognition using 3-D convolution and convolutional LSTM. IEEE Access 5:4517–4524
Article Google Scholar

Download references

Funding

The paper is partly supported by National Natural Science Foundation of China (Grant No. 61731001) and Natural Science Foundation of Zhejiang Province (Grant No. LY21E050017).

Author information

Authors and Affiliations

Institute of Automation, Chinese Academy of Sciences, No.95, Zhongguancun East Road, Beijing, 100190, HaiDian District, China
Zhi Lu, Pin Lv & Liguo Sun
School of Automation Science and Electrical Engineering, Beihang University, No. 37, Xueyuan Road, Beijing, 100191, HaiDian District, China
Shiyin Qin
School of Electrical Engineering & Intelligentization, Dongguan University of Technology, Dongguan, 523808, China
Shiyin Qin
College of Metrology and Measurement Engineering, China Jiliang University, Hangzhou, Zhejiang, 310018, China
Bo Tang

Authors

Zhi Lu
View author publications
You can also search for this author in PubMed Google Scholar
Shiyin Qin
View author publications
You can also search for this author in PubMed Google Scholar
Pin Lv
View author publications
You can also search for this author in PubMed Google Scholar
Liguo Sun
View author publications
You can also search for this author in PubMed Google Scholar
Bo Tang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Zhi Lu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Lu, Z., Qin, S., Lv, P. et al. Real-time continuous detection and recognition of dynamic hand gestures in untrimmed sequences based on end-to-end architecture with 3D DenseNet and LSTM. Multimed Tools Appl 83, 16275–16312 (2024). https://doi.org/10.1007/s11042-023-16130-1

Download citation

Received: 24 August 2022
Revised: 31 May 2023
Accepted: 26 June 2023
Published: 14 July 2023
Issue Date: February 2024
DOI: https://doi.org/10.1007/s11042-023-16130-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Real-time continuous detection and recognition of dynamic hand gestures in untrimmed sequences based on end-to-end architecture with 3D DenseNet and LSTM

Abstract

Access this article

Similar content being viewed by others

HCRNN: A Novel Architecture for Fast Online Handwritten Stroke Classification

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

Human Activity Recognition (HAR) Using Deep Learning: Review, Methodologies, Progress and Future Research Directions

Data availability

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Real-time continuous detection and recognition of dynamic hand gestures in untrimmed sequences based on end-to-end architecture with 3D DenseNet and LSTM

Abstract

Access this article

Similar content being viewed by others

HCRNN: A Novel Architecture for Fast Online Handwritten Stroke Classification

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

Human Activity Recognition (HAR) Using Deep Learning: Review, Methodologies, Progress and Future Research Directions

Data availability

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation