
Towards an end-to-end isolated and continuous deep gesture recognition process

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Behavior recognition in video sequences is a challenging task in computer vision. Human action recognition allows machines to analyze and understand human activities from input data sources such as sensors and multimedia content. It has been widely applied in public security, video surveillance, robotics, video game control, and industrial automation. In these applications, however, gestures are formed by complex temporal sequences of motions. This representation raises many problems, such as variations in performer style, body shape, and size, and even in the way the same signer performs a motion, as well as problems related to the scene environment, such as illumination changes and background complexity. Moreover, processing large-scale continuous gesture data makes these problems even more challenging, because the boundaries of each gesture embedded in the continuous sequence are unknown. In this work, we present a novel deep architecture for isolated action and large-scale continuous hand gesture recognition from RGB, depth, and grayscale video sequences. The proposed approach, called End-to-End Deep Gesture Recognition Process (E2E-DGRP), builds a new representation of the data by extracting gesture characteristics and then analyzes each sequence to detect and recognize the corresponding actions. First, a segmentation module spots the start and end boundaries of each gesture and splits continuous gesture sequences into isolated gestures using a motion indicator method. Then, a set of discriminative features characterizing motion location, velocity, and orientation is generated with a proposed method named "Deep Signature". The generated features are fed to an additional neural network layer, a Softmax output layer, for gesture classification. The performance of our approach is demonstrated by experiments on three well-known datasets: KTH and Weizmann for isolated actions in the RGB and grayscale modalities, and the ChaLearn LAP ConGD dataset for continuous gestures in depth. A first experimental study shows that the recognition results obtained by our method outperform those of previously published approaches, reaching 97.5% and 98.7% accuracy on the KTH and Weizmann datasets, respectively. A second experimental study, conducted on the ChaLearn LAP ConGD dataset, shows that our strategy outperforms other methods, reaching a mean Jaccard index of 0.6661.
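The continuous-gesture pipeline summarized above first spots gesture boundaries with a motion indicator and is evaluated with a frame-overlap Jaccard score. The paper's exact motion indicator and "Deep Signature" features are not reproduced here; the following is a minimal NumPy sketch, assuming a simple frame-difference energy as the motion indicator and frame-level interval overlap for the Jaccard score. The function names (`motion_indicator`, `segment_gestures`, `jaccard_index`) and thresholds are illustrative, not taken from the paper.

```python
import numpy as np

def motion_indicator(frames):
    """Per-frame motion energy: mean absolute difference between consecutive
    grayscale/depth frames (array of shape T x H x W). Illustrative stand-in
    for the paper's motion indicator."""
    frames = frames.astype(np.float32)
    energy = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    return np.concatenate([[0.0], energy])  # pad so there is one value per frame

def segment_gestures(frames, threshold=5.0, min_len=8):
    """Spot (start, end) boundaries of segments whose motion energy stays above
    a threshold, discarding segments shorter than min_len frames."""
    active = motion_indicator(frames) > threshold
    boundaries, start = [], None
    for t, is_active in enumerate(active):
        if is_active and start is None:
            start = t
        elif not is_active and start is not None:
            if t - start >= min_len:
                boundaries.append((start, t))
            start = None
    if start is not None and len(active) - start >= min_len:
        boundaries.append((start, len(active)))
    return boundaries

def jaccard_index(pred, gt, n_frames):
    """Frame-level intersection-over-union between a predicted and a ground-truth
    (start, end) interval, as used to score continuous gesture spotting."""
    p = np.zeros(n_frames, dtype=bool)
    g = np.zeros(n_frames, dtype=bool)
    p[pred[0]:pred[1]] = True
    g[gt[0]:gt[1]] = True
    union = (p | g).sum()
    return float((p & g).sum()) / union if union else 0.0
```

On a benchmark such as ChaLearn ConGD, per-gesture scores of this kind are averaged over the test sequences to give a mean Jaccard index, which is how the 0.6661 figure quoted above should be read; the exact averaging protocol is defined by the challenge organizers, not by this sketch.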


Notes

  1. http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/ (a short softmax sketch follows these notes).

  2. http://www.nada.kth.se/cvap/actions/.

  3. http://www.wisdom.weizmann.ac.il/vision/SpaceTimeActions.html.

  4. http://www.cbsr.ia.ac.cn/users/jwan/database/congd.html.
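Note 1 above points to the Stanford UFLDL tutorial on softmax regression, which is the kind of output layer the abstract places on top of the extracted features. As a reminder of how such a layer works, here is a minimal, numerically stable NumPy sketch; the shapes and names are illustrative and not taken from the paper.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def classify(features, W, b):
    """Linear projection of a batch of feature vectors followed by softmax:
    the kind of output layer used for gesture classification."""
    probs = softmax(features @ W + b)   # shape: (batch, n_classes)
    return probs.argmax(axis=1), probs
```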


Author information


Corresponding author

Correspondence to Rihem Mahmoud.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Mahmoud, R., Belgacem, S. & Omri, M. Towards an end-to-end isolated and continuous deep gesture recognition process. Neural Comput & Applic 34, 13713–13732 (2022). https://doi.org/10.1007/s00521-022-07165-w

