
ZS-GR: zero-shot gesture recognition from RGB-D videos

Published in Multimedia Tools and Applications

Abstract

Gesture Recognition (GR) is a challenging research area in computer vision. To tackle the annotation bottleneck in GR, we formulate the problem of Zero-Shot Gesture Recognition (ZS-GR) and propose a two-stream model operating on two input modalities: RGB and depth videos. To benefit from the capabilities of vision Transformers, we employ two of them: one for human detection and one for visual feature representation. Specifically, we configure a Transformer encoder-decoder architecture as a fast and accurate human detection model that overcomes the shortcomings of current detectors. Guided by human keypoints, the detected human body is segmented into nine parts. A spatio-temporal representation of the body is then obtained using a vision Transformer and an LSTM network. Finally, a semantic space maps the visual features to the language embeddings of the class labels produced by a Bidirectional Encoder Representations from Transformers (BERT) model. We evaluate the proposed model on five datasets, Montalbano II, MSR Daily Activity 3D, CAD-60, NTU-60, and isoGD, obtaining results that surpass state-of-the-art ZS-GR models as well as Zero-Shot Action Recognition (ZS-AR) methods.
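To make the pipeline concrete, the following is a minimal PyTorch-style sketch of the two-stream architecture described above: per-frame vision Transformer features summarized over time by an LSTM, with the fused RGB and depth clip features projected into the BERT embedding space of the class labels for zero-shot scoring. All module names, dimensions, and the cosine-similarity scoring rule are illustrative assumptions, not the authors' exact implementation; the detection stage and the nine-part body segmentation are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZSGRStream(nn.Module):
    """One stream (RGB or depth): per-frame ViT features -> LSTM over time.

    Hypothetical sketch; the ViT backbone is assumed to map a batch of
    frames to one feature vector of size `feat_dim` per frame.
    """

    def __init__(self, vit: nn.Module, feat_dim: int = 768, hidden_dim: int = 512):
        super().__init__()
        self.vit = vit
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        feats = self.vit(frames.flatten(0, 1))   # (b*t, feat_dim)
        feats = feats.view(b, t, -1)             # (b, t, feat_dim)
        _, (h_n, _) = self.lstm(feats)           # last hidden state summarizes the clip
        return h_n[-1]                           # (b, hidden_dim)

class SemanticProjection(nn.Module):
    """Maps fused two-stream features into the BERT label-embedding space."""

    def __init__(self, visual_dim: int = 1024, bert_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(visual_dim, bert_dim)

    def forward(self, rgb_feat, depth_feat, class_embeddings):
        # class_embeddings: (num_classes, bert_dim), precomputed BERT
        # embeddings of the class-label text (seen or unseen classes).
        visual = torch.cat([rgb_feat, depth_feat], dim=-1)  # two-stream fusion
        visual = self.proj(visual)                          # into semantic space
        visual = F.normalize(visual, dim=-1)
        classes = F.normalize(class_embeddings, dim=-1)
        return visual @ classes.T                           # (batch, num_classes)
```

At test time, a gesture from a class unseen during training can be recognized by selecting the label whose BERT embedding receives the highest similarity score, which is what makes the recognition zero-shot.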



Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information


Corresponding author

Correspondence to Razieh Rastgoo.

Ethics declarations

Consent for Publication

All authors confirm their consent for publication.

Conflict of interest

The authors certify that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Rastgoo, R., Kiani, K. & Escalera, S. ZS-GR: zero-shot gesture recognition from RGB-D videos. Multimed Tools Appl 82, 43781–43796 (2023). https://doi.org/10.1007/s11042-023-15112-7

