
ZS-GR: zero-shot gesture recognition from RGB-D videos

Published in Multimedia Tools and Applications

Abstract

Gesture Recognition (GR) is a challenging research area in computer vision. To tackle the annotation bottleneck in GR, we formulate the problem of Zero-Shot Gesture Recognition (ZS-GR) and propose a two-stream model operating on two input modalities: RGB and depth videos. To benefit from the capabilities of vision Transformers, we employ two of them: one for human detection and one for visual feature representation. Specifically, we configure a Transformer encoder-decoder architecture as a fast and accurate human detection model that overcomes the shortcomings of current detectors. Guided by human keypoints, the detected human body is segmented into nine parts. A spatio-temporal representation of the body is then obtained using a vision Transformer and an LSTM network. Finally, a semantic space maps the visual features to the language embeddings of the class labels produced by a Bidirectional Encoder Representations from Transformers (BERT) model. We evaluate the proposed model on five datasets, Montalbano II, MSR Daily Activity 3D, CAD-60, NTU-60, and isoGD, obtaining results that surpass state-of-the-art ZS-GR models as well as Zero-Shot Action Recognition (ZS-AR) methods.
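To make the pipeline concrete, the following is a minimal PyTorch-style sketch of the two-stream architecture described above: per-frame vision Transformer features summarized over time by an LSTM, with the fused RGB and depth clip features projected into the BERT embedding space of the class labels for zero-shot scoring. All module names, dimensions, and the cosine-similarity scoring rule are illustrative assumptions, not the authors' exact implementation; the detection stage and the nine-part body segmentation are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZSGRStream(nn.Module):
    """One stream (RGB or depth): per-frame ViT features -> LSTM over time.

    Hypothetical sketch; the ViT backbone is assumed to map a batch of
    frames to one feature vector of size `feat_dim` per frame.
    """

    def __init__(self, vit: nn.Module, feat_dim: int = 768, hidden_dim: int = 512):
        super().__init__()
        self.vit = vit
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        feats = self.vit(frames.flatten(0, 1))   # (b*t, feat_dim)
        feats = feats.view(b, t, -1)             # (b, t, feat_dim)
        _, (h_n, _) = self.lstm(feats)           # last hidden state summarizes the clip
        return h_n[-1]                           # (b, hidden_dim)

class SemanticProjection(nn.Module):
    """Maps fused two-stream features into the BERT label-embedding space."""

    def __init__(self, visual_dim: int = 1024, bert_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(visual_dim, bert_dim)

    def forward(self, rgb_feat, depth_feat, class_embeddings):
        # class_embeddings: (num_classes, bert_dim), precomputed BERT
        # embeddings of the class-label text (seen or unseen classes).
        visual = torch.cat([rgb_feat, depth_feat], dim=-1)  # two-stream fusion
        visual = self.proj(visual)                          # into semantic space
        visual = F.normalize(visual, dim=-1)
        classes = F.normalize(class_embeddings, dim=-1)
        return visual @ classes.T                           # (batch, num_classes)
```

At test time, a gesture from a class unseen during training can be recognized by selecting the label whose BERT embedding receives the highest similarity score, which is what makes the recognition zero-shot.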



Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information


Corresponding author

Correspondence to Razieh Rastgoo.

Ethics declarations

Consent for Publication

All authors confirm their consent for publication.

Conflict of interest

The authors certify that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Rastgoo, R., Kiani, K. & Escalera, S. ZS-GR: zero-shot gesture recognition from RGB-D videos. Multimed Tools Appl 82, 43781–43796 (2023). https://doi.org/10.1007/s11042-023-15112-7

