Abstract
Gesture Recognition (GR) is a challenging research area in computer vision. To tackle the annotation bottleneck in GR, we formulate the problem of Zero-Shot Gesture Recognition (ZS-GR) and propose a two-stream model that takes two input modalities: RGB and Depth videos. To benefit from the capabilities of vision Transformers, we use two such models, one for human detection and one for visual feature representation. We configure a Transformer encoder-decoder architecture as a fast and accurate human detection model to overcome the shortcomings of current human detection methods. Guided by human keypoints, the detected human body is segmented into nine parts. A spatio-temporal representation of the human body is obtained using a vision Transformer and an LSTM network. A semantic space then maps the visual features to the lingual embeddings of the class labels via a Bidirectional Encoder Representations from Transformers (BERT) model. We evaluated the proposed model on five datasets, Montalbano II, MSR Daily Activity 3D, CAD-60, NTU-60, and isoGD, obtaining results that surpass state-of-the-art ZS-GR models as well as Zero-Shot Action Recognition (ZS-AR) methods.
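The pipeline described above can be summarized as a minimal NumPy sketch. Everything here is a hypothetical stand-in, not the authors' implementation: `vit_features` stands in for the vision Transformer backbone, `temporal_pool` for the LSTM, the random matrix `W` for the learned semantic-space projection, and `label_embeds` for BERT embeddings of class labels. Only the data flow (two streams, fusion, projection, nearest label embedding) mirrors the paper.

```python
import numpy as np

# Hedged sketch of the ZS-GR data flow with random stand-ins for the
# learned components (ViT, LSTM, projection, BERT label embeddings).
rng = np.random.default_rng(0)
FEAT_DIM, EMBED_DIM = 64, 32

def vit_features(frames):
    """Stand-in for the vision Transformer: one feature vector per frame."""
    return rng.standard_normal((len(frames), FEAT_DIM))

def temporal_pool(feats):
    """Stand-in for the LSTM: mean over time as a crude temporal summary."""
    return feats.mean(axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(rgb_frames, depth_frames, W, label_embeds):
    """Fuse the RGB and Depth streams, project into the semantic space,
    and pick the class whose label embedding is most similar."""
    fused = np.concatenate([temporal_pool(vit_features(rgb_frames)),
                            temporal_pool(vit_features(depth_frames))])
    video_vec = W @ fused  # semantic-space projection (random here)
    return max(label_embeds, key=lambda c: cosine(video_vec, label_embeds[c]))

# Toy demo: unseen classes are described only by their label embeddings,
# which is what makes classification "zero-shot".
W = rng.standard_normal((EMBED_DIM, 2 * FEAT_DIM))
label_embeds = {c: rng.standard_normal(EMBED_DIM)
                for c in ("wave", "point", "clap")}
pred = zero_shot_classify(range(16), range(16), W, label_embeds)
```

The key design point this sketch preserves is that the classifier never sees visual examples of the test classes: it only compares a projected video vector against language-derived label embeddings, so new gesture classes can be added by embedding their names.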
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Consent for Publication
All authors confirm their consent for publication.
Conflict of Interests
The authors certify that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Rastgoo, R., Kiani, K. & Escalera, S. ZS-GR: zero-shot gesture recognition from RGB-D videos. Multimed Tools Appl 82, 43781–43796 (2023). https://doi.org/10.1007/s11042-023-15112-7