Hand pose aware multimodal isolated sign language recognition

Rastgoo, Razieh; Kiani, Kourosh; Escalera, Sergio

doi:10.1007/s11042-020-09700-0

Published: 01 September 2020

Multimedia Tools and Applications, volume 80, pages 127–163 (2021)

Abstract

Isolated hand sign language recognition from video is a challenging research area in computer vision. Some of the most important challenges in this area include hand occlusion, fast hand movement, illumination changes, and background complexity. While most state-of-the-art results in the field have been achieved with deep learning-based models, these challenges are not completely solved. In this paper, we propose a hand pose aware model for isolated hand sign language recognition that applies deep learning to two input modalities, RGB and depth videos. Four spatial feature types (pixel-level, flow, deep hand, and hand pose features), fused from both visual modalities, are input to an LSTM for temporal sign recognition. Optical Flow (OF) provides the flow information for RGB video inputs, while Scene Flow (SF) is used for depth video inputs. By including hand pose features, we show a consistent performance improvement of the sign language recognition model. To the best of our knowledge, this is the first time that these discriminative spatiotemporal features, benefiting from hand pose estimation features and multi-modal inputs, are fused for isolated hand sign language recognition. We perform a step-by-step analysis of the impact on recognition performance of the hand pose features, different combinations of the spatial features, and different recurrent models, especially LSTM and GRU. Results on four public datasets confirm that the proposed model outperforms the current state-of-the-art models on the Montalbano II, MSR Daily Activity 3D, and CAD-60 datasets, with relative accuracy improvements of 1.64%, 6.5%, and 7.6%, respectively. Furthermore, our model obtains competitive results on the isoGD dataset, only 0.22% below the current state-of-the-art model.
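
The pipeline summarized in the abstract (per-frame spatial features fused and then passed to a recurrent model) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how the features are fused, so concatenation is assumed here, and the feature dimensions, the hidden size, the number of classes, the random weights, and the names `fuse_features` and `TinyLSTM` are all hypothetical choices made for illustration. The sketch only shows the data flow from four feature streams to a per-video class distribution.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse_features(streams):
    """Fuse per-frame feature streams by concatenation (an assumed fusion rule).

    Each stream has shape (T, d_k); the result has shape (T, sum of d_k).
    """
    return np.concatenate(streams, axis=-1)


class TinyLSTM:
    """Single-layer LSTM with a softmax classifier on the last hidden state.

    Weights are random and untrained; this only demonstrates the forward pass.
    """

    def __init__(self, input_dim, hidden_dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden_dim = hidden_dim
        # One weight matrix for the input, forget, output, and candidate gates.
        self.W = rng.normal(0.0, 0.1, (input_dim + hidden_dim, 4 * hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.W_out = rng.normal(0.0, 0.1, (hidden_dim, num_classes))
        self.b_out = np.zeros(num_classes)

    def forward(self, sequence):
        h = np.zeros(self.hidden_dim)  # hidden state
        c = np.zeros(self.hidden_dim)  # cell state
        for x_t in sequence:
            z = np.concatenate([x_t, h]) @ self.W + self.b
            i, f, o, g = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
        logits = h @ self.W_out + self.b_out
        exp = np.exp(logits - logits.max())  # numerically stable softmax
        return exp / exp.sum()


# Hypothetical per-frame features for a video of T frames (random stand-ins).
T = 8
rng = np.random.default_rng(1)
features = {
    "pixel": rng.normal(size=(T, 16)),      # pixel-level features
    "flow": rng.normal(size=(T, 8)),        # optical flow / scene flow features
    "deep_hand": rng.normal(size=(T, 16)),  # deep hand features
    "hand_pose": rng.normal(size=(T, 63)),  # e.g. 21 joints x 3 coordinates (assumed)
}

fused = fuse_features(list(features.values()))
model = TinyLSTM(input_dim=fused.shape[1], hidden_dim=32, num_classes=20)
probs = model.forward(fused)
print(fused.shape, probs.shape)
```

In a real system each stream would come from a trained extractor applied to the RGB and depth frames, and the recurrent weights would be learned. Swapping the LSTM cell for a GRU cell, as the abstract's comparison suggests, would change only the gate computations inside the loop.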



Acknowledgements

This work has been partially supported by the Spanish project PID2019-105093GB-I00 (MINECO/FEDER, UE), the CERCA Programme/Generalitat de Catalunya, ICREA under the ICREA Academia programme, and the High Intelligent Solution (HIS) company in Iran. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan XP GPU used for this research. We also thank the Deaf centers in Semnan and Tehran in Iran and the Computer Vision Center (CVC) in Spain for their collaboration.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Electrical and Computer Engineering Department, Semnan University, Semnan, 3513119111, Iran
Razieh Rastgoo & Kourosh Kiani
Department of Mathematics and Informatics, Universitat de Barcelona, and Computer Vision Center, 585, 08007, Barcelona, Spain
Sergio Escalera

Corresponding author

Correspondence to Kourosh Kiani.

Ethics declarations

Conflict of interest

The authors certify that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Rastgoo, R., Kiani, K. & Escalera, S. Hand pose aware multimodal isolated sign language recognition. Multimed Tools Appl 80, 127–163 (2021). https://doi.org/10.1007/s11042-020-09700-0

Received: 21 March 2020
Revised: 09 July 2020
Accepted: 21 August 2020
Published: 01 September 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s11042-020-09700-0
