TIM-SLR: a lightweight network for video isolated sign language recognition

  • Original Article
  • Published in: Neural Computing and Applications (2023)

Abstract

Research on video isolated sign language recognition (SLR) has made rapid progress, but several problems still demand solutions. On the one hand, traditional sign language acquisition equipment is expensive and difficult to carry. Sign language data collected with Kinect contain rich information but are complicated to use, while data acquired by RGB cameras suit practical applications; however, existing sign language datasets collected with RGB cameras suffer from few demonstrators and small vocabularies. On the other hand, most existing SLR methods rely on complex network structures to achieve high accuracy, and complex networks mean longer inference times, which are unsuitable for practical application scenarios. In this paper, we propose CSLD, a large-scale Chinese isolated sign language dataset collected with an RGB camera, covering 400 words, each performed 10 times by each of 30 demonstrators. In addition, we propose TIM-SLR, a lightweight network. To verify the network's lightness and validity, we conducted experiments not only on the sign language datasets CSLD and LSA64, obtaining 91.6% and 99.8% accuracy, respectively, but also on the action recognition datasets Something-Something V1 and V2, where the network achieves state-of-the-art performance. Because TIM-SLR is composed solely of 2D convolutions and a temporal interaction module (TIM), it attains high accuracy while keeping its inference speed and parameter count suitable for practical applications.
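The abstract does not detail how the temporal interaction module operates, so the following is only a minimal sketch of how temporal interaction can be grafted onto a purely 2D-convolutional backbone, assuming a temporal-shift-style design; the class names TemporalInteraction and TIMBlock and the fold_div parameter are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class TemporalInteraction(nn.Module):
    """Parameter-free temporal mixing (assumed, shift-style): a fraction of
    channels is shifted one step along the time axis so that each frame's
    2D features can interact with its neighbours."""

    def __init__(self, n_segments: int, fold_div: int = 8):
        super().__init__()
        self.n_segments = n_segments  # frames sampled per clip
        self.fold_div = fold_div      # 2/fold_div of the channels get shifted

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x arrives frame-flattened as (N*T, C, H, W); expose the time axis.
        nt, c, h, w = x.shape
        n = nt // self.n_segments
        x = x.view(n, self.n_segments, c, h, w)
        fold = c // self.fold_div
        out = torch.zeros_like(x)
        out[:, :-1, :fold] = x[:, 1:, :fold]                  # shift in from the next frame
        out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift in from the previous frame
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels untouched
        return out.view(nt, c, h, w)

class TIMBlock(nn.Module):
    """Residual block: temporal interaction followed by an ordinary 2D
    convolution, so the stack stays free of 3D convolutions."""

    def __init__(self, channels: int, n_segments: int):
        super().__init__()
        self.tim = TemporalInteraction(n_segments)
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.conv(self.tim(x))

# Usage sketch: a batch of 2 clips, 16 frames each, 64-channel feature maps.
x = torch.randn(2 * 16, 64, 56, 56)
block = TIMBlock(channels=64, n_segments=16)
y = block(x)  # shape preserved: (32, 64, 56, 56)

Since the shift itself adds no parameters, a block of this kind keeps the parameter count of a plain 2D CNN, which is consistent with the lightweight claim above.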



Data availability statement

The datasets generated during and/or analysed during the current study are not publicly available due to funding requirements but are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 61973065 and 52075531, in part by the Fundamental Research Funds for the Central Universities of China under Grant N2104008, in part by the Central Government Guides Local Science and Technology Development Special Fund under Grant 2021JH6/10500129, and in part by the Innovative Talents Support Program of Liaoning Provincial Universities under Grant LR2020047.

Author information


Corresponding authors

Correspondence to Fei Wang or Shuai Han.

Ethics declarations

Conflicts of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, F., Zhang, L., Yan, H. et al. TIM-SLR: a lightweight network for video isolated sign language recognition. Neural Comput & Applic 35, 22265–22280 (2023). https://doi.org/10.1007/s00521-023-08873-7

