
Residual spatial graph convolution and temporal sequence attention network for sign language translation

Published in Multimedia Tools and Applications

Abstract

Vision-based sign language translation (SLT) has, to some extent, narrowed the communication gap between deaf and hearing people. SLT faces two main obstacles: first, when capturing sign language action features, it is difficult to suppress the redundant information and motion ambiguity in gesture features; second, when processing sentence-level sign language videos, it is hard to establish the alignment between action sequences and lexical sequences. To address these problems, this paper proposes a sign language translation method based on a residual spatial graph convolutional network (Res-SGCN) and a temporal attention model. The Res-SGCN module captures the spatial interaction features among the sign language skeleton nodes; the temporal attention network then fuses the temporal information of the resulting spatial feature sequence and aligns it with the predicted vocabulary for translation. Experiments on a public dataset show that the proposed model achieves a word error rate (WER) of 4.17%, outperforming other state-of-the-art sign language translation methods.
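To make the pipeline the abstract outlines more concrete, below is a minimal sketch, assuming PyTorch, of the two components it names: a residual spatial graph convolution block over the skeleton joints, followed by self-attention across frames that produces per-frame gloss logits suitable for a CTC-style alignment. The module names (ResSGCNBlock, TemporalAttention, SLTSketch), the layer sizes, the joint-pooling step, and the choice of multi-head self-attention are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResSGCNBlock(nn.Module):
    """Graph convolution over skeleton joints with a residual connection."""
    def __init__(self, in_dim: int, out_dim: int, adjacency: torch.Tensor):
        super().__init__()
        # Symmetrically normalized adjacency with self-loops:
        # A_hat = D^{-1/2} (A + I) D^{-1/2}, fixed per skeleton layout.
        a = adjacency + torch.eye(adjacency.size(0))
        d = a.sum(dim=1).pow(-0.5)
        self.register_buffer("a_hat", d.unsqueeze(1) * a * d.unsqueeze(0))
        self.linear = nn.Linear(in_dim, out_dim)
        # 1x1 projection so the residual path matches when dimensions change.
        self.residual = nn.Linear(in_dim, out_dim) if in_dim != out_dim else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, in_dim); propagate features along skeleton edges.
        out = self.linear(torch.einsum("vw,btwc->btvc", self.a_hat, x))
        return F.relu(out + self.residual(x))

class TemporalAttention(nn.Module):
    """Self-attention over the frame axis of pooled per-frame features."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim); every frame attends to every other frame.
        out, _ = self.attn(x, x, x)
        return out

class SLTSketch(nn.Module):
    """Skeleton frames -> spatial graph features -> temporal attention -> per-frame gloss logits."""
    def __init__(self, adjacency: torch.Tensor, vocab_size: int, dim: int = 64):
        super().__init__()
        self.sgcn = ResSGCNBlock(3, dim, adjacency)  # 3 channels per joint, e.g. (x, y, confidence)
        self.temporal = TemporalAttention(dim)
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, skeletons: torch.Tensor) -> torch.Tensor:
        spatial = self.sgcn(skeletons)        # (batch, frames, joints, dim)
        frames = spatial.mean(dim=2)          # pool over joints -> (batch, frames, dim)
        return self.classifier(self.temporal(frames))  # (batch, frames, vocab_size)

# Usage with a hypothetical 15-joint skeleton and 100-word gloss vocabulary:
adjacency = torch.zeros(15, 15)  # fill in skeleton edges per the dataset's joint layout
model = SLTSketch(adjacency, vocab_size=100)
logits = model(torch.randn(2, 32, 15, 3))  # 2 clips, 32 frames -> (2, 32, 100)

For context on the reported metric: word error rate is the edit distance (substitutions, deletions, and insertions) between the predicted and reference word sequences, divided by the reference length, so a WER of 4.17% corresponds to roughly four word-level errors per hundred reference words.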




Data availability statement

All data included in this study are available from the corresponding author upon request. The datasets generated and/or analysed during the current study are available in the CCSL repository: http://home.ustc.edu.cn/~pjh/openresources/cslr-dataset-2015/index.html



Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 62172280. The authors thank the CAS Key Laboratory of Technology in Geospatial Information Processing and Application System (GIPAS), University of Science and Technology of China (USTC), for releasing the CCSL database.

Author information


Corresponding author

Correspondence to Jie Ying.

Ethics declarations

Conflict of interest

No conflict of interest exists in the submission of this manuscript, and the manuscript has been approved by all authors for publication. On behalf of my co-authors, I declare that the work described is original research that has not been published previously and is not under consideration for publication elsewhere.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xu, W., Ying, J., Yang, H. et al. Residual spatial graph convolution and temporal sequence attention network for sign language translation. Multimed Tools Appl 82, 23483–23507 (2023). https://doi.org/10.1007/s11042-022-14172-5

