Abstract
Vision-based sign language translation (SLT) has, to some extent, narrowed the communication gap between deaf and hearing people. SLT faces two main obstacles: first, when capturing sign language motion features, it is difficult to overcome shortcomings such as redundant information in gesture features and motion ambiguity; second, when processing sentence-level sign language videos, it is hard to define the alignment between action sequences and lexical sequences. To address these problems, this paper proposes a sign language translation method based on a residual spatial graph convolution network (Res-SGCN) and a temporal attention model. The Res-SGCN module captures the spatial interaction features between sign language skeleton nodes, and the temporal attention network then fuses the temporal information of the spatial feature sequence and aligns it with the predicted vocabulary for translation. Experiments on a public dataset show that the proposed model achieves a word error rate (WER) of 4.17%, outperforming other state-of-the-art sign language translation methods.
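The paper's exact layer definitions are not reproduced in this excerpt, but the three ideas the abstract names can be illustrated concretely. Below is a minimal NumPy sketch, under assumed shapes and a toy 3-joint skeleton: a residual spatial graph convolution step (propagate joint features along skeleton edges, then add the input back through an identity shortcut), a simple temporal attention pooling over frame embeddings, and the word error rate metric used for evaluation. All function names, dimensions, and the adjacency matrix are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def normalize_adjacency(A):
    # Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, a common
    # choice for graph convolutions over skeleton graphs.
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def res_sgcn_layer(H, A_norm, W):
    # Residual spatial graph convolution: mix each joint's features
    # with its skeleton neighbours, then add the input back
    # (identity shortcut; assumes input and output dims match).
    out = np.maximum(A_norm @ H @ W, 0.0)  # ReLU activation
    return H + out

def temporal_attention(X, w):
    # Score each frame, softmax over time, return the weighted sum
    # (context vector) and the attention weights.
    scores = X @ w
    scores = scores - scores.max()         # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ X, alpha

def wer(ref, hyp):
    # Word error rate: word-level Levenshtein distance between the
    # reference and hypothesis, divided by the reference length.
    n, m = len(ref), len(hyp)
    dp = list(range(m + 1))                # one-row DP table
    for i in range(1, n + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, m + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,         # deletion
                        dp[j - 1] + 1,     # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[m] / n

# Toy demo: a 3-joint chain skeleton, 10-frame sequence.
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
A_norm = normalize_adjacency(A)
H = rng.standard_normal((3, 8))            # 3 joints x 8 feature dims
W = rng.standard_normal((8, 8)) * 0.1
H2 = res_sgcn_layer(H, A_norm, W)          # same shape as H

seq = rng.standard_normal((10, 8))         # 10 per-frame embeddings
context, alpha = temporal_attention(seq, rng.standard_normal(8))
```

As a sanity check on the metric, `wer("my name is sam".split(), "my name sam".split())` counts one deleted word out of four reference words, i.e. 25%.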
Data Availability statements
All data included in this study are available upon request from the corresponding author. The datasets generated and/or analysed during the current study are available in the CCSL repository: http://home.ustc.edu.cn/~pjh/openresources/cslr-dataset-2015/index.html.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant 62172280. The authors thank the CAS Key Laboratory of Technology in Geospatial Information Processing and Application System (GIPAS), University of Science and Technology of China (USTC), for releasing the CCSL database.
Ethics declarations
Conflict of Interests
No conflict of interest exists in the submission of this manuscript, and the manuscript has been approved by all authors for publication. On behalf of my co-authors, I declare that the work described is original research that has not been published previously and is not under consideration for publication elsewhere.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xu, W., Ying, J., Yang, H. et al. Residual spatial graph convolution and temporal sequence attention network for sign language translation. Multimed Tools Appl 82, 23483–23507 (2023). https://doi.org/10.1007/s11042-022-14172-5