Abstract
Sign Language Translation (SLT) is an important sequence-to-sequence problem that has proven challenging to solve because of the many factors that influence the meaning of a sign. In this paper, we implement a Multi Context Transformer architecture that addresses this problem by operating on batched video-segment representations called context vectors, which are intended to capture the temporal dependencies between frames so that the input signs can be translated accurately. Because the architecture is end-to-end, it also eliminates the need for the intermediate sign-language annotations known as glosses. Our model produces results on par with the state of the art (retaining 98.19% of the state-of-the-art ROUGE-L score and 86.65% of the BLEU-4 score) while reducing model parameters by 30.88%, which makes it suitable for real-world applications. Our implementation is available on GitHub (https://github.com/MBadriNarayanan/MultiContextTransformer).
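The repository linked above contains the authors' full implementation; the sketch below is only a minimal, illustrative PyTorch reconstruction of the core idea as the abstract states it: per-frame features are pooled into context vectors at several temporal window sizes, each stream is encoded by its own transformer encoder, and the streams are fused. The class name, the window sizes (8, 12, 16), the pooling scheme, and the fusion-by-concatenation step are our assumptions for illustration, not details taken from the paper.

    import torch
    import torch.nn as nn

    class MultiContextEncoder(nn.Module):
        # Hypothetical sketch of a multi-context encoder. Names and
        # hyper-parameters are illustrative assumptions, not the
        # authors' published configuration.
        def __init__(self, feat_dim=1024, d_model=512, nhead=8,
                     num_layers=2, window_sizes=(8, 12, 16)):
            super().__init__()
            self.window_sizes = window_sizes
            self.proj = nn.Linear(feat_dim, d_model)
            layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=nhead, batch_first=True)
            # nn.TransformerEncoder deep-copies `layer`, so each
            # context stream gets its own independent encoder weights.
            self.encoders = nn.ModuleList(
                nn.TransformerEncoder(layer, num_layers=num_layers)
                for _ in window_sizes
            )
            self.fuse = nn.Linear(d_model * len(window_sizes), d_model)

        def context_vectors(self, frames, window):
            # frames: (batch, time, feat_dim). Average non-overlapping
            # windows of `window` frames into one context vector each.
            b, t, f = frames.shape
            t_trim = (t // window) * window
            segments = frames[:, :t_trim].reshape(
                b, t_trim // window, window, f)
            return segments.mean(dim=2)

        def forward(self, frames):
            streams = []
            for window, encoder in zip(self.window_sizes, self.encoders):
                ctx = self.proj(self.context_vectors(frames, window))
                # Mean-pool each encoded stream over time so streams
                # with different segment counts can be concatenated.
                streams.append(encoder(ctx).mean(dim=1))
            return self.fuse(torch.cat(streams, dim=-1))

    # Example: two videos, 64 frames each, 1024-dim per-frame features.
    frames = torch.randn(2, 64, 1024)
    print(MultiContextEncoder()(frames).shape)  # torch.Size([2, 512])

Giving each window size its own encoder lets short windows capture fine-grained hand-shape transitions while longer windows capture sentence-level context; in a full gloss-free SLT system, the fused representation would condition a transformer decoder that generates the spoken-language sentence directly.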
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Narayanan, M.B., Bharadwaj, K.M., Nithin, G.R., Padamnoor, D., Vijayaraghavan, V. (2021). Sign Language Translation Using Multi Context Transformer. In: Batyrshin, I., Gelbukh, A., Sidorov, G. (eds.) Advances in Soft Computing. MICAI 2021. Lecture Notes in Computer Science, vol. 13068. Springer, Cham. https://doi.org/10.1007/978-3-030-89820-5_25
DOI: https://doi.org/10.1007/978-3-030-89820-5_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89819-9
Online ISBN: 978-3-030-89820-5