Spatial-Temporal Graph Transformer for Skeleton-Based Sign Language Recognition

  • Conference paper
  • Neural Information Processing (ICONIP 2022)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1793)

Abstract

For continuous sign language recognition (CSLR), the skeleton sequence is insensitive to environmental variation and has therefore attracted much attention. Previous studies of the skeleton modality mainly employ hand-crafted features or spatial-temporal graph convolutional networks, and neglect the importance of capturing information between distant nodes and long-term context in CSLR. To learn more robust spatial-temporal features for CSLR, we propose a Spatial-Temporal Graph Transformer (STGT) model for skeleton-based CSLR. With the self-attention mechanism, the human skeleton graph is treated as a fully connected graph, so relationships between distant nodes can be established directly in the spatial dimension. In the temporal dimension, long-term context is learned easily thanks to the transformer architecture. Moreover, we propose graph positional embedding and graph multi-head self-attention to help the STGT distinguish the meanings of different nodes. We conduct an ablation study on an action recognition dataset to validate the effectiveness of our method and analyze its advantages. Experimental results on two CSLR datasets demonstrate the superiority of the STGT for skeleton-based CSLR.
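
To make the abstract's design concrete, the following is a minimal PyTorch sketch of the two ideas it names: frame-wise multi-head self-attention over a fully connected skeleton graph with a learnable graph positional embedding, and a transformer over the temporal dimension for long-term context. This is an illustrative reading of the abstract, not the authors' implementation; the module names, layer sizes, the plain nn.TransformerEncoder used for the temporal stream, and the pooled classification head (real CSLR would instead emit per-frame gloss logits trained with a CTC loss) are all assumptions.

```python
# Minimal sketch of the ideas described in the abstract, NOT the authors'
# released code. Layer names, dimensions, and the plain transformer encoder
# for the temporal stream are illustrative assumptions.
import torch
import torch.nn as nn


class SpatialGraphAttention(nn.Module):
    """Frame-wise multi-head self-attention over skeleton joints.

    Treating the skeleton as a fully connected graph lets every joint
    attend to every other joint, so distant nodes interact directly.
    A learnable per-joint embedding (the "graph positional embedding")
    lets attention tell otherwise similar joints apart.
    """

    def __init__(self, num_joints: int, dim: int, num_heads: int = 4):
        super().__init__()
        self.joint_embed = nn.Parameter(torch.zeros(1, num_joints, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * time, joints, dim) -- one skeleton graph per frame
        h = self.norm(x + self.joint_embed)
        out, _ = self.attn(h, h, h)  # fully connected: no adjacency mask
        return x + out               # residual connection


class STGTSketch(nn.Module):
    """Spatial attention per frame, then a temporal transformer per joint."""

    def __init__(self, num_joints: int = 25, in_dim: int = 3,
                 dim: int = 64, num_heads: int = 4, num_classes: int = 100):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        self.spatial = SpatialGraphAttention(num_joints, dim, num_heads)
        layer = nn.TransformerEncoderLayer(dim, num_heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, joints, in_dim) skeleton coordinates
        b, t, v, _ = x.shape
        h = self.proj(x)
        h = self.spatial(h.reshape(b * t, v, -1)).reshape(b, t, v, -1)
        # Fold joints into the batch so each joint's trajectory becomes a
        # sequence; the transformer then captures long-term temporal context.
        h = h.permute(0, 2, 1, 3).reshape(b * v, t, -1)
        h = self.temporal(h)
        h = h.reshape(b, v, t, -1).mean(dim=(1, 2))  # pool joints and time
        return self.head(h)  # clip-level logits; CSLR would keep the time
                             # axis and decode gloss sequences with CTC


if __name__ == "__main__":
    model = STGTSketch()
    clip = torch.randn(2, 32, 25, 3)  # 2 clips, 32 frames, 25 joints, xyz
    print(model(clip).shape)          # torch.Size([2, 100])
```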

Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grants 61976132, 61991411, and U1811461, and by the Natural Science Foundation of Shanghai under Grant 19ZR1419200.

We thank the High Performance Computing Center of Shanghai University and the Shanghai Engineering Research Center of Intelligent Computing System (No. 19DZ2252600) for providing computing resources.

Author information

Correspondence to Yuchun Fang or Lan Ni.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Xiao, Z., Lin, S., Wan, X., Fang, Y., Ni, L. (2023). Spatial-Temporal Graph Transformer for Skeleton-Based Sign Language Recognition. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1793. Springer, Singapore. https://doi.org/10.1007/978-981-99-1645-0_12

  • DOI: https://doi.org/10.1007/978-981-99-1645-0_12

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-1644-3

  • Online ISBN: 978-981-99-1645-0

  • eBook Packages: Computer Science, Computer Science (R0)
