Spatial-Temporal Graph Transformer for Skeleton-Based Sign Language Recognition

Xiao, Zhengye; Lin, Shiquan; Wan, Xiuan; Fang, Yuchun; Ni, Lan

doi:10.1007/978-981-99-1645-0_12

Zhengye Xiao¹⁰,
Shiquan Lin¹⁰,
Xiuan Wan¹⁰,
Yuchun Fang¹⁰ &
…
Lan Ni¹¹

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1793))

Included in the following conference series:

International Conference on Neural Information Processing

843 Accesses

Abstract

For continuous sign language recognition (CSLR), the skeleton sequence is insusceptible to environmental variances and achieves much attention. Previous studies mainly employ hand-craft features or the spatial-temporal graph convolution networks for skeleton modality and neglect the importance of capturing the information between distant nodes and the long-term context in CSLR. To learn more robust spatial-temporal features for CSLR, we propose a Spatial-Temporal Graph Transformer (STGT) model for skeleton-based CSLR. With the self-attention mechanism, the human skeleton graph is treated as a fully connected graph, and the relationship between distant nodes can be established directly in the spatial dimension. In the temporal dimension, the long-term context can be learned easily due to the characteristic of the transformer. Moreover, we propose graph positional embedding and graph multi-head self-attention to help the STGT distinguish the meanings of different nodes. We conduct the ablation study on the action recognition dataset to validate the effectiveness and analyze the advantages of our method. The experimental results on two CSLR datasets demonstrate the superiority of the STGT on skeleton-based CSLR.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 99.00; Price excludes VAT (USA)

Softcover Book: USD 129.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

de Amorim, C.C., Macêdo, D., Zanchettin, C.: Spatial-temporal graph convolutional networks for sign language recognition. In: Tetko, I.V., Kůrková, V., Karpov, P., Theis, F. (eds.) ICANN 2019. LNCS, vol. 11731, pp. 646–657. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30493-5_59
Chapter Google Scholar
Baevski, A., Auli, M.: Adaptive input representations for neural language modeling. In: International Conference on Learning Representations (2018)
Google Scholar
Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: Openpose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 43(1), 172–186 (2019)
Article Google Scholar
Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Google Scholar
Cheng, K., Zhang, Y., Cao, C., Shi, L., Cheng, J., Lu, H.: Decoupling GCN with DropGraph module for skeleton-based action recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12369, pp. 536–553. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58586-0_32
Chapter Google Scholar
Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376 (2006)
Google Scholar
Guo, D., Zhou, W., Li, H., Wang, M.: Hierarchical LSTM for sign language translation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Google Scholar
Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W.: Video-based sign language recognition without temporal segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Google Scholar
Huang, S., Ye, Z.: Boundary-adaptive encoder with attention method for Chinese sign language recognition. IEEE Access 9, 70948–70960 (2021)
Article Google Scholar
Koller, O., Forster, J., Ney, H.: Continuous sign language recognition: towards large vocabulary statistical recognition systems handling multiple signers. Comput. Vis. Image Underst. 141, 108–125 (2015)
Article Google Scholar
Li, Q., Han, Z., Wu, X.M.: Deeper insights into graph convolutional networks for semi-supervised learning. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
Google Scholar
Li, S., Li, W., Cook, C., Zhu, C., Gao, Y.: Independently recurrent neural network (indrnn): Building a longer and deeper RNN. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5457–5466 (2018)
Google Scholar
Liu, Z., Zhang, H., Chen, Z., Wang, Z., Ouyang, W.: Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 143–152 (2020)
Google Scholar
Pu, J., Zhou, W., Li, H.: Iterative alignment network for continuous sign language recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4165–4174 (2019)
Google Scholar
Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D: a large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019 (2016)
Google Scholar
Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12026–12035 (2019)
Google Scholar
Si, C., Chen, W., Wang, W., Wang, L., Tan, T.: An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1227–1236 (2019)
Google Scholar
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Google Scholar
Xiong, R., et al.: On layer normalization in the transformer architecture. arXiv preprint arXiv:2002.04745 (2020)
Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: AAAI (2018)
Google Scholar
Zhang, J., Zhou, W., Xie, C., Pu, J., Li, H.: Chinese sign language recognition with adaptive hmm. In: 2016 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2016)
Google Scholar
Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2117–2126 (2017)
Google Scholar
Zheng, W., Li, L., Zhang, Z., Huang, Y., Wang, L.: Relational network for skeleton-based action recognition. In: 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 826–831. IEEE (2019)
Google Scholar
Zhou, H., Zhou, W., Zhou, Y., Li, H.: Spatial-temporal multi-cue network for continuous sign language recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13009–13016 (2020)
Google Scholar

Download references

Acknowledgment

The work is supported by the National Natural Science Foundation of China under Grant No.: 61976132, 61991411 and U1811461, and the Natural Science Foundation of Shanghai under Grant No.: 19ZR1419200.

We appreciate the High Performance Computing Center of Shanghai University and Shanghai Engineering Research Center of Intelligent Computing System No.: 19DZ2252600 for providing computing resources.

Author information

Authors and Affiliations

School of Computer Engineering and Science, Shanghai University, Shanghai, China
Zhengye Xiao, Shiquan Lin, Xiuan Wan & Yuchun Fang
College of Liberal Arts, Shanghai University, Shanghai, China
Lan Ni

Authors

Zhengye Xiao
View author publications
You can also search for this author in PubMed Google Scholar
Shiquan Lin
View author publications
You can also search for this author in PubMed Google Scholar
Xiuan Wan
View author publications
You can also search for this author in PubMed Google Scholar
Yuchun Fang
View author publications
You can also search for this author in PubMed Google Scholar
Lan Ni
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Yuchun Fang or Lan Ni .

Editor information

Editors and Affiliations

Indian Institute of Technology Indore, Indore, India
Mohammad Tanveer
Indian Institute of Information Technology - Allahabad, Prayagraj, India
Sonali Agarwal
Kobe University, Kobe, Japan
Seiichi Ozawa
Indian Institute of Technology Patna, Patna, India
Asif Ekbal
University of Innsbruck, Innsbruck, Austria
Adam Jatowt

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Xiao, Z., Lin, S., Wan, X., Fang, Y., Ni, L. (2023). Spatial-Temporal Graph Transformer for Skeleton-Based Sign Language Recognition. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1793. Springer, Singapore. https://doi.org/10.1007/978-981-99-1645-0_12

Download citation

DOI: https://doi.org/10.1007/978-981-99-1645-0_12
Published: 14 April 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-1644-3
Online ISBN: 978-981-99-1645-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Spatial-Temporal Graph Transformer for Skeleton-Based Sign Language Recognition