
Lightweight sign transformer framework

  • Original Paper
  • Published in: Signal, Image and Video Processing

Abstract

Sign language video understanding requires capturing both spatial and temporal information in sign language video clips. We propose the Lightweight Sign Transformer Framework, a lightweight two-stream network built on a transformer architecture, with one stream processing RGB frames and the other RGB frame differences. It leverages recent advances in computer vision and natural language processing and applies them to video understanding. We evaluate the video transformer network on sign language datasets and obtain excellent performance. Furthermore, we compare our network with the I3D network (Carreira and Zisserman in Quo vadis, action recognition? A new model and the Kinetics dataset, IEEE, 2017) and show that it performs better.
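The abstract does not give implementation details, but the RGB-difference stream of a two-stream design is commonly formed by subtracting consecutive frames as a lightweight proxy for motion. A minimal sketch, assuming that convention (function name and shapes are illustrative, not the authors' code):

```python
import numpy as np

def rgb_difference_stream(clip: np.ndarray) -> np.ndarray:
    """Frame-to-frame RGB differences for a clip of shape (T, H, W, 3).

    Casting to float32 first avoids uint8 wraparound on negative
    differences; the result has shape (T - 1, H, W, 3).
    """
    frames = clip.astype(np.float32)
    return frames[1:] - frames[:-1]

# Toy clip: 4 frames of 2x2 RGB pixels.
clip = np.arange(4 * 2 * 2 * 3, dtype=np.uint8).reshape(4, 2, 2, 3)
diff = rgb_difference_stream(clip)
print(diff.shape)  # (3, 2, 2, 3)
```

In a two-stream setup, `clip` would feed the RGB stream while `diff` feeds the RGB-difference stream; the paper's exact frame sampling and fusion strategy are not specified in this excerpt.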


[Figs. 1–7: images not reproduced]


References

  1. Asadi-Aghbolaghi, M., Clapés, A., Bellantonio, M., Escalante, H.J., Ponce-López, V., Baró, X., Guyon, I., Kasaei, S., Escalera, S.: Deep learning for action and gesture recognition in image sequences: a survey. In: Escalera, S., Guyon, I., Athitsos, V. (eds.) Gesture Recognition. Springer, Cham (2017)


  2. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)

  3. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)

  4. Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal multiplier networks for video action recognition. In: Proceedings of the IEEE Conference On Computer Vision and Pattern Recognition, pp. 4768–4777 (2017)

  5. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 1933–1941 (2016)

  6. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941 (2016)

  7. Ge, L., Liang, H., Yuan, J., Thalmann, D.: 3D convolutional neural networks for efficient and robust hand pose estimation from single depth images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1991–2000 (2017)

  8. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  9. He, K., Zhang, X., Ren, S., et al.: Identity mappings in deep residual networks. In: European Conference on Computer Vision, pp. 630–645. Springer, Cham (2016)

  10. Huang, G., Liu, Z., Weinberger, K.Q., Maaten, L.: Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 (2016)

  11. Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382 (2016)

  12. Huang, J., Zhou, W., Zhang, Q., et al.: Video-based sign language recognition without temporal segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence (2018)

  13. Huang, W., Fan, L., Harandi, M., Ma, L., Liu, H., Liu, W., Gan, C.: Toward efficient action recognition: principal backpropagation for training two-stream networks. IEEE Trans. Image Process. 28(4), 1773–1782 (2018)


  14. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)

  15. Kozlov, A., Andronov, V., Gritsenko, Y.: Lightweight network architecture for real-time action recognition (2019)

  16. Sevilla-Lara, L., Liao, Y., Guney, F., et al.: On the integration of optical flow and action recognition (2017)

  17. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems. pp. 568–576 (2014)

  18. Veit, A., Wilber, M., Belongie, S.: Residual networks behave like ensembles of relatively shallow networks. arXiv preprint arXiv:1605.06431 (2016)

  19. Veit, A., Wilber, M.J., Belongie, S.: Residual networks behave like ensembles of relatively shallow networks. In: Advances in Neural Information Processing Systems. pp. 550–558 (2016)

  20. Wan, C., Probst, T., Van Gool, L., Yao, A.: Dense 3d regression for hand pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156 (2018)

  21. Wang, L., Xiong, Y., Wang, Z., et al.: Temporal segment networks: towards good practices for deep action recognition. In: European Conference on Computer Vision, pp. 20–36. Springer, Cham (2016)

  22. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431 (2016)

  23. Yuan, X.H., Kong, L.B., Feng, D.C., Wei, Z.C.: Automatic feature point detection and tracking of human actions in time-of-flight videos. IEEE/CAA J. Autom. Sinica 4(4), 677–685 (2017)


  24. Zhu, Y., Lan, Z. Z., Newsam, S., Hauptmann, A.: Hidden two-stream convolutional networks for action recognition. In: Proc. 14th Asian Conf. Computer Vision, Perth, Australia (2018)

  25. Zhu, C., Yang, J., Shao, Z.P., Liu, C.P.: Vision based hand gesture recognition using 3D shape context. IEEE/CAA J. Autom. Sinica (2019). https://doi.org/10.1109/JAS.2019.1911534



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61973334), the Research Center of Security Video and Image Processing Engineering Technology of Guizhou (China) under the SRC-Open Project Grant ([2020]001), and the Beijing Advanced Innovation Center for Intelligent Robots and Systems (China) under Grant 2018IRS20.

Author information

Corresponding author

Correspondence to Lingyan Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article


Cite this article

Wang, L., Chen, Y., Mei, X. et al. Lightweight sign transformer framework. SIViP 17, 381–387 (2023). https://doi.org/10.1007/s11760-022-02243-x
