Abstract
Sign language video understanding requires capturing both spatial and temporal information in sign language video clips. We propose the Lightweight Sign Transformer Framework, a two-stream lightweight network incorporating a transformer architecture, with one stream processing RGB frames and the other RGB frame differences. It leverages recent advances in computer vision and natural language processing and applies them to video understanding. We then evaluate the video transformer network on sign language datasets and obtain excellent performance. Furthermore, we compare our network with the I3D network (Carreira and Zisserman in Quo vadis, action recognition? A new model and the Kinetics dataset. IEEE) and show better performance.
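As an illustrative sketch only (not the authors' implementation), the two ingredients named above can be expressed directly: the second stream consumes consecutive-frame RGB differences rather than raw frames, and the two streams' class scores can be combined by late fusion. The equal fusion weight here is an assumption for illustration.

```python
import numpy as np

def rgb_difference(frames: np.ndarray) -> np.ndarray:
    """Map a (T, H, W, 3) RGB clip to its (T-1, H, W, 3) temporal
    differences, the input of the RGB-difference stream."""
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]

def late_fuse(logits_rgb: np.ndarray, logits_diff: np.ndarray,
              w: float = 0.5) -> np.ndarray:
    """Weighted average of the two streams' per-class scores
    (assumed equal weights; the paper may fuse differently)."""
    return w * logits_rgb + (1.0 - w) * logits_diff
```

For example, an 8-frame clip yields 7 difference frames, and fusing scores `[2, 0]` and `[0, 2]` with equal weights gives `[1, 1]`.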
References
Asadi-Aghbolaghi, M., Clapés, A., Bellantonio, M., Escalante, H.J., Ponce-López, V., Baró, X., Guyon, I., Kasaei, S., Escalera, S.: Deep learning for action and gesture recognition in image sequences: a survey. In: Escalera, S., Guyon, I., Athitsos, V. (eds.) Gesture Recognition. Springer, Cham (2017)
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal multiplier networks for video action recognition. In: Proceedings of the IEEE Conference On Computer Vision and Pattern Recognition, pp. 4768–4777 (2017)
Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941 (2016)
Ge, L., Liang, H., Yuan, J., Thalmann, D.: 3D convolutional neural networks for efficient and robust hand pose estimation from single depth images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1991–2000 (2017)
He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
He, K., Zhang, X., Ren, S., et al.: Identity mappings in deep residual networks. In: European Conference on Computer Vision, pp. 630–645. Springer, Cham (2016)
Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 (2016)
Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382 (2016)
Huang, J., Zhou, W., Zhang, Q., et al.: Video-based sign language recognition without temporal segmentation (2018)
Huang, W., Fan, L., Harandi, M., Ma, L., Liu, H., Liu, W., Gan, C.: Toward efficient action recognition: principal backpropagation for training two-stream networks. IEEE Trans. Image Process. 28(4), 1773–1782 (2018)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
Kozlov, A., Andronov, V., Gritsenko, Y.: Lightweight network architecture for real-time action recognition (2019)
Sevilla-Lara, L., Liao, Y., Güney, F., et al.: On the integration of optical flow and action recognition (2017)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems. pp. 568–576 (2014)
Veit, A., Wilber, M.J., Belongie, S.: Residual networks behave like ensembles of relatively shallow networks. In: Advances in Neural Information Processing Systems. pp. 550–558 (2016)
Wan, C., Probst, T., Van Gool, L., Yao, A.: Dense 3d regression for hand pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156 (2018)
Wang, L., Xiong, Y., Wang, Z., et al.: Temporal segment networks: towards good practices for deep action recognition (2016)
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431 (2016)
Yuan, X.H., Kong, L.B., Feng, D.C., Wei, Z.C.: Automatic feature point detection and tracking of human actions in time-of-flight videos. IEEE/CAA J. Autom. Sinica 4(4), 677–685 (2017)
Zhu, Y., Lan, Z. Z., Newsam, S., Hauptmann, A.: Hidden two-stream convolutional networks for action recognition. In: Proc. 14th Asian Conf. Computer Vision, Perth, Australia (2018)
Zhu, C., Yang, J., Shao, Z.P., Liu, C.P.: Vision based hand gesture recognition using 3D shape context. IEEE/CAA J. Autom. Sinica (2019). https://doi.org/10.1109/JAS.2019.1911534
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 61973334), the Research Center of Security Video and Image Processing Engineering Technology of Guizhou (China) under the SRC-Open Project (Grant [2020]001), and the Beijing Advanced Innovation Center for Intelligent Robots and Systems (China) under Grant 2018IRS20.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Wang, L., Chen, Y., Mei, X. et al. Lightweight sign transformer framework. SIViP 17, 381–387 (2023). https://doi.org/10.1007/s11760-022-02243-x