Abstract
Sign language video understanding requires capturing both spatial and temporal information in sign language video clips. We propose the Lightweight Sign Transformer Framework, a two-stream lightweight network incorporating a transformer architecture, with one stream processing RGB frames and the other RGB frame differences. It leverages recent advances in computer vision and natural language processing and applies them to video understanding. We then evaluate the video transformer network on sign language datasets and obtain excellent performance. Furthermore, we compare our network with the I3D network (Carreira and Zisserman in Quo vadis, action recognition? A new model and the Kinetics dataset. IEEE) and show better performance.
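As an illustrative sketch only (not the authors' implementation), the two ingredients named above can be expressed directly: the second stream consumes consecutive-frame RGB differences rather than raw frames, and the two streams' class scores can be combined by late fusion. The equal fusion weight here is an assumption for illustration.

```python
import numpy as np

def rgb_difference(frames: np.ndarray) -> np.ndarray:
    """Map a (T, H, W, 3) RGB clip to its (T-1, H, W, 3) temporal
    differences, the input of the RGB-difference stream."""
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]

def late_fuse(logits_rgb: np.ndarray, logits_diff: np.ndarray,
              w: float = 0.5) -> np.ndarray:
    """Weighted average of the two streams' per-class scores
    (assumed equal weights; the paper may fuse differently)."""
    return w * logits_rgb + (1.0 - w) * logits_diff
```

For example, an 8-frame clip yields 7 difference frames, and fusing scores `[2, 0]` and `[0, 2]` with equal weights gives `[1, 1]`.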
References
Asadi-Aghbolaghi, M., Clapés, A., Bellantonio, M., Escalante, H.J., Ponce-López, V., Baró, X., Guyon, I., Kasaei, S., Escalera, S.: Deep learning for action and gesture recognition in image sequences: a survey. In: Escalera, S., Guyon, I., Athitsos, V. (eds.) Gesture Recognition. Springer, Cham (2017)
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal multiplier networks for video action recognition. In: Proceedings of the IEEE Conference On Computer Vision and Pattern Recognition, pp. 4768–4777 (2017)
Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941 (2016)
Ge, L., Liang, H., Yuan, J., Thalmann, D.: 3D convolutional neural networks for efficient and robust hand pose estimation from single depth images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1991–2000 (2017)
He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
He, K., Zhang, X., Ren, S., et al.: Identity mappings in deep residual networks. In: European Conference on Computer Vision, pp. 630–645. Springer, Cham (2016)
Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 (2016)
Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382 (2016)
Huang, J., Zhou, W., Zhang, Q., et al.: Video-based sign language recognition without temporal segmentation (2018)
Huang, W., Fan, L., Harandi, M., Ma, L., Liu, H., Liu, W., Gan, C.: Toward efficient action recognition: principal backpropagation for training two-stream networks. IEEE Trans. Image Process. 28(4), 1773–1782 (2018)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
Kozlov, A., Andronov, V., Gritsenko, Y.: Lightweight network architecture for real-time action recognition (2019)
Sevilla-Lara, L., Liao, Y., Güney, F., et al.: On the integration of optical flow and action recognition (2017)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems. pp. 568–576 (2014)
Veit, A., Wilber, M.J., Belongie, S.: Residual networks behave like ensembles of relatively shallow networks. In: Advances in Neural Information Processing Systems. pp. 550–558 (2016)
Wan, C., Probst, T., Van Gool, L., Yao, A.: Dense 3d regression for hand pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156 (2018)
Wang, L., Xiong, Y., Wang, Z., et al.: Temporal segment networks: towards good practices for deep action recognition (2016)
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431 (2016)
Yuan, X.H., Kong, L.B., Feng, D.C., Wei, Z.C.: Automatic feature point detection and tracking of human actions in time-of-flight videos. IEEE/CAA J. Autom. Sinica 4(4), 677–685 (2017)
Zhu, Y., Lan, Z. Z., Newsam, S., Hauptmann, A.: Hidden two-stream convolutional networks for action recognition. In: Proc. 14th Asian Conf. Computer Vision, Perth, Australia (2018)
Zhu, C., Yang, J., Shao, Z.P., Liu, C.P.: Vision based hand gesture recognition using 3D shape context. IEEE/CAA J. Autom. Sinica (2019). https://doi.org/10.1109/JAS.2019.1911534
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 61973334), the Research Center of Security Video and Image Processing Engineering Technology of Guizhou (China) under the SRC-Open Project (Grant [2020]001), and the Beijing Advanced Innovation Center for Intelligent Robots and Systems (China) under Grant 2018IRS20.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Wang, L., Chen, Y., Mei, X. et al. Lightweight sign transformer framework. SIViP 17, 381–387 (2023). https://doi.org/10.1007/s11760-022-02243-x