ABSTRACT
To provide instant communication for hearing-impaired people, it is essential to achieve real-time sign language processing anytime and anywhere. In this paper, we therefore propose a Region-aware Temporal Graph based neural Network (RTG-Net), which aims to achieve real-time Sign Language Recognition (SLR) and Translation (SLT) on edge devices. To reduce the computation overhead, we first construct a shallow graph convolution network, which reduces model size by decreasing model depth. In addition, we apply structural re-parameterization to fuse the convolutional layer, the batch normalization layer and all branches, which simplifies the model by reducing its width. To also achieve high performance in sign language processing, we extract key regions from each frame based on the skeleton keypoints, and design a region-aware temporal graph that combines the key regions and the full frame for feature representation. In RTG-Net, we design a multi-stage training strategy to optimize keypoint selection, SLR and SLT step by step. Experimental results demonstrate that RTG-Net achieves performance comparable to existing methods in SLR or SLT, while greatly reducing the computation overhead and achieving real-time sign language processing on edge devices. Our code is available at https://github.com/SignLanguageCode/realtimeSLRT.
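The structural re-parameterization step mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the RTG-Net code base: it assumes a plain 1-D convolution, inference-mode batch normalization with fixed running statistics, and parallel branches that share the same kernel shape. All function names (`conv1d`, `batch_norm`, `fuse_conv_bn`, `merge_branches`) and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def conv1d(x, weight, bias):
    """Valid 1-D cross-correlation. x: [C_in][T], weight: [C_out][C_in][K], bias: [C_out]."""
    c_in, k = len(weight[0]), len(weight[0][0])
    t_out = len(x[0]) - k + 1
    return [
        [
            bias[o] + sum(weight[o][i][j] * x[i][t + j]
                          for i in range(c_in) for j in range(k))
            for t in range(t_out)
        ]
        for o in range(len(weight))
    ]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch norm: per-channel affine map with fixed statistics."""
    return [
        [gamma[o] * (v - mean[o]) / math.sqrt(var[o] + eps) + beta[o] for v in row]
        for o, row in enumerate(y)
    ]


def fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm parameters into the convolution weight and bias."""
    fused_w, fused_b = [], []
    for o in range(len(weight)):
        scale = gamma[o] / math.sqrt(var[o] + eps)
        fused_w.append([[w * scale for w in kernel] for kernel in weight[o]])
        fused_b.append((bias[o] - mean[o]) * scale + beta[o])
    return fused_w, fused_b


def merge_branches(w1, b1, w2, b2):
    """Sum two parallel conv branches with identical kernel shapes into one conv."""
    w = [[[a + b for a, b in zip(k1, k2)] for k1, k2 in zip(o1, o2)]
         for o1, o2 in zip(w1, w2)]
    return w, [a + b for a, b in zip(b1, b2)]


# Illustrative input: 2 input channels, 4 time steps, 1 output channel, kernel size 2.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
w_a, b_a = [[[0.2, -0.1], [0.3, 0.4]]], [0.1]
w_b, b_b = [[[-0.5, 0.7], [0.0, 0.25]]], [-0.3]
bn_a = dict(gamma=[1.5], beta=[-0.2], mean=[0.3], var=[0.8])
bn_b = dict(gamma=[0.9], beta=[0.4], mean=[-0.1], var=[1.2])

# Multi-branch form: two conv + BN branches whose outputs are added.
out_a = batch_norm(conv1d(x, w_a, b_a), **bn_a)
out_b = batch_norm(conv1d(x, w_b, b_b), **bn_b)
reference = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(out_a, out_b)]

# Re-parameterized form: a single convolution with no BN and no branches.
fw_a, fb_a = fuse_conv_bn(w_a, b_a, **bn_a)
fw_b, fb_b = fuse_conv_bn(w_b, b_b, **bn_b)
w_single, b_single = merge_branches(fw_a, fb_a, fw_b, fb_b)
fused = conv1d(x, w_single, b_single)

max_diff = max(abs(p - q) for rp, rq in zip(reference, fused) for p, q in zip(rp, rq))
print("max abs difference:", max_diff)
```

Because convolution and inference-mode batch normalization are both affine in the input, the fused layer reproduces the multi-branch output up to floating-point rounding while needing only one convolution at inference time.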
Index Terms
- Towards Real-Time Sign Language Recognition and Translation on Edge Devices