DOI: 10.1145/3581783.3611820
research-article

Towards Real-Time Sign Language Recognition and Translation on Edge Devices

Published: 27 October 2023

ABSTRACT

To provide instant communication for hearing-impaired people, sign language must be processed in real time, anytime and anywhere. In this paper, we propose a Region-aware Temporal Graph based neural Network (RTG-Net) that aims to achieve real-time Sign Language Recognition (SLR) and Translation (SLT) on edge devices. To reduce computation overhead, we first construct a shallow graph convolution network, which reduces model size by decreasing model depth. We also apply structural re-parameterization to fuse the convolutional layer, the batch normalization layer, and all branches, which simplifies the model by reducing its width. To maintain high performance in sign language processing, we extract key regions from each frame based on skeleton keypoints, and design a region-aware temporal graph that combines the key regions and the full frame for feature representation. RTG-Net is trained with a multi-stage strategy that optimizes keypoint selection, SLR, and SLT step by step. Experimental results demonstrate that RTG-Net achieves performance comparable to existing methods in SLR or SLT, while greatly reducing computation overhead and achieving real-time sign language processing on edge devices. Our code is available at https://github.com/SignLanguageCode/realtimeSLRT.
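
The structural re-parameterization step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a plain linear layer (equivalent to a 1x1 convolution at a single position) in place of RTG-Net's graph convolution, inference-time batch normalization with fixed running statistics, and made-up toy parameters. All function and variable names are hypothetical. It shows only the general idea: fold each batch normalization into the preceding layer, then sum the parallel branches into one layer.

```python
import math

def linear(weight, bias, x):
    """Apply y = W x + b to one input vector (a 1x1 convolution at one position)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization using fixed running statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]

def fuse_linear_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch normalization affine transform into the layer's weight and bias."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    fused_w = [[w * sc for w in row] for row, sc in zip(weight, scale)]
    fused_b = [(b - m) * sc + bt
               for b, m, sc, bt in zip(bias, mean, scale, beta)]
    return fused_w, fused_b

def merge_branches(branches):
    """Sum parallel (weight, bias) branches that share input and output sizes."""
    n_out, n_in = len(branches[0][0]), len(branches[0][0][0])
    merged_w = [[sum(w[i][j] for w, _ in branches) for j in range(n_in)]
                for i in range(n_out)]
    merged_b = [sum(b[i] for _, b in branches) for i in range(n_out)]
    return merged_w, merged_b

# Toy parameters (assumed values, for illustration only).
x = [1.0, -2.0]
branches = [
    {"weight": [[0.5, -1.0], [2.0, 0.25]], "bias": [0.1, -0.3],
     "gamma": [1.5, 0.8], "beta": [0.0, 0.2],
     "mean": [0.4, -0.1], "var": [0.9, 1.6]},
    # Identity-style branch: an identity weight followed by its own batch norm.
    {"weight": [[1.0, 0.0], [0.0, 1.0]], "bias": [0.0, 0.0],
     "gamma": [0.7, 1.1], "beta": [-0.2, 0.05],
     "mean": [0.0, 0.3], "var": [1.2, 0.5]},
]

# Multi-branch form: run every branch with its batch norm, then sum the outputs.
multi_branch = [0.0, 0.0]
for p in branches:
    out = batch_norm(linear(p["weight"], p["bias"], x),
                     p["gamma"], p["beta"], p["mean"], p["var"])
    multi_branch = [a + o for a, o in zip(multi_branch, out)]

# Re-parameterized form: one fused layer that matches the multi-branch output.
fused_w, fused_b = merge_branches([fuse_linear_bn(**p) for p in branches])
single_layer = linear(fused_w, fused_b, x)
```

After fusion, inference needs a single matrix multiplication per layer instead of one per branch plus the normalization steps, which is where the reduction in computation comes from in this simplified setting.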

Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Copyright © 2023 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%
