Research Article · DOI: 10.1145/3503161.3547957

DualSign: Semi-Supervised Sign Language Production with Balanced Multi-Modal Multi-Task Dual Transformation

Published: 10 October 2022

Abstract

Sign Language Production (SLP) aims to translate a spoken language description into its corresponding continuous sign language sequence. A prevailing solution to this problem is two-staged: it formulates SLP as two sub-tasks, i.e., Text to Gloss (T2G) translation and Gloss to Pose (G2P) animation, with gloss annotations as pivots. Although two-staged approaches achieve better performance than their direct-translation counterparts, the requirement of gloss intermediaries creates a parallel data bottleneck. In this paper, to reduce the reliance on gloss annotations in two-staged approaches, we propose DualSign, a semi-supervised two-staged SLP framework that can effectively utilize partially gloss-annotated text-pose pairs and monolingual gloss data. The key component of DualSign is a novel Balanced Multi-Modal Multi-Task Dual Transformation (BM3T-DT) method, in which two well-designed models, i.e., a Multi-Modal T2G model (MM-T2G) and a Multi-Task G2P model (MT-G2P), are jointly trained by leveraging their task duality and unlabeled data. After applying BM3T-DT, we derive the desired uni-modal T2G model from the well-trained MM-T2G via knowledge distillation. Because the MM-T2G may suffer from modality imbalance when decoding with multiple input modalities, we devise a cross-modal balancing loss that further boosts the system's overall performance. Extensive experiments on the PHOENIX14T dataset show the effectiveness of our approach in the semi-supervised setting. By training with additionally collected unlabeled data, DualSign substantially improves over previous state-of-the-art SLP methods.
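
To make the dual-transformation idea concrete, the sketch below shows, in toy PyTorch, how a T2G model and a G2P model could be trained jointly: supervised losses on gloss-annotated triples, plus pseudo glosses predicted by T2G that supervise G2P on text-pose pairs lacking gloss labels. The module definitions, tensor shapes, vocabulary sizes, and loss weighting here are illustrative assumptions, not the DualSign architecture or the authors' training recipe (which uses multi-modal and multi-task variants of both models).

```python
# Toy sketch of dual-transformation-style semi-supervised training for
# two-staged SLP. All names, shapes, and models are hypothetical placeholders.
import torch
import torch.nn as nn

VOCAB_TEXT, VOCAB_GLOSS, POSE_DIM, HID = 200, 100, 150, 64

class T2G(nn.Module):
    """Text -> gloss: per-step gloss classifier over a recurrent encoding."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_TEXT, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB_GLOSS)
    def forward(self, text_ids):
        h, _ = self.rnn(self.emb(text_ids))
        return self.out(h)                      # (B, T, VOCAB_GLOSS)

class G2P(nn.Module):
    """Gloss -> pose: per-step regression to skeleton keypoint coordinates."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_GLOSS, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, POSE_DIM)
    def forward(self, gloss_ids):
        h, _ = self.rnn(self.emb(gloss_ids))
        return self.out(h)                      # (B, T, POSE_DIM)

t2g, g2p = T2G(), G2P()
opt = torch.optim.Adam(list(t2g.parameters()) + list(g2p.parameters()), lr=1e-3)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

def train_step(labeled, unlabeled):
    """One joint update: supervised T2G and G2P losses on gloss-annotated
    triples, plus a pseudo-gloss bridge on un-annotated text-pose pairs."""
    text, gloss, pose = labeled
    loss = ce(t2g(text).flatten(0, 1), gloss.flatten()) + mse(g2p(gloss), pose)

    # Dual transformation: T2G produces pseudo glosses for unlabeled pairs,
    # which then act as inputs for training G2P against the real poses.
    u_text, u_pose = unlabeled
    with torch.no_grad():
        pseudo_gloss = t2g(u_text).argmax(-1)
    loss = loss + mse(g2p(pseudo_gloss), u_pose)

    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy batch: 2 sequences of length 5 (text/gloss lengths equal only for brevity).
labeled = (torch.randint(0, VOCAB_TEXT, (2, 5)),
           torch.randint(0, VOCAB_GLOSS, (2, 5)),
           torch.randn(2, 5, POSE_DIM))
unlabeled = (torch.randint(0, VOCAB_TEXT, (2, 5)), torch.randn(2, 5, POSE_DIM))
print(train_step(labeled, unlabeled))
```

The reverse direction, exploiting monolingual gloss data through the G2P side, would follow the same pseudo-labeling pattern.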

Supplementary Material

MP4 File (MM22-fp0857.mp4)
This video presents our paper "DualSign: Semi-Supervised Sign Language Production with Balanced Multi-Modal Multi-Task Dual Transformation". In the paper, we propose to break the parallel data bottleneck caused by the requirement of gloss annotations in two-staged SLP systems, an issue that has rarely been investigated in SLP. Our DualSign framework, trained with a novel balanced multi-modal multi-task dual transformation method, fully exploits partially gloss-annotated text-pose pairs and monolingual gloss data to improve both sub-tasks of two-staged SLP. The proposed cross-modal balancing loss further boosts the system's overall performance by alleviating the modality imbalance problem. Extensive experiments demonstrate the effectiveness of our DualSign framework.
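
The abstract also mentions deriving the final uni-modal T2G model from the well-trained MM-T2G via knowledge distillation. The sketch below illustrates one standard soft-label distillation setup (temperature-scaled KL divergence mixed with hard-label cross-entropy, following Hinton et al.) from a multi-modal teacher to a text-only student; the modules, shapes, temperature, and loss weighting are illustrative assumptions rather than the paper's exact configuration.

```python
# Toy sketch: distilling a text-only T2G student from a multi-modal teacher
# with standard soft-label knowledge distillation. Hypothetical shapes/modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_TEXT, VOCAB_GLOSS, POSE_DIM, HID, TAU = 200, 100, 150, 64, 2.0

class MultiModalT2G(nn.Module):
    """Teacher: fuses text embeddings with projected pose features."""
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(VOCAB_TEXT, HID)
        self.pose_proj = nn.Linear(POSE_DIM, HID)
        self.out = nn.Linear(HID, VOCAB_GLOSS)
    def forward(self, text_ids, pose):
        return self.out(self.text_emb(text_ids) + self.pose_proj(pose))

class UniModalT2G(nn.Module):
    """Student: text only, so it needs no pose input at inference time."""
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(VOCAB_TEXT, HID)
        self.out = nn.Linear(HID, VOCAB_GLOSS)
    def forward(self, text_ids):
        return self.out(self.text_emb(text_ids))

teacher, student = MultiModalT2G().eval(), UniModalT2G()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def distill_step(text_ids, pose, gloss_ids, alpha=0.5):
    """Mix temperature-scaled KL against teacher soft labels with the usual
    hard-label cross-entropy on the reference glosses."""
    with torch.no_grad():
        t_logits = teacher(text_ids, pose)
    s_logits = student(text_ids)
    kd = F.kl_div(F.log_softmax(s_logits / TAU, dim=-1),
                  F.softmax(t_logits / TAU, dim=-1),
                  reduction="batchmean") * TAU ** 2
    ce = F.cross_entropy(s_logits.flatten(0, 1), gloss_ids.flatten())
    loss = alpha * kd + (1 - alpha) * ce
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

text = torch.randint(0, VOCAB_TEXT, (2, 5))
pose = torch.randn(2, 5, POSE_DIM)
gloss = torch.randint(0, VOCAB_GLOSS, (2, 5))
print(distill_step(text, pose, gloss))
```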

Cited By

  • (2024) Semantic-driven diffusion for sign language production with gloss-pose latent spaces alignment. Computer Vision and Image Understanding 246, 104050. DOI: 10.1016/j.cviu.2024.104050. Online publication date: Sep 2024.
  • (2024) From rule-based models to deep learning transformers architectures for natural language processing and sign language translation systems: survey, taxonomy and performance evaluation. Artificial Intelligence Review 57(10). DOI: 10.1007/s10462-024-10895-z. Online publication date: 29 Aug 2024.
  • (2024) A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars. Computer Vision – ECCV 2024, 36-54. DOI: 10.1007/978-3-031-72967-6_3. Online publication date: 3 Nov 2024.

      Published In

      MM '22: Proceedings of the 30th ACM International Conference on Multimedia
      October 2022
      7537 pages
      ISBN:9781450392037
      DOI:10.1145/3503161

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. multi-modal learning
      2. sign language production
      3. task duality

      Conference

      MM '22

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
