Abstract
Non-autoregressive machine translation aims to speed up decoding by discarding the autoregressive model and generating all target words independently. Because non-autoregressive models cannot exploit target-side information, the ability to model source representations accurately becomes critical. In this paper, we propose an approach that enhances the encoder’s modeling ability by using a pre-trained BERT model as an extra encoder. Because the two encoders rely on different tokenization methods, the BERT encoder and the Raw encoder model the source input from different perspectives. Furthermore, through a gate mechanism, the decoder can dynamically determine which representations contribute to the decoding process. Experimental results on three translation tasks show that our method significantly improves the performance of non-autoregressive MT and surpasses the baseline non-autoregressive models. On the WMT14 EN\(\rightarrow\)DE translation task, our method achieves 27.87 BLEU with a single decoding step, comparable to the baseline autoregressive Transformer, which obtains 27.8 BLEU.
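To make the gate mechanism concrete, the sketch below shows one plausible form of gated fusion between the two encoder outputs, written in PyTorch. It is an illustrative reconstruction rather than the authors’ implementation: the module name `GatedFusion`, the single-linear-layer gate, and the assumption that the two representations are already length-aligned are ours.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Per-position gated mix of two source representations (illustrative).

    Computes g = sigmoid(W [h_bert; h_raw]) and returns
    g * h_bert + (1 - g) * h_raw, so the decoder side can weight
    each encoder's contribution dynamically.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(2 * d_model, d_model)

    def forward(self, h_bert: torch.Tensor, h_raw: torch.Tensor) -> torch.Tensor:
        # Both tensors: (batch, src_len, d_model). In the paper the two
        # encoders tokenize differently, so their outputs would first have
        # to be aligned or attended to separately; here we assume equal
        # lengths for simplicity.
        g = torch.sigmoid(self.gate_proj(torch.cat([h_bert, h_raw], dim=-1)))
        return g * h_bert + (1.0 - g) * h_raw

# Example: fuse = GatedFusion(512); h = fuse(bert_out, raw_out)
```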
Notes
Levenshtein Transformer consists of three decoders whose parameters are shared. During inference, the first decoder decides which words should be deleted from the input target sentence; the second decoder predicts the number of tokens to be inserted at every consecutive position pair and places placeholders at the corresponding positions; finally, the third decoder replaces the placeholders with predicted tokens.
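Schematically, one refinement iteration of this three-decoder procedure can be sketched in Python as follows. The `predict_deletions`, `predict_insertions`, and `fill_placeholders` functions are hypothetical stand-ins for the three shared-parameter decoders, implemented here as trivial dummies so the control flow is runnable; they are not the actual Levenshtein Transformer API.

```python
# Sketch of one Levenshtein Transformer refinement iteration.
# The three predict/fill functions are dummy placeholders, not the model.

PLH = "<plh>"  # placeholder token

def predict_deletions(tokens, src):
    return [True] * len(tokens)              # dummy: keep every token

def predict_insertions(tokens, src):
    return [0] * (len(tokens) + 1)           # dummy: insert nothing in any gap

def fill_placeholders(tokens, src):
    return [t for t in tokens if t != PLH]   # dummy: the real decoder predicts words

def refine(tokens, src):
    # 1) Deletion: the first decoder marks each token as keep or delete.
    keep = predict_deletions(tokens, src)
    tokens = [t for t, k in zip(tokens, keep) if k]

    # 2) Insertion: the second decoder predicts, for every consecutive
    #    position pair, how many placeholder tokens to insert in that gap.
    counts = predict_insertions(tokens, src)
    expanded = []
    for i, n in enumerate(counts):
        expanded.extend([PLH] * n)
        if i < len(tokens):
            expanded.append(tokens[i])

    # 3) Token prediction: the third decoder fills every placeholder.
    return fill_placeholders(expanded, src)

print(refine(["ein", "Haus"], src="a house"))  # -> ['ein', 'Haus']
```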
References
Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473
Bastings J, Titov I, Aziz W, Marcheggiani D, Sima’an K (2017) Graph convolutional encoders for syntax-aware neural machine translation. arXiv preprint arXiv:1704.04675
Chan W, Kitaev N, Guu K, Stern M, Uszkoreit J (2019) KERMIT: generative insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604
Clinchant S, Jung KW, Nikoulina V (2019) On the use of BERT for neural machine translation. arXiv preprint arXiv:1909.12744
Dai AM, Le QV (2015) Semi-supervised sequence learning. Advances in neural information processing systems. Montréal, Canada, pp 3079–3087
Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Gehring J, Auli M, Grangier D, Yarats D, Dauphin YN (2017) Convolutional sequence to sequence learning. In: Proceedings of the 34th international conference on machine learning, vol 70, pp 1243–1252, Sydney, Australia
Ghazvininejad M, Levy O, Liu Y, Zettlemoyer L (2019) Mask-predict: parallel decoding of conditional masked language models. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 6114–6123, Hong Kong, China
Ghazvininejad M, Karpukhin V, Zettlemoyer L, Levy O (2020a) Aligned cross entropy for non-autoregressive machine translation. arXiv preprint arXiv:2004.01655
Ghazvininejad M, Karpukhin V, Zettlemoyer L, Levy O (2020b) Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785
Gu J, Bradbury J, Xiong C, Li VO, Socher R (2017) Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281
Gu J, Wang C, Zhao J (2019) Levenshtein transformer. Advances in neural information processing systems. Vancouver, BC, Canada, pp 11179–11189
Guo J, Tan X, He D, Qin T, Xu L, Liu T-Y (2019) Non-autoregressive neural machine translation with enhanced decoder input. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 3723–3730, Honolulu, Hawaii, USA
Imamura K, Sumita E (2019) Recycling a pre-trained BERT encoder for neural machine translation. In: Proceedings of the 3rd workshop on neural generation and translation, pp 23–31, Hong Kong, China
Kaiser Ł, Roy A, Vaswani A, Parmar N, Bengio S, Uszkoreit J, Shazeer N (2018) Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382
Kim Y, Rush AM (2016) Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R et al (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pp 177–180, Prague, Czech Republic
Lee J, Mansimov E, Cho K (2018) Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901
Li Z, Lin Z, He D, Tian F, Qin T, Wang L, Liu T-Y (2019) Hint-based training for non-autoregressive machine translation. arXiv preprint arXiv:1909.06708
Libovickỳ J, Helcl J (2018) End-to-end non-autoregressive neural machine translation with connectionist temporal classification. arXiv preprint arXiv:1811.04719
Ma X, Zhou C, Li X, Neubig G, Hovy E (2019) FlowSeq: non-autoregressive conditional sequence generation with generative flow. arXiv preprint arXiv:1909.02480
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems. Harrahs and Harveys, Lake Tahoe, pp 3111–3119
Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318, Philadelphia, Pennsylvania, USA
Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I (2019) Language models are unsupervised multitask learners. OpenAI Blog 1(8):9
Saharia C, Chan W, Saxena S, Norouzi M (2020) Non-autoregressive machine translation with latent alignments. arXiv preprint arXiv:2004.07437
Sennrich R, Haddow B, Birch A (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909
Shao C, Feng Y, Zhang J, Meng F, Chen X, Zhou J (2019) Retrieving sequential information for non-autoregressive neural machine translation. arXiv preprint arXiv:1906.09444
Shao C, Zhang J, Feng Y, Meng F, Zhou J (2020) Minimizing the bag-of-ngrams difference for non-autoregressive neural machine translation. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 198–205, New York, USA
Shu R, Lee J, Nakayama H, Cho K (2019) Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior. arXiv preprint arXiv:1908.07181
Sun Z, Li Z, Wang H, He D, Lin Z, Deng Z (2019) Fast structured decoding for sequence models. Advances in neural information processing systems. Vancouver, BC, Canada, pp 3011–3020
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Advances in neural information processing systems. Long Beach Convention Center, Long Beach, pp 5998–6008
Wang Y, Tian F, He D, Qin T, Zhai C, Liu T-Y (2019) Non-autoregressive machine translation with auxiliary regularization. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 5377–5384, Honolulu, Hawaii, USA
Wei X, Hu Y, Xing L (2019) Gated self-attentive encoder for neural machine translation. In: International conference on knowledge science, engineering and management, pp 655–666, Athens, Greece. Springer
Xiao F, Li J, Zhao H, Wang R, Chen K (2019) Lattice-based transformer encoder for neural machine translation. arXiv preprint arXiv:1906.01282
Yang J, Wang M, Zhou H, Zhao C, Yu Y, Zhang W, Li L (2019a) Towards making the most of BERT in neural machine translation. arXiv preprint arXiv:1908.05672
Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV (2019b) XLNet: generalized autoregressive pretraining for language understanding. Advances in neural information processing systems. Vancouver, BC, Canada, pp 5754–5764
Zhou C, Neubig G, Gu J (2019) Understanding knowledge distillation in non-autoregressive machine translation. arXiv preprint arXiv:1911.02727
Zhou J, Keung P (2020) Improving non-autoregressive neural machine translation with monolingual data. arXiv preprint arXiv:2005.00932
Zhu J, Xia Y, Wu L, He D, Qin T, Zhou W, Li H, Liu T-Y (2020) Incorporating BERT into neural machine translation. arXiv preprint arXiv:2002.06823
Acknowledgements
We thank the reviewers for their careful reviews and constructive comments. We also thank Prof. Andy Way for his linguistic assistance and careful proofreading during the revision of this paper. This work is supported by the National Natural Science Foundation of China (Nos. 61732005, 61671064).
Cite this article
Wang, S., Shi, S. & Huang, H. Enhanced encoder for non-autoregressive machine translation. Machine Translation 35, 595–609 (2021). https://doi.org/10.1007/s10590-021-09285-x