
More Data Is Better Only to Some Level, After Which It Is Harmful: Profiling Neural Machine Translation Self-learning with Back-Translation

  • Conference paper
Progress in Artificial Intelligence (EPIA 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12981)


Abstract

Neural machine translation needs a very large volume of data to unfold its potential. Self-learning with back-translation has become widely adopted to address this data scarcity bottleneck: a seed system is used to translate monolingual source sentences, which are paired with the resulting output sentences to form a synthetic data set that, when used to retrain the system, improves its translation performance. In this paper we report on the profiling of self-learning with back-translation, aiming to clarify whether adding more synthetic data always leads to an increase in performance. The experiments undertaken provide evidence that more synthetic data is better only up to some level, after which it is harmful, as translation quality decays.
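
For concreteness, the self-learning loop described in the abstract can be sketched as follows. This is a minimal illustration only: train() and translate() are hypothetical placeholders for a real NMT toolkit, and none of the identifiers below come from the paper itself.

    def self_learning_round(seed_corpus, monolingual_sentences, train, translate):
        """One round of self-learning with back-translation (a sketch)."""
        # Train a seed system on the available parallel data.
        seed_model = train(seed_corpus)

        # Translate the monolingual sentences with the seed system and
        # align each sentence with its machine-translated output.
        synthetic_corpus = [
            (sentence, translate(seed_model, sentence))
            for sentence in monolingual_sentences
        ]

        # Retrain on the seed data plus the synthetic data. The paper's
        # finding: enlarging synthetic_corpus helps only up to some level,
        # beyond which translation quality decays.
        return train(seed_corpus + synthetic_corpus)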


Notes

  1. This model is trained on a total of 17 million sentences: 5 million from the seed corpus and 12 million from the back-translated corpus. We use this notation throughout this work: the first point in each plot represents the seed system, and the subsequent points represent systems trained on the seed corpus combined with the synthetic corpora.

  2. https://github.com/alvations/sacremoses.

  3. https://github.com/fxsjy/jieba.

  4. https://github.com/rsennrich/subword-nmt.

  5. https://www.statmt.org/moses/.
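
The tools in notes 2 to 4 suggest a preprocessing pipeline along the following lines. This is a hedged sketch: the language pair (Portuguese on the Moses-tokenized side, Chinese on the jieba side), the BPE vocabulary size, and the file name codes.pt are illustrative assumptions, not details given in the paper.

    from sacremoses import MosesTokenizer    # note 2
    import jieba                             # note 3
    from subword_nmt.apply_bpe import BPE    # note 4

    pt_tokenizer = MosesTokenizer(lang="pt")

    def preprocess_pt(line):
        # Moses-style tokenization for the Portuguese side (assumed pair).
        return pt_tokenizer.tokenize(line.strip(), return_str=True)

    def preprocess_zh(line):
        # jieba word segmentation for the Chinese side.
        return " ".join(jieba.cut(line.strip()))

    # BPE codes would first be learned on the tokenized training data, e.g.
    #   subword-nmt learn-bpe -s 32000 < train.tok.pt > codes.pt
    with open("codes.pt", encoding="utf-8") as codes:  # hypothetical file name
        bpe = BPE(codes)

    def to_subwords(tokenized_line):
        # Split tokens into subword units so rare words remain translatable.
        return bpe.process_line(tokenized_line)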



Acknowledgement

This research was partially supported by PORTULAN CLARIN–Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT–Fundação para a Ciência e Tecnologia under the grant PINFRA/22117/2016.

Author information

Corresponding author

Correspondence to Rodrigo Santos.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Santos, R., Silva, J., Branco, A. (2021). More Data Is Better Only to Some Level, After Which It Is Harmful: Profiling Neural Machine Translation Self-learning with Back-Translation. In: Marreiros, G., Melo, F.S., Lau, N., Lopes Cardoso, H., Reis, L.P. (eds) Progress in Artificial Intelligence. EPIA 2021. Lecture Notes in Computer Science (LNAI), vol 12981. Springer, Cham. https://doi.org/10.1007/978-3-030-86230-5_57


  • DOI: https://doi.org/10.1007/978-3-030-86230-5_57

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86229-9

  • Online ISBN: 978-3-030-86230-5
