Monolingual Denoising with Large Language Models for Low-Resource Machine Translation

  • Conference paper
Natural Language Processing and Chinese Computing (NLPCC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14302)

Abstract

Low-resource machine translation suffers from the sparsity of bilingual data. Self-training-based bilingual data augmentation is potentially useful for overcoming this issue. However, the resulting pseudo-parallel data contains various kinds of noise in the target language, including grammatical errors, abnormal word sequences, misspellings, and mistranslations. This noise inevitably disturbs training. In this paper, we propose to refine the pseudo-parallel data using monolingual denoising. Specifically, we fine-tune the mBART model on low-resource parallel data and identify noisy samples by self-inspection during the self-training process. On this basis, we leverage large language models (LLMs), e.g., ChatGPT, to fix possible errors in the target-language side of the noisy samples using manually edited prompts. This yields refined pseudo-parallel data, which we use to augment the training set and retrain the mBART model. We conduct experiments on benchmark low-resource English-oriented translation corpora in OPUS-100 with different source languages, namely Georgian (Ka), Urdu (Ur), and Slovenian (Sl). Experimental results show that our method achieves substantial improvements, raising translation performance to chrF++ scores of 36.8%, 43.5%, and 47.5%.
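The refinement loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not specify how self-inspection scores a sample or how the LLM is prompted, so `translate`, `noise_score`, `fix_target`, and `threshold` are hypothetical placeholders that a reader would back with a fine-tuned mBART model and an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PseudoPair:
    source: str             # sentence in the low-resource source language
    target: str             # model-generated English side (possibly repaired)
    repaired: bool = False  # True if the target was passed through the fixer


def refine_pseudo_parallel(
    sources: List[str],
    translate: Callable[[str], str],           # hypothetical: fine-tuned MT model
    noise_score: Callable[[str, str], float],  # hypothetical: self-inspection score
    fix_target: Callable[[str], str],          # hypothetical: LLM-based corrector
    threshold: float,                          # hypothetical cut-off for "noisy"
) -> List[PseudoPair]:
    """Generate pseudo-targets, then repair only those flagged as noisy."""
    refined = []
    for src in sources:
        tgt = translate(src)  # self-training: the model labels monolingual data
        if noise_score(src, tgt) > threshold:
            # Monolingual denoising: the fixer sees the target side only.
            refined.append(PseudoPair(src, fix_target(tgt), repaired=True))
        else:
            refined.append(PseudoPair(src, tgt))
    return refined
```

The returned pairs would then be merged with the original parallel data before retraining. Passing the three components in as callables keeps the sketch independent of any particular model or API.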

Notes

  1. https://github.com/XDeepAzure/nmt-corrector-src.

  2. https://opus.nlpl.eu/opus-100.php.

  3. https://github.com/mjpost/sacrebleu.

  4. https://github.com/lucadiliello/bleurt-pytorch. We use the recommended checkpoint lucadiliello/BLEURT-20 as described in the paper.

  5. https://chat.openai.com.

  6. https://huggingface.co/.

  7. https://huggingface.co/pszemraj/flan-t5-large-grammar-synthesis.

  8. https://huggingface.co/vennify/t5-base-grammar-correction.

Acknowledgements

This research is supported by the National Key R&D Program of China (2020YFB1313601) and the National Science Foundation of China (62076174, 61836007).

Author information

Corresponding author

Correspondence to Yu Hong.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, H., Wang, X., Xing, X., Hong, Y. (2023). Monolingual Denoising with Large Language Models for Low-Resource Machine Translation. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14302. Springer, Cham. https://doi.org/10.1007/978-3-031-44693-1_33

  • DOI: https://doi.org/10.1007/978-3-031-44693-1_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44692-4

  • Online ISBN: 978-3-031-44693-1

  • eBook Packages: Computer Science, Computer Science (R0)
