Abstract
Low-resource machine translation suffers from the sparsity of bilingual data. Self-training-based bilingual data augmentation can potentially alleviate this problem. However, the resulting pseudo-parallel data contains various kinds of noise in the target language, including grammatical errors, abnormal word sequences, misspellings, and mistranslations. This noise inevitably disrupts training. In this paper, we propose to refine the pseudo-parallel data using monolingual denoising. Specifically, we fine-tune the mBART model on low-resource parallel data and identify noisy samples by self-inspection during the self-training process. On this basis, we leverage large language models, e.g., ChatGPT, with manually edited prompts to fix possible errors in the target side of the noisy samples, which yields refined pseudo-parallel data. We then use the refined data to augment the training set and retrain the mBART model. We conduct experiments on benchmark low-resource English-oriented translation corpora from OPUS-100 with different source languages, namely Georgian (Ka), Urdu (Ur), and Slovenian (Sl). Experimental results show that our method achieves substantial improvements, reaching chrF++ scores of 36.8%, 43.5%, and 47.5%.
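The refinement loop described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the names `Pair`, `PROMPT_TEMPLATE`, and `refine_pseudo_parallel`, as well as the callables `translate` (standing in for the fine-tuned mBART model), `is_noisy` (standing in for self-inspection), and `llm_complete` (standing in for a ChatGPT-style model), are all hypothetical, and the prompt wording is an assumption since the manually edited prompts are not given here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical prompt; the actual manually edited prompts are not shown in the abstract.
PROMPT_TEMPLATE = (
    "Fix any grammatical errors, misspellings, or unnatural word order in the "
    "following English sentence. Return only the corrected sentence.\n\n{target}"
)


@dataclass
class Pair:
    source: str            # source-language sentence
    target: str            # target-language (English) side, possibly refined
    refined: bool = False  # True if the target was rewritten by the LLM


def refine_pseudo_parallel(
    sources: Iterable[str],
    translate: Callable[[str], str],       # assumed: wraps the fine-tuned MT model
    is_noisy: Callable[[str, str], bool],  # assumed: self-inspection criterion
    llm_complete: Callable[[str], str],    # assumed: wraps an LLM text endpoint
) -> List[Pair]:
    """Build pseudo-parallel pairs and denoise the noisy target sentences."""
    pairs: List[Pair] = []
    for src in sources:
        tgt = translate(src)  # self-training: the model labels its own data
        if is_noisy(src, tgt):
            # Monolingual denoising: only the target sentence is sent to the LLM.
            fixed = llm_complete(PROMPT_TEMPLATE.format(target=tgt)).strip()
            if fixed:  # keep the original target if the LLM returns nothing
                pairs.append(Pair(src, fixed, refined=True))
                continue
        pairs.append(Pair(src, tgt))
    return pairs
```

The returned pairs would then be combined with the original low-resource parallel data before retraining. Passing the three components in as callables keeps the sketch independent of any particular model or API.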
Notes
- 4. https://github.com/lucadiliello/bleurt-pytorch. We use the recommended checkpoint lucadiliello/BLEURT-20 as described in the paper.
Acknowledgements
This research is supported by the National Key R&D Program of China (2020YFB1313601) and the National Science Foundation of China (62076174, 61836007).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, H., Wang, X., Xing, X., Hong, Y. (2023). Monolingual Denoising with Large Language Models for Low-Resource Machine Translation. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14302. Springer, Cham. https://doi.org/10.1007/978-3-031-44693-1_33
DOI: https://doi.org/10.1007/978-3-031-44693-1_33
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44692-4
Online ISBN: 978-3-031-44693-1
eBook Packages: Computer Science, Computer Science (R0)