Monolingual Denoising with Large Language Models for Low-Resource Machine Translation

Xu, Haoyu; Wang, Xing; Xing, Xiaolin; Hong, Yu

doi:10.1007/978-3-031-44693-1_33

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14302)

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

Low-resource machine translation suffers from the sparsity of bilingual data. Self-training based bilingual data augmentation is a potentially useful remedy. However, the resulting pseudo-parallel data contains various kinds of noise in the target language, including grammatical errors, abnormal word sequences, misspellings, and mistranslations. This noise unavoidably distracts the model during training. In this paper, we propose to refine the pseudo-parallel data using monolingual denoising. Specifically, we fine-tune the mBART model on low-resource parallel data and identify noisy samples by self-inspection during the self-training process. On this basis, we leverage large language models, e.g., ChatGPT, to fix the possible errors in the target language of noisy samples using manually edited prompts. This yields refined pseudo-parallel data, which we use to augment the training data and retrain the mBART model. We conduct experiments on benchmark low-resource English-oriented translation corpora in OPUS-100 with different source languages, including Georgian (Ka), Urdu (Ur), and Slovenian (Sl). Experimental results show that our method achieves substantial improvements, reaching chrF++ scores of 36.8%, 43.5%, and 47.5%.
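
The refinement loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names PseudoPair and refine_pseudo_parallel and the three toy functions are hypothetical, and the toy functions only stand in for the fine-tuned mBART model, the self-inspection step, and the prompted large language model. The abstract does not specify how self-inspection scores a sample, so the repeated-word check below is purely an assumption chosen to keep the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PseudoPair:
    """One pseudo-parallel sample (hypothetical container)."""
    source: str
    target: str
    refined: bool = False  # True if the target side was denoised


def refine_pseudo_parallel(
    sources: List[str],
    translate: Callable[[str], str],
    is_noisy: Callable[[str, str], bool],
    denoise: Callable[[str], str],
) -> List[PseudoPair]:
    """Self-train on monolingual sources, then denoise flagged targets.

    translate: stand-in for the fine-tuned translation model.
    is_noisy:  stand-in for self-inspection (criterion is an assumption).
    denoise:   stand-in for the prompted LLM; it sees only the target
               side, which is what makes the denoising monolingual.
    """
    pairs = []
    for src in sources:
        tgt = translate(src)
        if is_noisy(src, tgt):
            pairs.append(PseudoPair(src, denoise(tgt), refined=True))
        else:
            pairs.append(PseudoPair(src, tgt))
    return pairs


# Toy stand-ins so the sketch runs without any model or API access.
def toy_translate(src: str) -> str:
    table = {"a": "the cat sat sat on the mat", "b": "hello world"}
    return table[src]


def toy_is_noisy(src: str, tgt: str) -> bool:
    # Assumption: flag an immediately repeated word as an
    # "abnormal word sequence".
    words = tgt.split()
    return any(x == y for x, y in zip(words, words[1:]))


def toy_denoise(tgt: str) -> str:
    # Collapse immediately repeated words.
    out: List[str] = []
    for word in tgt.split():
        if not out or out[-1] != word:
            out.append(word)
    return " ".join(out)


pairs = refine_pseudo_parallel(
    ["a", "b"], toy_translate, toy_is_noisy, toy_denoise
)
for pair in pairs:
    print(pair)
```

In a real setting, the refined pairs would be merged with the original parallel data before retraining; that step is omitted here because the abstract gives no details about it.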

Notes

1. https://github.com/XDeepAzure/nmt-corrector-src
2. https://opus.nlpl.eu/opus-100.php
3. https://github.com/mjpost/sacrebleu
4. https://github.com/lucadiliello/bleurt-pytorch. We use the recommended checkpoint lucadiliello/BLEURT-20 as described in the paper.
5. https://chat.openai.com
6. https://huggingface.co/
7. https://huggingface.co/pszemraj/flan-t5-large-grammar-synthesis
8. https://huggingface.co/vennify/t5-base-grammar-correction

Acknowledgements

This research is supported by the National Key R&D Program of China (2020YFB1313601) and the National Science Foundation of China (62076174, 61836007).

Author information

Authors and Affiliations

Soochow University, Computer Science and Technology, Suzhou, China
Haoyu Xu, Xiaolin Xing & Yu Hong
Tencent AI Lab, Shenzhen, China
Xing Wang

Corresponding author

Correspondence to Yu Hong.

Editor information

Editors and Affiliations

Emory University, Atlanta, GA, USA
Fei Liu
Microsoft Research Asia, Beijing, China
Nan Duan
Soochow University, Suzhou, China
Qingting Xu
Soochow University, Suzhou, China
Yu Hong

About this paper

Cite this paper

Xu, H., Wang, X., Xing, X., Hong, Y. (2023). Monolingual Denoising with Large Language Models for Low-Resource Machine Translation. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14302. Springer, Cham. https://doi.org/10.1007/978-3-031-44693-1_33

DOI: https://doi.org/10.1007/978-3-031-44693-1_33
Published: 08 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44692-4
Online ISBN: 978-3-031-44693-1
eBook Packages: Computer Science (R0)
