Simple measures of bridging lexical divergence help unsupervised neural machine translation for low-resource languages

Machine Translation

Abstract

Unsupervised Neural Machine Translation (UNMT) approaches have gained widespread popularity in recent times. Though these approaches show impressive translation performance using only monolingual corpora of the languages involved, they have mostly been tried on high-resource European language pairs, viz. English–French, English–German, etc. In this paper, we explore UNMT for 6 Indic language pairs, viz. Hindi–Bengali, Hindi–Gujarati, Hindi–Marathi, Hindi–Malayalam, Hindi–Tamil, and Hindi–Telugu, which are low-resource language pairs. We additionally perform experiments on 4 European language pairs, viz. English–Czech, English–Estonian, English–Lithuanian, and English–Finnish. We observe that the lexical divergence within these language pairs plays a major role in the success of UNMT. In this context, we explore three approaches, viz. (i) script conversion, (ii) unsupervised bilingual embedding-based initialization to bring the vocabularies of the two languages closer, and (iii) word substitution using a bilingual dictionary. We find that script conversion using a simple rule-based system benefits language pairs that have high cognate overlap but use different scripts. We observe that script conversion combined with word substitution using a dictionary further improves UNMT performance. We use a ground-truth bilingual dictionary in our dictionary word substitution experiments; such dictionaries can also be obtained using unsupervised bilingual embeddings. We empirically demonstrate that minimizing lexical divergence using simple heuristics leads to significant improvements in the BLEU score for both related and distant language pairs.
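
To make approaches (i) and (iii) concrete, the sketch below first converts a sentence into a common script with a simple rule-based mapping and then substitutes words through a bilingual dictionary. This is a minimal illustration, not the authors' released pipeline: it assumes the Indic NLP Library (see footnote 4 in the Notes below) is installed (pip install indic-nlp-library), and the dictionary entry is a toy example rather than the ground-truth dictionary used in the paper.

```python
# Minimal sketch of (i) rule-based script conversion and (iii) bilingual-
# dictionary word substitution. Illustrative only; not the paper's code.
# Depending on the library version, resources may need to be initialized
# first (indicnlp.loader.load() after setting the resources path).
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator


def convert_script(text: str, src_lang: str, tgt_lang: str = "hi") -> str:
    """Deterministic script conversion between Brahmi-derived scripts,
    implemented in the library as a Unicode-offset mapping."""
    return UnicodeIndicTransliterator.transliterate(text, src_lang, tgt_lang)


def substitute_words(sentence: str, bilingual_dict: dict) -> str:
    """Replace each token found in the dictionary with its translation;
    unknown tokens pass through unchanged."""
    return " ".join(bilingual_dict.get(tok, tok) for tok in sentence.split())


# Toy Bengali->Hindi example: unify the script, then substitute known
# words (dictionary keys must be in the converted script).
bengali = "আমি বই পড়ি"                           # "I read books"
in_devanagari = convert_script(bengali, "bn")     # roughly "आमि बइ पड़ि"
toy_dict = {"आमि": "मैं"}                          # single illustrative entry
print(substitute_words(in_devanagari, toy_dict))  # "मैं बइ पड़ि"
```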
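
Approach (ii) relies on aligning independently trained monolingual word embeddings. The sketch below, again a hedged illustration rather than the paper's exact initialization, shows the orthogonal Procrustes step that unsupervised aligners such as vecmap and MUSE iterate; it assumes a seed dictionary of translation pairs is already available (in the fully unsupervised setting, that seed is itself induced from the monolingual spaces).

```python
# Orthogonal Procrustes alignment: find the orthogonal matrix W that
# minimizes ||X @ W - Y||_F for embeddings X, Y of seed translation
# pairs. Closed form: W = U @ Vt, where U, S, Vt = SVD(X.T @ Y).
import numpy as np


def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Toy check: recover a random rotation from 5 seed pairs in 4 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                        # source-side embeddings
W_true = np.linalg.qr(rng.normal(size=(4, 4)))[0]  # a random orthogonal map
Y = X @ W_true                                     # target-side embeddings
W = procrustes(X, Y)
print(np.allclose(X @ W, Y))                       # True: rotation recovered
```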


Notes

  1. http://www.statmt.org/wmt20/translation-task.html.

  2. http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/.

  3. https://github.com/mttravel/Dictionary-based-MT/blob/master/replace_by_dictionary.py.

  4. https://github.com/anoopkunchukuttan/indic_nlp_library.

  5. https://github.com/microsoft/MASS/.


Acknowledgements

The authors acknowledge the IBM Research Cognitive Computing Cluster service for providing resources that have contributed to the research results reported in this paper.

Author information

Corresponding author

Correspondence to Jyotsana Khatri.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Khatri, J., Murthy, R., Banerjee, T. et al. Simple measures of bridging lexical divergence help unsupervised neural machine translation for low-resource languages. Machine Translation 35, 711–744 (2021). https://doi.org/10.1007/s10590-021-09292-y
