Simple measures of bridging lexical divergence help unsupervised neural machine translation for low-resource languages

Machine Translation

Abstract

Unsupervised Neural Machine Translation (UNMT) approaches have gained widespread popularity in recent times. Though these approaches show impressive translation performance using only monolingual corpora of the languages involved, they have mostly been tried on high-resource European language pairs, viz. English–French, English–German, etc. In this paper, we explore UNMT for 6 Indic language pairs, viz. Hindi–Bengali, Hindi–Gujarati, Hindi–Marathi, Hindi–Malayalam, Hindi–Tamil, and Hindi–Telugu, which are low-resource language pairs. We additionally perform experiments on 4 European language pairs, viz. English–Czech, English–Estonian, English–Lithuanian, and English–Finnish. We observe that the lexical divergence within these language pairs plays a major role in the success of UNMT. In this context, we explore three approaches, viz. (i) script conversion, (ii) unsupervised bilingual embedding-based initialization to bring the vocabularies of the two languages closer, and (iii) word substitution using a bilingual dictionary. We find that script conversion using a simple rule-based system benefits language pairs that have high cognate overlap but use different scripts. We observe that script conversion combined with word substitution using a dictionary further improves UNMT performance. We use a ground-truth bilingual dictionary in our dictionary word substitution experiments; such dictionaries can also be obtained using unsupervised bilingual embeddings. We empirically demonstrate that minimizing lexical divergence using simple heuristics leads to significant improvements in the BLEU score for both related and distant language pairs.
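
To make approaches (i) and (iii) concrete, the sketch below first converts a sentence into a common script with a simple rule-based mapping and then substitutes words through a bilingual dictionary. This is a minimal illustration, not the authors' released pipeline: it assumes the Indic NLP Library (see footnote 4 in the Notes below) is installed (pip install indic-nlp-library), and the dictionary entry is a toy example rather than the ground-truth dictionary used in the paper.

```python
# Minimal sketch of (i) rule-based script conversion and (iii) bilingual-
# dictionary word substitution. Illustrative only; not the paper's code.
# Depending on the library version, resources may need to be initialized
# first (indicnlp.loader.load() after setting the resources path).
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator


def convert_script(text: str, src_lang: str, tgt_lang: str = "hi") -> str:
    """Deterministic script conversion between Brahmi-derived scripts,
    implemented in the library as a Unicode-offset mapping."""
    return UnicodeIndicTransliterator.transliterate(text, src_lang, tgt_lang)


def substitute_words(sentence: str, bilingual_dict: dict) -> str:
    """Replace each token found in the dictionary with its translation;
    unknown tokens pass through unchanged."""
    return " ".join(bilingual_dict.get(tok, tok) for tok in sentence.split())


# Toy Bengali->Hindi example: unify the script, then substitute known
# words (dictionary keys must be in the converted script).
bengali = "আমি বই পড়ি"                           # "I read books"
in_devanagari = convert_script(bengali, "bn")     # roughly "आमि बइ पड़ि"
toy_dict = {"आमि": "मैं"}                          # single illustrative entry
print(substitute_words(in_devanagari, toy_dict))  # "मैं बइ पड़ि"
```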
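
Approach (ii) relies on aligning independently trained monolingual word embeddings. The sketch below, again a hedged illustration rather than the paper's exact initialization, shows the orthogonal Procrustes step that unsupervised aligners such as vecmap and MUSE iterate; it assumes a seed dictionary of translation pairs is already available (in the fully unsupervised setting, that seed is itself induced from the monolingual spaces).

```python
# Orthogonal Procrustes alignment: find the orthogonal matrix W that
# minimizes ||X @ W - Y||_F for embeddings X, Y of seed translation
# pairs. Closed form: W = U @ Vt, where U, S, Vt = SVD(X.T @ Y).
import numpy as np


def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Toy check: recover a random rotation from 5 seed pairs in 4 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                        # source-side embeddings
W_true = np.linalg.qr(rng.normal(size=(4, 4)))[0]  # a random orthogonal map
Y = X @ W_true                                     # target-side embeddings
W = procrustes(X, Y)
print(np.allclose(X @ W, Y))                       # True: rotation recovered
```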


Notes

  1. http://www.statmt.org/wmt20/translation-task.html.

  2. http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/.

  3. https://github.com/mttravel/Dictionary-based-MT/blob/master/replace_by_dictionary.py.

  4. https://github.com/anoopkunchukuttan/indic_nlp_library.

  5. https://github.com/microsoft/MASS/.


Acknowledgements

The authors acknowledge the IBM Research Cognitive Computing Cluster service for providing resources that have contributed to the research results reported in this paper.

Author information

Corresponding author

Correspondence to Jyotsana Khatri.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Khatri, J., Murthy, R., Banerjee, T. et al. Simple measures of bridging lexical divergence help unsupervised neural machine translation for low-resource languages. Machine Translation 35, 711–744 (2021). https://doi.org/10.1007/s10590-021-09292-y
