Abstract
Sentence-level parallel data is essential for training machine translation systems. However, existing parallel data is extremely limited for thousands of languages. To increase the parallel data available for a low-resource language, we borrow parallel data from a higher-resource, closely related language (RL). To do so, we propose a method for translating texts from the RL to the low-resource language without requiring any parallel data between them. We use this method to convert RL/English parallel data and use the converted data as an extra resource for machine translation. We show that this extra parallel data substantially improves BLEU scores.
Notes
Tuning with a held-out subset of training data results in lower BLEU scores in all the experiments but does not change the conclusions of this paper.
References
Chalamandaris A, Protopapas A, Tsiakoulis P, Raptis S (2006) All Greek to me! an automatic Greeklish to Greek transliteration system. In: Proceedings of the 5th international conference on language resources and evaluation (LREC’06), Genoa, Italy, pp 1226–1229
Chen Y, Liu Y, Cheng Y, Li V (2017) A teacher-student framework for zero-resource neural machine translation. In: Proceedings of the 55th annual meeting of the association for computational linguistics, vol 1, Long Papers. Vancouver, Canada, pp 1925–1935
Cicekli I (2002) A machine translation system between a pair of closely related languages. In: Proceedings of the 17th international symposium on computer and information sciences (ISCIS 2002), CRC Press, Orlando, Florida, pp 192–196
Conneau A, Lample G, Ranzato M, Denoyer L, Jégou H (2017) Word translation without parallel data. arXiv:1710.04087
Currey A, Karakanta A, Dehdari J (2016) Using related languages to enhance statistical language models. In: Proceedings of the NAACL student research workshop, San Diego, California, pp 116–123
Dou Q, Knight K (2012) Large scale decipherment for out-of-domain machine translation. In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, Jeju Island, Korea, pp 266–275
Firat O, Sankaran B, Al-Onaizan Y, Yarman Vural FT, Cho K (2016) Zero-resource translation with multi-lingual neural machine translation. In: Proceedings of the 2016 conference on empirical methods in natural language processing, Austin, Texas, pp 268–277
Forcada ML, Ginestí-Rosell M, Nordfalk J, O’Regan J, Ortiz-Rojas S, Pérez-Ortiz JA, Sánchez-Martínez F, Ramírez-Sánchez G, Tyers FM (2011) Apertium: a free/open-source platform for rule-based machine translation. Mach Transl 25(2):127–144
Fung P, Yee LY (1998) An IR approach for translating new words from nonparallel, comparable texts. In: Proceedings of the 36th annual meeting of the association for computational linguistics and 17th international conference on computational linguistics, vol 1. Montreal, Quebec, Canada, pp 414–420
Goldhahn D, Eckart T, Quasthoff U (2012) Building large monolingual dictionaries at the Leipzig corpora collection: From 100 to 200 languages. In: Proceedings of the 8th international conference on language resources and evaluation (LREC-2012), Istanbul, Turkey, pp 759–765
Haghighi A, Liang P, Berg-Kirkpatrick T, Klein D (2008) Learning bilingual lexicons from monolingual corpora. In: Proceedings of ACL-08: HLT, Columbus, Ohio, pp 771–779
Hajič J, Hric J, Kuboň V (2000) Machine translation of very close languages. In: Proceedings of the 6th conference on applied natural language processing, Seattle, Washington, USA, pp 7–12
Hana J, Feldman A, Brew C, Amaral L (2006) Tagging Portuguese with a Spanish tagger using cognates. In: Proceedings of the international workshop on cross-language knowledge induction, Sydney, Australia, pp 33–40
Hitham AB, Shaalan K, Ziedan I (2008) A hybrid approach for converting written Egyptian colloquial dialect into diacritized Arabic. In: The 6th international conference on informatics and systems, Cairo, Egypt, pp 27–33
Irvine A (2013) Statistical machine translation in low resource settings. In: Proceedings of the 2013 NAACL HLT student research workshop, Atlanta, Georgia, pp 54–61
Irvine A, Callison-Burch C (2013) Combining bilingual and comparable corpora for low resource machine translation. In: Proceedings of the 8th workshop on statistical machine translation, Sofia, Bulgaria, pp 262–270
Johnson M, Schuster M, Le QV, Krikun M, Wu Y, Chen Z, Thorat N, Viégas F, Wattenberg M, Corrado G et al (2017) Google’s multilingual neural machine translation system: enabling zero-shot translation. Trans Assoc Comput Linguist 5:339–351
Karakanta A, Dehdari J, van Genabith J (2018) Neural machine translation for low-resource languages without parallel corpora. Mach Transl 32(1):1–23
Knight K, Nair A, Rathod N, Yamada K (2006) Unsupervised analysis for decipherment problems. In: Proceedings of the COLING/ACL 2006 main conference poster sessions, Sydney, Australia, pp 499–506
Koehn P, Knight K (2002) Learning a translation lexicon from monolingual corpora. In: Proceedings of the ACL-02 workshop on unsupervised lexical acquisition, Philadelphia, Pennsylvania, USA, pp 9–16
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions, Prague, Czech Republic, pp 177–180
Kondrak G, Marcu D, Knight K (2003) Cognates can improve statistical translation models. In: Companion volume of the proceedings of HLT-NAACL 2003—short papers, Edmonton, Canada, pp 46–48
Lample G, Denoyer L, Ranzato M (2017) Unsupervised machine translation using monolingual corpora only. arXiv:1711.00043
Larasati SD, Kuboň V (2010) A study of Indonesian-to-Malaysian MT system. In: Proceedings of the 4th international MALINDO workshop, Depok, Indonesia, pp 16–22
Liu CH, Silva CC, Wang L, Way A (2018) Pivot machine translation using Chinese as pivot language. In: CWMT 2018: Proceedings of the 14th China workshop on machine translation, Wuyishan, China, pp 1–12
Mann GS, Yarowsky D (2001) Multipath translation lexicon induction via bridge languages. In: Second meeting of the North American chapter of the association for computational linguistics, Pittsburgh, Pennsylvania, USA, 8 pp
May J, Benjira Y, Echihabi A (2014) An Arabizi-English social media statistical machine translation system. In: Proceedings of the 11th conference of the association for machine translation in the Americas, Vancouver, British Columbia, Canada, pp 329–341
Naim I, Riley P, Gildea D (2018) Feature-based decipherment for machine translation. Comput Linguist 44(3):525–546
Nakov P, Ng HT (2009) Improved statistical machine translation for resource-poor languages using related resource-rich languages. In: Proceedings of the 2009 conference on empirical methods in natural language processing, Singapore, pp 1358–1367
Nakov P, Tiedemann J (2012) Combining word-level and character-level models for machine translation between closely-related languages. In: Proceedings of the 50th annual meeting of the association for computational linguistics, vol 2, Short Papers. Jeju Island, Korea, pp 301–305
Nuhn M, Mauser A, Ney H (2012) Deciphering foreign language by combining language models and context vectors. In: Proceedings of the 50th annual meeting of the association for computational linguistics, vol 1, Long Papers. Jeju Island, Korea, pp 156–164
Och FJ (2003) Minimum error rate training in statistical machine translation. In: Proceedings of the 41st annual meeting of the association for computational linguistics, Sapporo, Japan, pp 160–167
Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the ACL-2002 40th annual meeting of the association for computational linguistics, Philadelphia, pp 311–318
Passban P, Liu Q, Way A (2017) Translating low-resource languages by vocabulary adaptation from close counterparts. ACM Trans Asian Low-Resour Lang Inf Process 16(4):29. https://doi.org/10.1145/3099556
Pourdamghani N, Knight K (2017) Deciphering related languages. In: Proceedings of the 2017 conference on empirical methods in natural language processing, Copenhagen, Denmark, pp 2513–2518
Rapp R (1995) Identifying word translations in non-parallel texts. In: Proceedings of the 33rd annual meeting of the association for computational linguistics, Cambridge, Massachusetts, USA, pp 320–322
Ravi S (2013) Scalable decipherment for machine translation via hash sampling. In: Proceedings of the 51st annual meeting of the association for computational linguistics, vol 1, Long Papers. Sofia, Bulgaria, pp 362–371
Ravi S, Knight K (2009) Learning phoneme mappings for transliteration without parallel data. In: Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the association for computational linguistics, Boulder, Colorado, pp 37–45
Ravi S, Knight K (2011a) Bayesian inference for Zodiac and other homophonic ciphers. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, Oregon, USA, pp 239–247
Ravi S, Knight K (2011b) Deciphering foreign language. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, Oregon, USA, pp 12–21
Salloum W, Habash N (2011) Dialectal to standard Arabic paraphrasing to improve Arabic–English statistical machine translation. In: Proceedings of the 1st workshop on algorithms and resources for modelling of dialects and language varieties, Edinburgh, Scotland, pp 10–21
Sawaf H (2010) Arabic dialect handling in hybrid machine translation. In: Proceedings of the 2010 AMTA, 9th conference of the association for machine translation in the Americas. Denver, Colorado, p 8
Scannell KP (2006) Machine translation for closely related language pairs. In: Proceedings of the LREC workshop on strategies for developing machine translation for minority languages, Genoa, Italy, pp 103–109
Smith JR, Quirk C, Toutanova K (2010) Extracting parallel sentences from comparable corpora using document level alignment. In: Proceedings of the Human Language Technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics. Los Angeles, California, pp 403–411
Tiedemann J (2009) Character-based PSMT for closely related languages. In: Proceedings of the 13th conference of the European association for machine translation, Barcelona, Spain, pp 12–19
Tiedemann J (2012) Parallel data, tools and interfaces in OPUS. In: Proceedings of the 8th international conference on language resources and evaluation (LREC-2012), Istanbul, Turkey, pp 2214–2218
Utiyama M, Isahara H (2007) A comparison of pivot methods for phrase-based statistical machine translation. In: Proceedings of the main conference, human language technologies 2007: The conference of the North American chapter of the association for computational linguistics. Rochester, New York, pp 484–491
Vilar D, Peter JT, Ney H (2007) Can we translate letters? In: Proceedings of the 2nd workshop on statistical machine translation, Prague, Czech Republic, pp 33–39
Wu H, Wang H (2007) Pivot language approach for phrase-based statistical machine translation. In: Proceedings of the 45th annual meeting of the association of computational linguistics, Prague, Czech Republic, pp 856–863
Acknowledgements
This work was supported by DARPA Contract HR0011-15-C-0115. The authors would like to thank Marjan Ghazvininejad, Ulf Hermjakob, Jonathan May, and Michael Pust for their comments and suggestions.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work is a significant extension of Pourdamghani and Knight (2017). The cipher model (Sects. 4.1, 5.1, and 6) and the evaluation of RL-to-IL translation accuracy (Sect. 8.1) were first presented in Pourdamghani and Knight (2017). The other sections, including the description of the language models and their training, the idea of converting the parallel data, the methods for combining converted and original parallel data, and the machine translation experiments, are presented in this paper for the first time.
About this article
Cite this article
Pourdamghani, N., Knight, K. Neighbors helping the poor: improving low-resource machine translation using related languages. Machine Translation 33, 239–258 (2019). https://doi.org/10.1007/s10590-019-09236-7
DOI: https://doi.org/10.1007/s10590-019-09236-7