Abstract
This study presents a performance analysis of a state-of-the-art phrase-based statistical machine translation (SMT) system on eight Indian languages. The state of the art in SMT from different Indian languages into English is also discussed briefly. The motivation for this study is to promote the development of SMT and linguistic resources for these language pairs, as current systems are still in their infancy owing to sparse data resources. The EMILLE and crowdsourced parallel corpora are used for the experiments. The study concludes by presenting the performance of a baseline SMT system translating the Indian languages (Bengali, Gujarati, Hindi, Malayalam, Punjabi, Tamil, Telugu and Urdu) into English, with an average accuracy of 10–20% for all the language pairs. As a result of this study, both of these annotated parallel corpora and the SMT system will serve as benchmarks for future approaches to SMT for Hindi → English, Urdu → English, Punjabi → English, Telugu → English, Tamil → English, Gujarati → English, Bengali → English and Malayalam → English.

Acknowledgement
We would like to thank Dr. Nadir Durrani of the University of Edinburgh for his helpful comments and suggestions during the experimentation and for proofreading the write-up, which greatly improved the paper. He also provided examples for inclusion in the text.
Cite this article
Khan Jadoon, N., Anwar, W., Bajwa, U.I. et al. Statistical machine translation of Indian languages: a survey. Neural Comput & Applic 31, 2455–2467 (2019). https://doi.org/10.1007/s00521-017-3206-2
DOI: https://doi.org/10.1007/s00521-017-3206-2