Statistical machine translation of Indian languages: a survey

Khan Jadoon, Nadeem; Anwar, Waqas; Bajwa, Usama Ijaz; Ahmad, Farooq

doi:10.1007/s00521-017-3206-2

Original Article
Published: 17 November 2017

Neural Computing and Applications, volume 31, pages 2455–2467 (2019)

Nadeem Khan Jadoon¹, Waqas Anwar², Usama Ijaz Bajwa² & Farooq Ahmad²


Abstract

This study presents a performance analysis of a state-of-the-art phrase-based statistical machine translation (SMT) system on eight Indian languages. The state of the art in SMT from various Indian languages into English is also discussed briefly. The study aims to promote the development of SMT and linguistic resources for these language pairs, as current systems are still in their infancy owing to sparse data resources. The EMILLE and crowdsourced parallel corpora were used for the experiments. The study concludes by presenting the performance of a baseline SMT system translating from eight Indian languages (Bengali, Gujarati, Hindi, Malayalam, Punjabi, Tamil, Telugu and Urdu) into English, with an average accuracy of 10–20% across all language pairs. As a result of this study, both the annotated parallel corpora and the SMT system will serve as benchmarks for future approaches to SMT for Hindi → English, Urdu → English, Punjabi → English, Telugu → English, Tamil → English, Gujarati → English, Bengali → English and Malayalam → English.



Acknowledgement

We would like to thank Dr. Nadir Durrani of the University of Edinburgh for his helpful comments and suggestions during the experiments and for proofreading the manuscript, which helped us greatly to improve the paper. He also provided examples that are included in the text.

Author information

Authors and Affiliations

Department of Computer Science, COMSATS Institute of Information Technology, Abbottabad, Pakistan
Nadeem Khan Jadoon
Department of Computer Science, COMSATS Institute of Information Technology, Lahore, Pakistan
Waqas Anwar, Usama Ijaz Bajwa & Farooq Ahmad

Corresponding author

Correspondence to Waqas Anwar.

About this article

Cite this article

Khan Jadoon, N., Anwar, W., Bajwa, U.I. et al. Statistical machine translation of Indian languages: a survey. Neural Comput & Applic 31, 2455–2467 (2019). https://doi.org/10.1007/s00521-017-3206-2

Received: 12 July 2015
Accepted: 21 May 2016
Published: 17 November 2017
Issue Date: 01 July 2019
DOI: https://doi.org/10.1007/s00521-017-3206-2
