Abstract
Machine translation (MT) systems have been built with many different techniques to bridge language barriers. These techniques are broadly categorized into approaches such as Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). End-to-end NMT systems significantly outperform SMT in translation quality on many language pairs, especially those with adequate parallel corpora. We report comparative experiments on baseline MT systems between Assamese and other Indo-Aryan languages (in both translation directions), using traditional phrase-based SMT as well as more successful NMT architectures, namely a basic sequence-to-sequence (seq2seq) model with attention, the Transformer, and a finetuned Transformer. Since this is the first such work involving Assamese, we evaluate the results with the most widely used automatic metric, BLEU (BiLingual Evaluation Understudy), as well as other well-known metrics, to explore the performance of the different baseline MT systems. We compare the evaluation scores of the SMT and NMT models across bi-directional language pairs involving Assamese and seven other Indo-Aryan languages (Bangla, Gujarati, Hindi, Marathi, Odia, Sinhalese, and Urdu). The highest BLEU scores are obtained for Assamese to Sinhalese with SMT (35.63) and for Assamese to Bangla with the NMT systems (seq2seq: 50.92, Transformer: 50.01, finetuned Transformer: 50.19). We also try to relate the results to language characteristics, distances, family trees, domains, data sizes, and sentence lengths. We find that the domain is the most important factor affecting the results for the given data domains and sizes. We compare our results with the only existing MT system for Assamese (Bing Translator) and also with pairs involving Hindi.
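To make the BLEU scores quoted in the abstract concrete, the following is a minimal sketch of corpus-level BLEU on a 0 to 100 scale. It is not the scoring setup used in the paper, which this section does not describe. It assumes one reference per hypothesis, pre-tokenized input, uniform weights over 1-grams to 4-grams, and no smoothing. The function names `ngram_counts` and `corpus_bleu` and the placeholder tokens are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis, no smoothing.

    Assumption: inputs are lists of token lists; tokenization is done elsewhere.
    """
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp, n)
            ref_ngrams = ngram_counts(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            totals[n - 1] += sum(hyp_ngrams.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


# Placeholder tokens, not data from the paper.
hyps = [["w1", "w2", "w3", "w4"]]
refs = [["w1", "w2", "w3", "w4", "w5", "w6"]]
print(round(corpus_bleu(hyps, refs), 2))
```

In this usage example every n-gram of the hypothesis appears in the reference, so all precisions are 1 and the score is determined only by the brevity penalty for the shorter output. Published scores are normally computed with a standard tool and a fixed tokenization, so values from this sketch should not be compared directly with the numbers above.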
Index Terms
- Low Resource Neural Machine Translation: Assamese to/from Other Indo-Aryan (Indic) Languages