
Linguistically enhanced word segmentation for better neural machine translation of low resource agglutinative languages

Published in: International Journal of Speech Technology

Abstract

Among the many challenges faced by neural machine translation systems is the lack of standard parallel corpora for many language pairs, and poor translation quality often results from inadequate data. Aggravating this problem further are morphological complexity and agglutination, which lead to unmanageable vocabulary sizes, rare words, and data sparsity. Though this problem has been partly addressed by sub-word algorithms such as BPE, translation systems still lag in their ability to capture the sentence and word structures associated with rich morphologies. This paper addresses these issues by employing linguistically driven sub-word units in NMT systems, further enhanced by additional POS-tag feature inputs. The proposed approach outperforms BPE-driven machine translation models by several BLEU points and also shows better recall as measured by the ROUGE metric. The results are evaluated on a morphologically complex Dravidian language pair, Kannada and Telugu.
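To make the two ideas in the abstract concrete, here is a minimal, hypothetical sketch (not the authors' code): it contrasts a frequency-driven BPE-style split with a morphologically motivated split of an agglutinative word, and shows the common `word|POS` factored-input format for attaching POS-tag features to each sub-word token before it is fed to an NMT encoder. The Kannada example, its segmentations, the tag set, and the separator are all illustrative placeholders.

```python
def to_factored(tokens, pos_tags, sep="|"):
    """Attach a POS-tag factor to each (sub)token: one factor per token."""
    if len(tokens) != len(pos_tags):
        raise ValueError("one POS tag is required per token")
    return " ".join(f"{t}{sep}{p}" for t, p in zip(tokens, pos_tags))

# Illustrative segmentations of Kannada "manegaLalli" ("in the houses"):
bpe_split   = ["mane", "gal", "alli"]    # frequency-driven split; may cross morpheme boundaries
morph_split = ["mane", "gaLu", "alli"]   # stem + plural suffix + locative suffix

# Factored source line as it might be fed to the encoder:
print(to_factored(morph_split, ["NOUN", "SUFF", "SUFF"]))
# mane|NOUN gaLu|SUFF alli|SUFF
```

NMT toolkits that support source-side features use a similar per-token factor notation, though the exact separator character and feature vocabulary handling vary by toolkit.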




Corresponding author

Correspondence to Santwana Chimalamarri.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Chimalamarri, S., Sitaram, D. Linguistically enhanced word segmentation for better neural machine translation of low resource agglutinative languages. Int J Speech Technol 24, 1047–1053 (2021). https://doi.org/10.1007/s10772-021-09865-5

