
Linguistically enhanced word segmentation for better neural machine translation of low resource agglutinative languages

Published in: International Journal of Speech Technology

Abstract

Among the many challenges faced by neural machine translation systems is the lack of standard parallel corpora for many language pairs, and poor translation quality often results from inadequate data. Aggravating this problem further are morphological complexity and agglutination, which lead to unmanageable vocabulary sizes, rare words, and data sparsity. Though this problem has been partly addressed by sub-word algorithms such as BPE, translation systems still lag in their ability to capture the sentence and word structures associated with rich morphologies. This paper addresses these issues by employing linguistically driven sub-word units in NMT systems, further enhanced by additional POS-tag feature inputs. The proposed approach outperforms BPE-driven machine translation models by several BLEU points and also shows better recall as measured by the ROUGE metric. The results are evaluated on a morphologically complex Dravidian language pair, Kannada and Telugu.
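To make the two ideas in the abstract concrete, here is a minimal, hypothetical sketch (not the authors' code): it contrasts a frequency-driven BPE-style split with a morphologically motivated split of an agglutinative word, and shows the common `word|POS` factored-input format for attaching POS-tag features to each sub-word token before it is fed to an NMT encoder. The Kannada example, its segmentations, the tag set, and the separator are all illustrative placeholders.

```python
def to_factored(tokens, pos_tags, sep="|"):
    """Attach a POS-tag factor to each (sub)token: one factor per token."""
    if len(tokens) != len(pos_tags):
        raise ValueError("one POS tag is required per token")
    return " ".join(f"{t}{sep}{p}" for t, p in zip(tokens, pos_tags))

# Illustrative segmentations of Kannada "manegaLalli" ("in the houses"):
bpe_split   = ["mane", "gal", "alli"]    # frequency-driven split; may cross morpheme boundaries
morph_split = ["mane", "gaLu", "alli"]   # stem + plural suffix + locative suffix

# Factored source line as it might be fed to the encoder:
print(to_factored(morph_split, ["NOUN", "SUFF", "SUFF"]))
# mane|NOUN gaLu|SUFF alli|SUFF
```

NMT toolkits that support source-side features use a similar per-token factor notation, though the exact separator character and feature vocabulary handling vary by toolkit.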




Corresponding author

Correspondence to Santwana Chimalamarri.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Chimalamarri, S., Sitaram, D. Linguistically enhanced word segmentation for better neural machine translation of low resource agglutinative languages. Int J Speech Technol 24, 1047–1053 (2021). https://doi.org/10.1007/s10772-021-09865-5

