
Augmenting training data with syntactic phrasal-segments in low-resource neural machine translation

Published in: Machine Translation

Abstract

Neural machine translation (NMT) has emerged as the preferred alternative to the previously mainstream statistical machine translation (SMT) approaches, largely due to its ability to produce better translations. NMT training is often characterized as data hungry, since it generally requires large amounts of training data, in the order of a few million parallel sentences. This is a bottleneck for under-resourced languages, for which such resources are not available. Researchers in machine translation (MT) have tried to alleviate this data sparsity by augmenting the training data using different strategies. In this paper, we propose a generalized, linguistically motivated data augmentation approach for NMT aimed at low-resource translation. The proposed method generates source-target phrasal segments from an authentic parallel corpus, where the target-side segments are linguistic phrases extracted from the syntactic parse trees of the target-side sentences. We augment the authentic training corpus with these parser-generated phrasal segments and investigate the efficacy of the proposed strategy in low-resource scenarios. To this end, we carried out experiments on three resource-poor language pairs, viz. Hindi-to-English, Malayalam-to-English and Telugu-to-English, with three state-of-the-art NMT paradigms: the attention-based recurrent neural network (Bahdanau et al. 2015), the Google Transformer (Vaswani et al. 2017) and the convolutional sequence-to-sequence model (Gehring et al. 2017). The MT systems built on training data prepared with our data augmentation strategy significantly surpassed the corresponding state-of-the-art NMT systems, by large margins, in all three translation tasks. Further, we tested our approach in combination with back-translation (Sennrich et al. 2016a) and found the two to be complementary: the joint approach performed best in our low-resource experimental settings.
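The core extraction step described in the abstract can be sketched in a few lines. The following is a minimal illustrative sketch, not the authors' implementation: it assumes a Penn-style bracketed parse of the target sentence (e.g. from the Stanford parser referenced in the notes), GIZA++-style word alignments given as (source index, target index) pairs, and an assumed constituent label set of NP/VP/PP; all function names and the toy example are hypothetical.

```python
# Illustrative sketch: extract target-side constituents from a bracketed
# parse and project them onto the source sentence via word alignments,
# yielding source-target phrasal segments for data augmentation.

def constituents(tree_str, labels=("NP", "VP", "PP")):
    """Return (label, start, end) target-word spans (end exclusive) for the
    selected constituent labels of a Penn-style bracketed parse."""
    spans, stack, word_idx = [], [], 0
    tokens = tree_str.replace("(", " ( ").replace(")", " ) ").split()
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            stack.append((tokens[i + 1], word_idx))  # (label, span start)
            i += 2
        elif tokens[i] == ")":
            label, start = stack.pop()
            if label in labels and word_idx > start:
                spans.append((label, start, word_idx))
            i += 1
        else:
            word_idx += 1  # leaf token, i.e. a target word
            i += 1
    return spans

def phrase_pairs(src_words, tgt_words, alignment, tree_str):
    """Pair each target constituent with the minimal source span covering
    its aligned words; keep the pair only if that source span aligns back
    entirely inside the target span (the usual consistency check)."""
    pairs = []
    for _label, t0, t1 in constituents(tree_str):
        src_idx = sorted({s for s, t in alignment if t0 <= t < t1})
        if not src_idx:
            continue
        s0, s1 = src_idx[0], src_idx[-1] + 1
        if any(s0 <= s < s1 and not (t0 <= t < t1) for s, t in alignment):
            continue  # inconsistent with the alignment; discard
        pairs.append((" ".join(src_words[s0:s1]), " ".join(tgt_words[t0:t1])))
    return pairs

# Toy Hindi-to-English example (transliterated source, invented alignment):
src = "billi chatai par baithi".split()
tgt = "the cat sat on the mat".split()
tree = ("(S (NP (DT the) (NN cat)) "
        "(VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))")
align = [(0, 1), (1, 5), (2, 3), (3, 2)]  # (source index, target index)
pairs = phrase_pairs(src, tgt, align, tree)
# e.g. ("billi", "the cat") and ("chatai par", "on the mat") are extracted.
```

Each extracted pair would then be appended to the authentic parallel corpus as an additional training instance before building the NMT models.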


Notes

  1. https://nlp.stanford.edu/software/lex-parser.shtml.

  2. http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/indic_languages_corpus.tar.gz.

  3. https://github.com/anoopkunchukuttan/indic_nlp_library.

  4. http://www.statmt.org/moses/giza/GIZA++.html.

  5. https://github.com/awslabs/sockeye.

  6. https://github.com/moses-smt/mosesdecoder/blob/master/scripts/analysis/bootstrap-hypothesis-difference-significance.pl.

References

  • Aharoni R, Goldberg Y (2017) Towards string-to-tree neural machine translation. In: Proceedings of the 55th annual meeting of the association for computational linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Vancouver, pp 132–140

  • Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: International conference on learning representation (ICLR)

  • Bojar O, Buck C, Federmann C, Haddow B, Koehn P, Leveling J, Monz C, Pecina P, Post M, Saint-Amand H, Soricut R, Specia L, Tamchyna A (2014a) Findings of the 2014 workshop on statistical machine translation. In: Proceedings of the ninth workshop on statistical machine translation, Association for Computational Linguistics, Baltimore, pp 12–58

  • Bojar O, Diatka V, Rychlý P, Straňák P, Suchomel V, Tamchyna A, Zeman D (2014b) HindEnCorp: Hindi-English and Hindi-only corpus for machine translation. In: LREC, pp 3550–3555

  • Chen P-J, Shen J, Le M, Chaudhary V, El-Kishky A, Wenzek G, Ott M, Ranzato M (2019) Facebook AI’s WAT19 Myanmar-English translation task submission. In: Proceedings of the 6th workshop on Asian translation, Hong Kong, pp 112–122

  • Chiang D (2005) A hierarchical phrase-based model for statistical machine translation. In: Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, pp 263–270

  • Collins M, Koehn P, Kučerová I (2005) Clause restructuring for statistical machine translation. In: Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, pp 531–540

  • Currey A, Miceli Barone AV, Heafield K (2017) Copied monolingual data improves low-resource neural machine translation. In: Proceedings of the second conference on machine translation. Association for Computational Linguistics, Copenhagen, pp 148–156

  • Dauphin YN, Fan A, Auli M, Grangier D (2017) Language modeling with gated convolutional networks. In: International conference on machine learning. PMLR, pp 933–941

  • Durrani N, Schmid H, Fraser A (2011) A joint sequence translation model with integrated reordering. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, pp 1045–1054

  • Edunov S, Ott M, Auli M, Grangier D (2018) Understanding back-translation at scale. In: Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics, Brussels, pp 489–500

  • Eriguchi A, Hashimoto K, Tsuruoka Y (2019) Incorporating source-side phrase structures into neural machine translation. Comput Linguist 45(2):267–292


  • Fadaee M, Bisazza A, Monz C (2017) Data augmentation for low-resource neural machine translation. In: Proceedings of the 55th annual meeting of the association for computational linguistics (volume 2: short papers). Association for Computational Linguistics, Vancouver, pp 567–573

  • Fadaee M, Monz C (2018) Back-translation sampling by targeting difficult words in neural machine translation. In: EMNLP

  • Gehring J, Auli M, Grangier D, Yarats D, Dauphin YN (2017) Convolutional sequence to sequence learning. In: Proceedings of the 34th international conference on machine learning-volume 70. JMLR. org, pp 1243–1252

  • Hassan H, Aue A, Chen C, Chowdhary V, Clark J, Federmann C, Huang X, Junczys-Dowmunt M, Lewis W, Li M, Liu S, Liu T-Y, Luo R, Menezes A, Qin T, Seide F, Tan X, Tian F, Wu L, Wu S, Xia Y, Zhang D, Zhang Z, Zhou M (2018) Achieving human parity on automatic Chinese to English news translation. arXiv:1803.05567

  • Hieber F, Domhan T, Denkowski M, Vilar D, Sokolov A, Clifton A, Post M (2018) The sockeye neural machine translation toolkit at AMTA 2018. In: Proceedings of the 13th conference of the association for machine translation in the Americas (Volume 1: Research Papers). Association for Machine Translation in the Americas, Boston, pp 200–207

  • Iyer K (2020) Sentence boundary detection in legal texts grading: Option 3

  • Jha GN (2010) The TDIL program and the Indian language corpora initiative (ILCI). In: LREC

  • Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: International conference on learning representation (ICLR)

  • Kneser R, Ney H (1995) Improved backing-off for m-gram language modeling. In: 1995 international conference on acoustics, speech, and signal processing (ICASSP-95). IEEE, vol 1, pp 181–184

  • Koehn P (2004) Statistical significance tests for machine translation evaluation. In: Lin D, Wu D (eds), Proceedings of the 2004 conference on empirical methods in natural language processing (EMNLP), Barcelona, pp 388–395

  • Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: MT summit. Citeseer, vol 5, pp 79–86

  • Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, et al. (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, pp 177–180

  • Koehn P, Knowles R (2017) Six challenges for neural machine translation. In: Proceedings of the first workshop on neural machine translation. Association for Computational Linguistics, pp 28–39

  • Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: Proceedings of the 2003 conference of the North American chapter of the association for computational linguistics on human language technology-volume 1. Association for Computational Linguistics, pp 48–54

  • Kunchukuttan A, Mehta P, Bhattacharyya P (2018) The IIT Bombay English-Hindi parallel corpus. In: Calzolari N (Conference Chair), Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T (eds) Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018), European Language Resources Association (ELRA), Paris

  • Nakazawa T, Higashiyama S, Ding C, Dabre R, Kunchukuttan A, Pa WP, Goto I, Mino H, Sudoh K, Kurohashi S (2018) Overview of the 5th workshop on Asian translation. In: Proceedings of the 5th workshop on Asian translation (WAT2018), Hong Kong

  • Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51


  • Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: ACL-2002: 40th annual meeting of the association for computational linguistics. ACL, Philadelphia, pp 311–318

  • Ramanathan A, Bhattacharyya P, Visweswariah K, Ladha K, Gandhe A (2011) Clause-based reordering constraints to improve statistical machine translation. In: Proceedings of 5th international joint conference on natural language processing, Philadelphia, pp 1351–1355

  • Sennrich R, Haddow B, Birch A (2016a) Improving neural machine translation models with monolingual data. In: Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers). Association for Computational Linguistics, Berlin, pp 86–96


  • Sennrich R, Haddow B, Birch A (2016c) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers). Association for Computational Linguistics, Berlin, pp 1715–1725

  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008

  • Wang S, Liu Y, Wang C, Luan H, Sun M (2019) Improving back-translation with uncertainty-based confidence estimation. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, pp 791–802

  • Wang X, Pham H, Dai Z, Neubig G (2018) SwitchOut: an efficient data augmentation algorithm for neural machine translation. In: Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics, Brussels, pp 856–861

  • Wang X, Tu Z, Xiong D, Zhang M (2017) Translating phrases in neural machine translation. In: Proceedings of the 2017 conference on empirical methods in natural language processing. Association for Computational Linguistics, Copenhagen, pp 1421–1431

  • Zhang J, Zong C (2016) Exploiting source-side monolingual data in neural machine translation. In: Proceedings of the 2016 conference on empirical methods in natural language processing. pp 1535–1545

  • Zhao Y, Wang Y, Zhang J, Zong C (2018) Phrase table as recommendation memory for neural machine translation. In: IJCAI

  • Zhou C, Ma X, Hu J, Neubig G (2019) Handling syntactic divergence in low-resource machine translation. arXiv preprint arXiv:1909.00040

  • Zhu J, Gao F, Wu L, Xia Y, Qin T, Zhou W, Cheng X, Liu T-Y (2019) Soft contextual data augmentation for neural machine translation. In: ACL

  • Zoph B, Yuret D, May J, Knight K (2016) Transfer learning for low-resource neural machine translation. In: Proceedings of the 2016 conference on empirical methods in natural language processing. Association for Computational Linguistics, Austin, pp 1568–1575


Acknowledgements

The authors gratefully acknowledge the support of the project "Hindi to English Machine Aided Translation (HEMAT) for the Judicial Domain", sponsored by TDIL, MeitY, Govt. of India. Asif Ekbal gratefully acknowledges the Young Faculty Research Fellowship (YFRF), supported by the Visvesvaraya PhD scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, implemented by Digital India Corporation (formerly Media Lab Asia).

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Kamal Kumar Gupta or Asif Ekbal.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Gupta, K.K., Sen, S., Haque, R. et al. Augmenting training data with syntactic phrasal-segments in low-resource neural machine translation. Machine Translation 35, 661–685 (2021). https://doi.org/10.1007/s10590-021-09290-0

