Abstract
Neural machine translation (NMT) has emerged as the preferred alternative to the previously mainstream statistical machine translation (SMT) approaches, largely because it produces better translations. NMT training, however, is data-hungry: it generally requires training data on the order of a few million parallel sentences. This is a bottleneck for under-resourced languages that lack such resources. Machine translation (MT) researchers have tried to alleviate this data sparsity by augmenting the training data using a variety of strategies. In this paper, we propose a generalized, linguistically motivated data augmentation approach for NMT aimed at low-resource translation. The proposed method generates source–target phrasal segments from an authentic parallel corpus, where the target-side counterparts are linguistic phrases extracted from the syntactic parse trees of the target-side sentences. We augment the authentic training corpus with these parser-generated phrasal segments and investigate the efficacy of our strategy in low-resource scenarios. To this end, we carried out experiments on three resource-poor language pairs, viz. Hindi-to-English, Malayalam-to-English, and Telugu-to-English, using three state-of-the-art NMT paradigms: the attention-based recurrent neural network (Bahdanau et al. 2015), the Transformer (Vaswani et al. 2017), and the convolutional sequence-to-sequence model (Gehring et al. 2017). The MT systems trained on data prepared with our augmentation strategy surpassed the state-of-the-art NMT baselines by large margins on all three translation tasks. Further, we tested our approach in combination with back-translation (Sennrich et al. 2016a) and found the two to be complementary. This joint approach turned out to be the best-performing one in our low-resource experimental settings.
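To make the core idea concrete, the following is a minimal sketch of the target-side step the abstract describes: extracting phrasal constituents (here assumed to be NP, VP, and PP, as an illustrative label set) from a bracketed constituency parse of a target sentence. In the proposed method these target phrases are then paired with their source-side counterparts (e.g. via word alignments) to form synthetic parallel segments; that pairing step, and the parser itself, are outside this sketch.

```python
import re

# Illustrative set of phrase labels; the paper's actual label inventory
# may differ.
PHRASE_LABELS = {"NP", "VP", "PP"}

def parse_tree(s):
    """Parse a bracketed (Penn Treebank-style) parse string into nested
    (label, children) tuples; leaves are plain word strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    def helper(i):
        assert tokens[i] == "("
        label = tokens[i + 1]
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = helper(i)
            else:
                child, i = tokens[i], i + 1
            children.append(child)
        return (label, children), i + 1
    tree, _ = helper(0)
    return tree

def leaves(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]

def extract_phrases(tree, labels=PHRASE_LABELS, min_len=2):
    """Return the word sequences of all subtrees whose label is in
    `labels`, skipping segments shorter than `min_len` words."""
    if isinstance(tree, str):
        return []
    label, children = tree
    phrases = []
    words = leaves(tree)
    if label in labels and len(words) >= min_len:
        phrases.append(" ".join(words))
    for child in children:
        phrases.extend(extract_phrases(child, labels, min_len))
    return phrases

parse = "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
print(extract_phrases(parse_tree(parse)))
# ['the cat', 'sat on the mat', 'on the mat', 'the mat']
```

Each extracted phrase, once aligned with its source-side segment, yields an additional (short) sentence pair with which the authentic corpus can be augmented.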
References
Aharoni R, Goldberg Y (2017) Towards string-to-tree neural machine translation. In: Proceedings of the 55th annual meeting of the association for computational linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Vancouver, pp 132–140
Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: International conference on learning representation (ICLR)
Bojar O, Buck C, Federmann C, Haddow B, Koehn P, Leveling J, Monz C, Pecina P, Post M, Saint-Amand H, Soricut R, Specia L, Tamchyna A (2014a) Findings of the 2014 workshop on statistical machine translation. In: Proceedings of the ninth workshop on statistical machine translation, Association for Computational Linguistics, Baltimore, pp 12–58
Bojar O, Diatka V, Rychlý P, Straňák P, Suchomel V, Tamchyna A, Zeman D (2014b) HindEnCorp – Hindi-English and Hindi-only corpus for machine translation. In: LREC, pp 3550–3555
Chen P-J, Shen J, Le M, Chaudhary V, El-Kishky A, Wenzek G, Ott M, Ranzato M (2019) Facebook AI’s WAT19 Myanmar-English translation task submission. In: Proceedings of the 6th workshop on Asian translation, Hong Kong, pp 112–122
Chiang D (2005) A hierarchical phrase-based model for statistical machine translation. In: Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, pp 263–270
Collins M, Koehn P, Kučerová I (2005) Clause restructuring for statistical machine translation. In: Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, pp 531–540
Currey A, Miceli BAV, Heafield K (2017) Copied monolingual data improves low-resource neural machine translation. In: Proceedings of the second conference on machine translation. Association for Computational Linguistics, Copenhagen, pp 148–156
Dauphin YN, Fan A, Auli M, Grangier D (2017) Language modeling with gated convolutional networks. In: International conference on machine learning. PMLR, pp 933–941
Durrani N, Schmid H, Fraser A (2011) A joint sequence translation model with integrated reordering. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, pp 1045–1054
Edunov S, Ott M, Auli M, Grangier D (2018) Understanding back-translation at scale. In: Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics, Brussels, pp 489–500
Eriguchi A, Hashimoto K, Tsuruoka Y (2019) Incorporating source-side phrase structures into neural machine translation. Comput Linguist 45(2):267–292
Fadaee M, Bisazza A, Monz C (2017) Data augmentation for low-resource neural machine translation. In: Proceedings of the 55th annual meeting of the association for computational linguistics (volume 2: short papers). Association for Computational Linguistics, Vancouver, pp 567–573
Fadaee M, Monz C (2018) Back-translation sampling by targeting difficult words in neural machine translation. In: EMNLP
Gehring J, Auli M, Grangier D, Yarats D, Dauphin YN (2017) Convolutional sequence to sequence learning. In: Proceedings of the 34th international conference on machine learning-volume 70. JMLR. org, pp 1243–1252
Hassan H, Aue A, Chen C, Chowdhary V, Clark J, Federmann C, Huang X, Junczys-Dowmunt M, Lewis W, Li M, Liu S, Liu T-Y, Luo R, Menezes A, Qin T, Seide F, Tan X, Tian F, Wu L, Wu S, Xia Y, Zhang D, Zhang Z, Zhou M (2018) Achieving human parity on automatic Chinese to English news translation. arXiv:1803.05567
Hieber F, Domhan T, Denkowski M, Vilar D, Sokolov A, Clifton A, Post M (2018) The sockeye neural machine translation toolkit at AMTA 2018. In: Proceedings of the 13th conference of the association for machine translation in the Americas (Volume 1: Research Papers). Association for Machine Translation in the Americas, Boston, pp 200–207
Iyer K (2020) Sentence boundary detection in legal texts
Jha GN (2010) The TDIL program and the Indian language corpora initiative (ILCI). In: LREC
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: International conference on learning representation (ICLR)
Kneser R, Ney H (1995) Improved backing-off for m-gram language modeling. In: Acoustics, speech, and signal processing, 1995. ICASSP-95, 1995 International Conference on. IEEE, vol 1, pp 181–184
Koehn P (2004) Statistical significance tests for machine translation evaluation. In: Lin D, Wu D (eds), Proceedings of the 2004 conference on empirical methods in natural language processing (EMNLP), Barcelona, pp 388–395
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: MT summit. Citeseer, vol 5, pp 79–86
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, et al. (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, pp 177–180
Koehn P, Knowles R (2017) Six challenges for neural machine translation. In: Proceedings of the first workshop on neural machine translation. Association for Computational Linguistics, pp 28–39
Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: Proceedings of the 2003 conference of the North American chapter of the association for computational linguistics on human language technology-volume 1. Association for Computational Linguistics, pp 48–54
Kunchukuttan A, Mehta P, Bhattacharyya P (2018) The IIT Bombay English-Hindi Parallel Corpus. In: Chair NCC, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T, (eds). Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018), European Language Resources Association (ELRA), Paris
Nakazawa T, Higashiyama S, Ding C, Dabre R, Kunchukuttan A, Pa WP, Goto I, Mino H, Sudoh K, Kurohashi S (2018) Overview of the 5th workshop on Asian translation. In: Proceedings of the 5th Workshop on Asian Translation (WAT2018), Hong Kong
Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51
Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: ACL-2002: 40th annual meeting of the association for computational linguistics. ACL, Philadelphia, pp 311–318
Ramanathan A, Bhattacharyya P, Visweswariah K, Ladha K, Gandhe A (2011) Clause-based reordering constraints to improve statistical machine translation. In: Proceedings of 5th international joint conference on natural language processing, Philadelphia, pp 1351–1355
Sennrich R, Haddow B, Birch A (2016a) Improving neural machine translation models with monolingual data. In: Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers). Association for Computational Linguistics, Berlin, pp 86–96
Sennrich R, Haddow B, Birch A (2016c) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers). Association for Computational Linguistics, Berlin, pp 1715–1725
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
Wang S, Liu Y, Wang C, Luan H, Sun M (2019) Improving back-translation with uncertainty-based confidence estimation. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, pp 791–802
Wang X, Pham H, Dai Z, Neubig G (2018) SwitchOut: an efficient data augmentation algorithm for neural machine translation. In: Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics, Brussels, pp 856–861
Wang X, Tu Z, Xiong D, Zhang M (2017) Translating phrases in neural machine translation. In: Proceedings of the 2017 conference on empirical methods in natural language processing. Association for Computational Linguistics, Copenhagen, pp 1421–1431
Zhang J, Zong C (2016) Exploiting source-side monolingual data in neural machine translation. In: Proceedings of the 2016 conference on empirical methods in natural language processing. pp 1535–1545
Zhao Y, Wang Y, Zhang J, Zong C (2018) Phrase table as recommendation memory for neural machine translation. In: IJCAI
Zhou C, Ma X, Hu J, Neubig G (2019) Handling syntactic divergence in low-resource machine translation. arXiv preprint arXiv:1909.00040
Zhu J, Gao F, Wu L, Xia Y, Qin T, Zhou W, Cheng X, Liu T-Y (2019) Soft contextual data augmentation for neural machine translation. In: ACL
Zoph B, Yuret D, May J, Knight K (2016) Transfer learning for low-resource neural machine translation. In: Proceedings of the 2016 conference on empirical methods in natural language processing. Association for Computational Linguistics, Austin, pp 1568–1575
Acknowledgements
The authors gratefully acknowledge the support of the project "Hindi to English Machine Aided Translation (HEMAT) for the Judicial Domain", sponsored by TDIL, MeitY, Govt. of India. Asif Ekbal gratefully acknowledges the Young Faculty Research Fellowship (YFRF), supported by the Visvesvaraya PhD scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, implemented by Digital India Corporation (formerly Media Lab Asia).
Cite this article
Gupta, K.K., Sen, S., Haque, R. et al. Augmenting training data with syntactic phrasal-segments in low-resource neural machine translation. Machine Translation 35, 661–685 (2021). https://doi.org/10.1007/s10590-021-09290-0