Quality estimation-guided supplementary data selection for domain adaptation of statistical machine translation

Machine Translation

Abstract

The problem of domain adaptation in statistical machine translation (SMT) systems stems from the fundamental assumption that test and training data are drawn independently from the same distribution (topic, domain, genre, style, etc.). In real-life translation tasks, the sparseness of in-domain parallel training data often leads to poor model estimation and, consequently, poor translation quality. Domain adaptation by supplementary data selection addresses this issue by selecting relevant parallel training data from out-of-domain or general-domain bi-text to improve a weak baseline system. State-of-the-art research in data selection focuses on developing novel similarity measures to improve the relevance of the selected data. In this paper, we approach the problem from a different perspective. Instead of following the conventional approach of using all available target-domain data as the reference for supplementary data selection, we restrict the reference set to only those target-domain sentences that an automatic quality estimation model predicts will be poorly translated by the baseline MT system. Our rationale is to direct help (i.e. supplementary training material) to where it is needed most. The experiments reported in this paper show that (i) this technique provides statistically significant improvements over the unadapted baseline translation and (ii) using significantly smaller amounts of supplementary data, our approach achieves results comparable to state-of-the-art approaches that use conventional reference sets.
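
The idea of restricting the reference set can be illustrated with a minimal Python sketch. It is not the authors' implementation: the quality predictor is a hypothetical lookup table standing in for a trained quality estimation model, the 0.5 threshold is arbitrary, and the vocabulary-overlap score is a simple stand-in for whichever similarity measure a real data selection system would use. All function and variable names are assumptions made for illustration.

```python
from typing import Callable, List, Set, Tuple


def poorly_translated(sentences: List[str],
                      predicted_quality: Callable[[str], float],
                      threshold: float) -> List[str]:
    """Keep only the sentences whose predicted quality falls below the threshold."""
    return [s for s in sentences if predicted_quality(s) < threshold]


def vocabulary(sentences: List[str]) -> Set[str]:
    """Lower-cased, whitespace-tokenised vocabulary of a list of sentences."""
    return {tok for s in sentences for tok in s.lower().split()}


def select_supplementary(pool: List[Tuple[str, str]],
                         reference: List[str],
                         top_k: int) -> List[Tuple[str, str]]:
    """Rank out-of-domain sentence pairs by source-side overlap with the reference set.

    The overlap score is an illustrative stand-in, not the paper's similarity measure.
    """
    ref_vocab = vocabulary(reference)

    def overlap(pair: Tuple[str, str]) -> float:
        tokens = set(pair[0].lower().split())
        return len(tokens & ref_vocab) / len(tokens) if tokens else 0.0

    return sorted(pool, key=overlap, reverse=True)[:top_k]


# Hypothetical predicted quality scores for in-domain source sentences.
toy_scores = {
    "how do i reset my password": 0.9,
    "firewall blocks the update installer": 0.2,
}
in_domain = list(toy_scores)

# Reference set: only the sentences the (toy) QE model expects to be translated badly.
reference = poorly_translated(in_domain, toy_scores.get, threshold=0.5)

# Out-of-domain parallel pool; target sides are placeholders.
pool = [
    ("the installer needs a firewall exception", "<target 1>"),
    ("the committee adopted the resolution", "<target 2>"),
]
selected = select_supplementary(pool, reference, top_k=1)
```

In this toy run only the firewall sentence enters the reference set, so the selection step favours the out-of-domain pair that shares its vocabulary. The conventional setup corresponds to passing `in_domain` itself as the reference.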

Notes

  1. https://community.norton.com/.

  2. This is achieved by creating a dictionary on in-domain LMs and using it to filter the out-of-domain LMs for vocabulary matching.

  3. http://www.statmt.org/wmt12/translation-task.html.

  4. http://www.opensubtitles.org/.

  5. http://www.euromatrixplus.net/multi-un/.

  6. The parameters used are average sentence length, average type-token ratio, average stop-word to function-word ratio and the standard deviations of the same measures.

  7. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

  8. In terms of inter-annotator agreement, a Kappa coefficient of 0.200 was obtained in this task (slight agreement according to Landis and Koch (1977)).
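
As a hedged illustration of the agreement figure in note 8, the sketch below computes a two-annotator kappa and maps it to the Landis and Koch (1977) bands. The notes do not state which kappa variant was used, so Cohen's kappa for two annotators is an assumption here, and the function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items (assumed variant)."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def landis_koch_label(kappa: float) -> str:
    """Agreement band following Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

Under these bands, a value of 0.200 sits at the upper edge of the "slight" range, which matches the description in the note.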

References

  • Axelrod A, He X, Gao J (2011) Domain adaptation via pseudo in-domain data selection. In: Proceedings of the conference on EMNLP-11, Edinburgh, pp 355–362

  • Banerjee P, Naskar S, Roturier J, Way A, van Genabith J (2012) Translation quality-based supplementary data selection by incremental update of translation models. In: Proceedings of COLING-2012. Mumbai, pp 149–165

  • Banerjee P, Naskar SK, Roturier J, Way A, van Genabith J (2011) Domain adaptation in statistical machine translation of user-forum data using component level mixture modelling. In: Proceedings of MT Summit XIII, Xiamen, pp 285–292

  • Banerjee P, Naskar SK, Roturier J, Way A, van Genabith J (2012) Domain adaptation in SMT of user-generated forum content guided by OOV word reduction: normalization and/or supplementary data? In: Proceedings of the 16th Annual Conference of the EAMT (EAMT-2012), Trento, pp 169–176

  • Banerjee S, Lavie A (2005) METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, Ann Arbor, pp 65–72

  • Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2004) Confidence estimation for machine translation. In: Proceedings of the 20th International conference on computational linguistics, COLING ’04, Geneva

  • Callison-Burch C, Koehn P, Monz C, Post M, Soricut R, Specia L (2012) Findings of the 2012 workshop on statistical machine translation. In: Proceedings of the seventh workshop on statistical machine translation. Montreal, pp 10–51

  • Daumé III H, Jagarlamudi J (2011) Domain adaptation for machine translation by mining unseen words. In: Proceedings of the 49th annual meeting of the ACL: HLT, Portland, pp 407–412

  • Duh K, Neubig G, Sudoh K, Tsukada H (2013) Adaptation data selection using neural language models: Experiments in machine translation. In: Proceedings of ACL (2). Sofia, pp 678–683

  • Eck M, Vogel S, Waibel A (2004) Language model adaptation for statistical machine translation based on information retrieval. In: Proceedings of 4th international conference on language resources and evaluation, (LREC 2004), Lisbon, pp 327–330

  • Federico M, Bertoldi N, Cettolo M (2008) IRSTLM: an open source toolkit for handling large scale language models. In: Interspeech 2008. Brisbane, pp 1618–1621

  • Federmann C (2012) Appraise: an Open-source toolkit for manual evaluation of MT output. Prague Bull Math Linguist 98:130–134

  • Foster G, Kuhn R (2007) Mixture-model adaptation for SMT. In: ACL 2007: Proceedings of the second WMT, Prague, pp 128–135

  • Gandrabur S, Foster G (2003) Confidence estimation for text prediction. In: Proceedings of the conference on natural language learning (CoNLL), Edmonton, pp 315–321

  • Heafield K (2011) KenLM: Faster and smaller language model queries. In: Proceedings of the sixth WMT, Edinburgh, pp 187–197

  • Hildebrand AS, Eck M, Vogel S, Waibel A (2005) Adaptation of the Translation model for statistical machine translation based on information retrieval. In: Proceedings of 10th EAMT conference, Budapest, pp 119–125

  • Irvine A, Morgan J, Carpuat M, Daumé III H, Munteanu D (2013) Measuring machine translation errors in new domains. Trans Assoc Comput Linguist 1:429–440

  • Joachims T (1999) Making large-scale SVM learning practical. In: Schölkopf B, Burges C, Smola A (eds) Advances in kernel methods—support vector learning. MIT Press, Cambridge

  • Kaljahi R, Foster J, Roturier J, Rubino R (2014) Quality estimation of English-French machine translation: a detailed study of the role of syntax. In: COLING 2014, pp 2052–2063

  • Kaljahi RSZ, Foster J, Rubino R, Roturier J, Hollowood F (2013) Parser accuracy in quality estimation of machine translation: a tree kernel approach. In: 6th International joint conference on natural language processing (IJCNLP), pp 1092–1096

  • Koehn P (2004) Statistical significance tests for machine translation evaluation. In: Proceedings of the conference on EMNLP, (EMNLP 2004), Barcelona, pp 388–395

  • Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of MT Summit X, Phuket, pp 79–86

  • Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the interactive poster and demonstration sessions, ACL 2007, Prague, pp 177–180

  • Koehn P, Schroeder J (2007) Experiments in domain adaptation for statistical machine translation. In: ACL 2007: Proceedings of the second WMT, Prague, pp 224–227

  • Landis J, Koch G (1977) The measurement of observer agreement for categorical data. Biometrics 33:159–174

  • LDC (2005) Linguistic data annotation specification: Assessment of fluency and adequacy in translations. Technical report

  • Lü Y, Huang J, Liu Q (2007) Improving statistical machine translation performance by training data selection and optimization. In: Proceedings of 2007 joint conference on empirical methods in natural language processing and computational natural language learning, Prague, pp 343–350

  • Moore RC, Lewis W (2010) Intelligent selection of language model training data. In: Proceedings of the ACL 2010 conference short papers, pp 220–224

  • Och FJ (2003) Minimum error rate training in statistical machine translation. In: Proceedings of the 41st annual meeting on ACL, vol 1. Sapporo, pp 160–167

  • Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29:19–51

  • Okita T, Rubino R, van Genabith J (2012) Sentence-Level Quality Estimation for MT System Combination. In: Proceedings of the ML4HMT-12 workshop, Mumbai, p 55

  • Ozdowska S, Way A (2009) Optimal bilingual data for French–English PB-SMT. In: Proceedings of the 13th Annual conference of the European Association for Machine Translation (EAMT-2009), Barcelona, pp 96–103

  • Papineni K, Roukos S, Ward T, Zhu W (2002) BLEU: a method for automatic evaluation of machine translation. In: 40th annual meeting of the ACL (ACL 2002), Philadelphia, pp 311–318

  • Quirk C (2004) Training a sentence-level machine translation confidence measure. In: Proceedings of LREC, Lisbon, pp 825–828

  • Raybaud S, Langlois D, Smaïli K (2011) This sentence is wrong. Detecting errors in machine-translated sentences. Mach Transl 25(1):1–34

  • Rubino R, De Souza JG, Foster J, Specia L (2013) Topic models for translation quality estimation for gisting purposes. In: Machine translation Summit XIV, Nice, pp 295–302

  • Rubino R, Huet S, Lefèvre F, Linarès G (2012) Statistical post-editing of machine translation for domain adaptation. In: Proceedings of the European Association for Machine Translation (EAMT), Trento, pp 221–228

  • Sennrich R (2012) Mixture-modeling with unsupervised clusters for domain adaptation in statistical machine translation. In: Proceedings of the 16th annual conference of the EAMT (EAMT-2012), Trento, pp 185–192

  • Sennrich R (2012) Perplexity minimization for translation model domain adaptation in statistical machine translation. In: Proceedings of the 13th conference of the European Chapter of the Association for Computational Linguistics (EACL-2012), Avignon, pp 539–549

  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of AMTA, Cambridge, pp 223–231

  • Specia L, Farzindar A (2010) Estimating machine translation post-editing effort with HTER. In: AMTA 2010 workshop, bringing MT to the User: MT Research and the Translation Industry, Denver

  • Stolcke A (2002) SRILM: an extensible language modeling toolkit. In: ICSLP 2002, Interspeech 2002: 7th international conference on spoken language processing, Denver, pp 901–904

  • Suzuki H (2011) Automatic post-editing based on SMT and its selective application by sentence-level automatic quality evaluation. In: Proceedings of MT Summit XIII, Xiamen, pp 156–163

  • Tiedemann J (2009) News from OPUS—a collection of multilingual parallel corpora with tools and interfaces. In: Nicolov N, Bontcheva K, Angelova G, Mitkov R (eds) Recent advances in natural language processing, vol V, pp 237–248

  • Ueffing N, Macherey K, Ney H (2003) Confidence Measures for Statistical Machine Translation. In: Proceedings of the MT Summit IX, New Orleans, pp 394–401

  • Yasuda K, Zhang R, Yamamoto H, Sumita E (2008) Method of selecting training data to build a compact and efficient translation model. In: Proceedings of international joint conference on natural language processing, Hyderabad, pp 655–660

Acknowledgments

The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 288769, from Science Foundation Ireland (Grant 07/CE/I1142) as part of CNGL at Dublin City University, and from Research Ireland under the Enterprise Partnership Scheme (EPSPD/2011/135).

Corresponding author

Correspondence to Pratyush Banerjee.

Cite this article

Banerjee, P., Rubino, R., Roturier, J. et al. Quality estimation-guided supplementary data selection for domain adaptation of statistical machine translation. Machine Translation 29, 77–100 (2015). https://doi.org/10.1007/s10590-014-9165-9

  • DOI: https://doi.org/10.1007/s10590-014-9165-9
