Abstract
The problem of domain adaptation in statistical machine translation systems emanates from the fundamental assumption that test and training data are drawn independently from the same distribution (topic, domain, genre, style, etc.). In real-life translation tasks, the sparseness of in-domain parallel training data often leads to poor model estimation, and consequently to poor translation quality. Domain adaptation by supplementary data selection addresses this specific issue by selecting relevant parallel training data from out-of-domain or general-domain bi-text to enhance the quality of a poor baseline system. State-of-the-art research in data selection focuses on the development of novel similarity measures to improve the relevance of the selected data. In this paper, however, we approach the problem from a different perspective. In contrast to the conventional approach of using the entire available target-domain data as a reference for supplementary data selection, we use automatic quality estimation to restrict the reference set to only those target-domain sentences that the baseline MT system is expected to translate poorly. Our rationale is to focus help (i.e. supplementary training material) where it is needed most. The experiments reported in this paper show that (i) this technique provides statistically significant improvements over the unadapted baseline translation and (ii) using significantly smaller amounts of supplementary data, our approach achieves results comparable to state-of-the-art approaches using conventional reference sets.
Notes
This is achieved by creating a dictionary from the in-domain LMs and using it to filter the out-of-domain LMs so that their vocabularies match.
The parameters used are average sentence length, average type-token ratio, average stop-word to function-word ratio and the standard deviations of the same measures.
In terms of inter-annotator agreement, a Kappa coefficient value of 0.200 was obtained in this task (slight agreement according to Landis and Koch 1977).
References
Axelrod A, He X, Gao J (2011) Domain adaptation via pseudo in-domain data selection. In: Proceedings of the conference on EMNLP-11, Edinburgh, pp 355–362
Banerjee P, Naskar S, Roturier J, Way A, van Genabith J (2012) Translation quality-based supplementary data selection by incremental update of translation models. In: Proceedings of COLING-2012. Mumbai, pp 149–165
Banerjee P, Naskar SK, Roturier J, Way A, van Genabith J (2011) Domain adaptation in statistical machine translation of user-forum data using component level mixture modelling. In: Proceedings of MT Summit XIII, Xiamen, pp 285–292
Banerjee P, Naskar SK, Roturier J, Way A, van Genabith J (2012) Domain adaptation in SMT of user-generated forum content guided by OOV word reduction: normalization and/or supplementary data? In: Proceedings of the 16th Annual Conference of the EAMT (EAMT-2012), Trento, pp 169–176
Banerjee S, Lavie A (2005) METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, Ann Arbor, pp 65–72
Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2004) Confidence estimation for machine translation. In: Proceedings of the 20th International conference on computational linguistics, COLING ’04, Geneva
Callison-Burch C, Koehn P, Monz C, Post M, Soricut R, Specia L (2012) Findings of the 2012 workshop on statistical machine translation. In: Proceedings of the seventh workshop on statistical machine translation. Montreal, pp 10–51
Daumé III H, Jagarlamudi J (2011) Domain adaptation for machine translation by mining unseen words. In: Proceedings of the 49th annual meeting of the ACL: HLT, Portland, pp 407–412
Duh K, Neubig G, Sudoh K, Tsukada H (2013) Adaptation data selection using neural language models: Experiments in machine translation. In: Proceedings of ACL (2). Sofia, pp 678–683
Eck M, Vogel S, Waibel A (2004) Language model adaptation for statistical machine translation based on information retrieval. In: Proceedings of 4th international conference on language resources and evaluation, (LREC 2004), Lisbon, pp 327–330
Federico M, Bertoldi N, Cettolo M (2008) IRSTLM: an open source toolkit for handling large scale language models. In: Interspeech 2008. Brisbane, pp 1618–1621
Federmann C (2012) Appraise: an open-source toolkit for manual evaluation of MT output. Prague Bull Math Linguist 98:130–134
Foster G, Kuhn R (2007) Mixture-model adaptation for SMT. In: ACL 2007: Proceedings of the second WMT, Prague, pp 128–135
Gandrabur S, Foster G (2003) Confidence estimation for text prediction. In: Proceedings of the conference on natural language learning (CoNLL), Edmonton, pp 315–321
Heafield K (2011) KenLM: Faster and smaller language model queries. In: Proceedings of the sixth WMT, Edinburgh, pp 187–197
Hildebrand AS, Eck M, Vogel S, Waibel A (2005) Adaptation of the translation model for statistical machine translation based on information retrieval. In: Proceedings of 10th EAMT conference, Budapest, pp 119–125
Irvine A, Morgan J, Carpuat M, Daumé III H, Munteanu D (2013) Measuring machine translation errors in new domains. Trans Assoc Comput Linguist 1:429–440
Joachims T (1999) Making large-scale SVM learning practical. In: Schölkopf B, Burges C, Smola A (eds) Advances in kernel methods—support vector learning. MIT Press, Cambridge
Kaljahi R, Foster J, Roturier J, Rubino R (2014) Quality estimation of English-French machine translation: a detailed study of the role of syntax. In: COLING 2014, pp 2052–2063
Kaljahi RSZ, Foster J, Rubino R, Roturier J, Hollowood F (2013) Parser accuracy in quality estimation of machine translation: a tree kernel approach. In: 6th International joint conference on natural language processing (IJCNLP), pp 1092–1096
Koehn P (2004) Statistical significance tests for machine translation evaluation. In: Proceedings of the conference on EMNLP, (EMNLP 2004), Barcelona, pp 388–395
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of MT Summit X, Phuket, pp 79–86
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the interactive poster and demonstration sessions, ACL 2007, Prague, pp 177–180
Koehn P, Schroeder J (2007) Experiments in domain adaptation for statistical machine translation. In: ACL 2007: Proceedings of the second WMT, Prague, pp 224–227
Landis J, Koch G (1977) The measurement of observer agreement for categorical data. Biometrics 33:159–174
LDC (2005) Linguistic data annotation specification: Assessment of fluency and adequacy in translations. Technical report
Lü Y, Huang J, Liu Q (2007) Improving statistical machine translation performance by training data selection and optimization. In: Proceedings of 2007 joint conference on empirical methods in natural language processing and computational natural language learning, Prague, pp 343–350
Moore RC, Lewis W (2010) Intelligent selection of language model training data. In: Proceedings of the ACL 2010 conference short papers, pp 220–224
Och FJ (2003) Minimum error rate training in statistical machine translation. In: Proceedings of the 41st annual meeting on ACL, vol 1. Sapporo, pp 160–167
Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29:19–51
Okita T, Rubino R, van Genabith J (2012) Sentence-level quality estimation for MT system combination. In: Proceedings of the ML4HMT-12 workshop, Mumbai, p 55
Ozdowska S, Way A (2009) Optimal bilingual data for French–English PB-SMT. In: Proceedings of the 13th Annual conference of the European Association for Machine Translation (EAMT-2009), Barcelona, pp 96–103
Papineni K, Roukos S, Ward T, Zhu W (2002) BLEU: a method for automatic evaluation of machine translation. In: 40th annual meeting of the ACL (ACL 2002), Philadelphia, pp 311–318
Quirk C (2004) Training a sentence-level machine translation confidence measure. In: Proceedings of LREC, Lisbon, pp 825–828
Raybaud S, Langlois D, Smaïli K (2011) This sentence is wrong. Detecting errors in machine-translated sentences. Mach Transl 25(1):1–34
Rubino R, De Souza JG, Foster J, Specia L (2013) Topic models for translation quality estimation for gisting purposes. In: Machine translation Summit XIV, Nice, pp 295–302
Rubino R, Huet S, Lefèvre F, Linarès G (2012) Statistical post-editing of machine translation for domain adaptation. In: Proceedings of the European Association for Machine Translation (EAMT), Trento, pp 221–228
Sennrich R (2012) Mixture-modeling with unsupervised clusters for domain adaptation in statistical machine translation. In: Proceedings of the 16th annual conference of the EAMT (EAMT-2012), Trento, pp 185–192
Sennrich R (2012) Perplexity minimization for translation model domain adaptation in statistical machine translation. In: Proceedings of the 13th conference of the European Chapter of the Association for Computational Linguistics (EACL-2012), Avignon, pp 539–549
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of AMTA, Cambridge, pp 223–231
Specia L, Farzindar A (2010) Estimating machine translation post-editing effort with HTER. In: AMTA 2010-workshop, bringing MT to the User: MT Research and the Translation Industry, Denver
Stolcke A (2002) SRILM-An extensible language modeling toolkit. In: ICSLP 2002, Interspeech 2002: 7th international conference on spoken language processing, Denver, pp 901–904
Suzuki H (2011) Automatic post-editing based on SMT and its selective application by sentence-level automatic quality evaluation. In: Proceedings of MT Summit XIII, Xiamen, pp 156–163
Tiedemann J (2009) News from OPUS—a collection of multilingual parallel corpora with tools and interfaces. In: Nicolov N, Bontcheva K, Angelova G, Mitkov R (eds) Recent advances in natural language processing, vol V, pp 237–248
Ueffing N, Macherey K, Ney H (2003) Confidence measures for statistical machine translation. In: Proceedings of the MT Summit IX, New Orleans, pp 394–401
Yasuda K, Zhang R, Yamamoto H, Sumita E (2008) Method of selecting training data to build a compact and efficient translation model. In: Proceedings of international joint conference on natural language processing, Hyderabad, pp 655–660
Acknowledgments
The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 288769, from Science Foundation Ireland (Grant 07/CE/I1142) as part of CNGL at Dublin City University, and from Research Ireland under the Enterprise Partnership Scheme (EPSPD/2011/135).
Cite this article
Banerjee, P., Rubino, R., Roturier, J. et al. Quality estimation-guided supplementary data selection for domain adaptation of statistical machine translation. Machine Translation 29, 77–100 (2015). https://doi.org/10.1007/s10590-014-9165-9
DOI: https://doi.org/10.1007/s10590-014-9165-9