
Survey of data-selection methods in statistical machine translation

Machine Translation

Abstract

Statistical machine translation has seen significant improvements in quality over the past several years. The single biggest factor in this improvement has been the accumulation of ever larger stores of data. We now find ourselves, however, the victims of our own success, in that it has become increasingly difficult to train on such large sets of data, due to limitations in memory, processing power, and ultimately, speed (i.e. turning data into models takes an inordinate amount of time). Moreover, the training data varies widely in quality. A variety of methods for data cleaning and data selection have been developed to address these issues. Each of these methods employs a search or filtering algorithm to select a subset of the data, given a defined set of feature functions. In this paper we provide a comparative overview of research in this area based on application scenario, feature functions and search method.

Notes

  1. One point to acknowledge here is the extent to which the data supplied is optimally used. Ozdowska and Way (2009) provide comparative scores of systems built using different subsets of Europarl data (Koehn 2005). As might be expected, performance is highest for language pairs which use parallel data constructed with that particular language direction in mind, i.e. a French-to-English MT system works best when the original language data was French, and subsequently translated into English, as opposed to (i) where the original language data was English, and translated into French, or (ii) where the source language was neither French nor English. This is an important point, yet remains largely ignored by the wider community; system developers tend to value larger amounts of (say) French–English parallel data rather than select segments of a data set based on what the original language was.

  2. Low-resource languages are languages that have relatively little monolingual or parallel data available for training, tuning and evaluation.

  3. Free translation, or freely translated text, conveys the overall meaning of a sentence or phrase without necessarily a word-for-word correspondence between source and translated text, in contrast to literal, direct or word-for-word translation (Han et al. 2009).

  4. The terms “translation units” or “utterances” could also be used; they are functionally equivalent in this context.

  5. Size can also be measured in the total number of characters, tokens, token types, etc., depending on the scenario.

  6. The number of subsets of size m that can be drawn from a set of size n is \(\binom{n}{m} = \frac{n!}{m!\,(n-m)!}\). Since this value grows exponentially in n (for \(m = n/2\) it is on the order of \(2^n/\sqrt{n}\)), it is impractical to do an exhaustive search enumerating all subsets.
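
As a quick numerical illustration of this growth (a minimal sketch using Python's standard library; the corpus sizes are invented):

```python
from math import comb

# Number of size-m subsets of an n-sentence corpus: n! / (m! * (n - m)!)
for n in (10, 100, 1000):
    m = n // 2
    print(f"n={n:>4}, m={m:>3}: {comb(n, m):.3e} candidate subsets")
# n=  10, m=  5: 2.520e+02 candidate subsets
# n= 100, m= 50: 1.009e+29 candidate subsets
# n=1000, m=500: 2.703e+299 candidate subsets
```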

  7. We will use BLEU (Bilingual Evaluation Understudy) as the automated metric here, but other measures are certainly viable. BLEU is the most popular automatic evaluation metric and is based on n-gram matches between translation output and reference translations (Papineni et al. 2002; Koehn 2009).
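
For concreteness, a minimal single-reference, sentence-level BLEU sketch; the official metric is corpus-level and adds smoothing details, and the example sentences here are invented:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hyp, ref, max_n=4):
    """Single-reference, sentence-level BLEU without smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
        # Clip: each hypothesis n-gram is credited at most as often
        # as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(matched / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any zero n-gram precision zeroes the score
    brevity_penalty = min(1.0, exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * exp(sum(log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat".split(),
           "the cat sat on the mat today".split()))  # ≈ 0.85 (brevity penalty only)
```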

  8. In this paper, unless otherwise specified, word n-grams are referred to simply as n-grams for brevity. A word n-gram is a sequence of n words that appears in natural language text.
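
A small illustrative extractor (the helper name is ours):

```python
def word_ngrams(tokens, n):
    """All contiguous word n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("to be or not to be".split(), 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```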

  9. Batch-learning techniques are often used to make data-selection methods practical. See Sect. 9 for details.

  10. Term Frequency–Inverse Document Frequency.
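
A minimal sketch of one common TF-IDF variant (relative term frequency, unsmoothed logarithmic IDF); the toy documents are invented:

```python
from collections import Counter
from math import log

docs = [d.split() for d in ("the translation model is large",
                            "the language model",
                            "parallel corpus data")]

def tf_idf(term, doc, docs):
    tf = Counter(doc)[term] / len(doc)   # within-document frequency
    df = sum(term in d for d in docs)    # documents containing the term
    return tf * log(len(docs) / df)      # rare terms weigh more

print(tf_idf("translation", docs[0], docs))  # distinctive term: higher weight
print(tf_idf("the", docs[0], docs))          # frequent across documents: lower weight
```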

  11. By definition, the conditional entropy \(H(Y|X)\) is \(-\sum _{x\in X, y\in Y} p(x,y)\log p(y|x)\). However, the definition provided for PhrEnt here is missing a \(P_{{\mathcal {TM}}(S_1^{j})}(p_{\scriptscriptstyle {\textit{s}}})\) factor inside the summation but outside the log. We recognize this inaccuracy but keep formula (14) consistent with the referenced work (Ambati 2011).

  12. The perplexity of a random variable is defined as two (or whichever base the entropy is computed in) raised to the power of its entropy. In natural language processing this quantity is commonly used to measure how surprised a language model is when observing a sequence of words.
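
A minimal sketch of this relationship for a known discrete distribution; per-word perplexity of a language model over a sentence is computed analogously from per-word log probabilities:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

def perplexity(probs):
    return 2 ** entropy(probs)

print(perplexity([0.25] * 4))         # 4.0: as uncertain as a fair four-way choice
print(perplexity([0.9, 0.05, 0.05]))  # ≈ 1.48: a confident model is rarely surprised
```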

  13. For all practical purposes in their work (Liu et al. 2010a), a phrase is the same as an n-gram, since they take the phrases of a sentence to be all of its n-grams up to a certain length.

  14. Kullback–Leibler divergence is an information-theoretic measure of the difference between two probability distributions (Kullback and Leibler 1951); it is asymmetric, so it is not a distance in the metric sense.
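
A minimal sketch over discrete distributions, illustrating the asymmetry; the distributions are invented:

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(P || Q) in bits over a shared discrete support."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q))  # ≈ 0.737
print(kl_divergence(q, p))  # ≈ 0.531: asymmetric, hence not a metric distance
```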

  15. Translation Edit Rate (TER) measures the amount of editing a human would need to perform on an MT output to exactly match a reference translation (Snover et al. 2006).
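
A sketch of the edit-distance core only; full TER additionally allows block shifts at unit cost, and normalizes the edit count by reference length as the final line does here. The example sentences are invented:

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1]

hyp = "the cat sat on mat".split()
ref = "the cat sat on the mat".split()
print(word_edit_distance(hyp, ref) / len(ref))  # ≈ 0.17: one edit per six reference words
```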

  16. The seed corpus can be used for this purpose as well.

  17. Entropy, H(X), measures the amount of information or uncertainty in a random variable (Manning and Schütze 1999).

  18. The cross-entropy of a random variable X, with true probability distribution P(X) and estimated probability distribution Q(X), is formally defined as the entropy of X plus the KL-divergence between P(X) and Q(X): \(H(P(X),Q(X)) = H(P(X)) + D_{{\scriptscriptstyle {\textit{KL}}}}(P(X)||Q(X))\) (Cover and Thomas 1991).
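
A quick numerical check of this decomposition on an invented pair of distributions:

```python
from math import log2

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]  # true and estimated distributions

entropy_p = -sum(pi * log2(pi) for pi in p)
kl_pq = sum(pi * log2(pi / qi) for pi, qi in zip(p, q))
cross_entropy = -sum(pi * log2(qi) for pi, qi in zip(p, q))

# H(P, Q) = H(P) + D_KL(P || Q) holds exactly (up to float rounding).
assert abs(cross_entropy - (entropy_p + kl_pq)) < 1e-12
print(f"H(P)={entropy_p:.4f}  KL={kl_pq:.4f}  H(P,Q)={cross_entropy:.4f}")
```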

  19. A word-alignment model is a statistical model trained and used to align individual words in a sentence with their translations in the translated sentence. A word-aligned sentence pair contains alignment links connecting words in one sentence to words in the other (Brown et al. 1993).

  20. In their work, Khadivi and Ney use IBM model 1, HMM and IBM model 2 in succession to train the final model. However, their scoring function does not depend on any particular alignment model, so any alignment model can be used.

  21. We think a more natural derivation of alignment entropy for a sentence is to average the alignment entropies of its source or target words over all possible alignments. This would measure how uncertain an alignment model is about the word alignments in a sentence, which is a more natural use of entropy than the uncertainty about a word given its number of Viterbi alignment links.

  22. In word alignment, the fertility of a word is defined as the number of alignment links initiated from that word.
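
A minimal sketch over an invented set of alignment links, where each link is a (source index, target index) pair:

```python
from collections import Counter

# Invented alignment links for one sentence pair; source word 1 is linked
# to two target words, so it has fertility 2.
links = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 3)]

fertility = Counter(src for src, _ in links)  # links initiated per source word
print(fertility)  # Counter({1: 2, 0: 1, 2: 1, 3: 1})
```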

  23. The weights are trained using a manually analyzed development set of size 1000.

  24. A set function is modular if and only if its value over any set equals the sum of its values over the set’s individual elements.
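
A minimal sketch of this definition; the element weights are invented:

```python
# Per-element weights fully determine a modular function: f(S) = sum of w(e) for e in S.
w = {"a": 2.0, "b": 3.5, "c": 1.0}

def f(subset):
    return sum(w[e] for e in subset)

# Modularity: no interaction between elements, unlike submodular functions,
# whose marginal gains diminish as the set grows.
assert f({"a", "b"}) == f({"a"}) + f({"b"})
print(f({"a", "b", "c"}))  # 6.5
```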

  25. An oracle can be a human or a human-labeled data set that provides the true label for a query. In the context of MT, the oracle is a human translator or a parallel corpus where the source text has already been translated by humans.

  26. In this batch-mode learning strategy, the idea is to avoid retraining all SMT models after each sentence is selected. Instead, a less expensive update, in which only the scoring function is refreshed, is used to preserve the diversity of the batch. Only after the entire batch has been selected are all SMT models updated.

  27. Hierarchical sampling attempts to leverage the cluster structure of the data for sampling in an active-learning setting (Dasgupta and Hsu 2008). In this work, a static hierarchical clustering of the unlabeled data is given, and a set of cluster nodes available for sampling is maintained throughout the algorithm. Initially, the sampling set contains only the root node of the hierarchy, which covers all data points. Random samples are drawn from the sampling set and their labels are queried from the oracle. Based on these queries, each node in the hierarchy maintains statistics about its positive and negative labels. Cluster nodes in the sampling set that have mixed labels but purer child nodes are removed from the sampling set, and their child nodes are added in their place. These steps are repeated until all nodes reach a predefined level of purity. The method is motivated by the “sampling bias” problem (Schütze et al. 2006) in active learning and provides theoretical guarantees of better learning performance than random sampling.
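
A much-simplified sketch of the idea; the purity threshold, minimum sample count, and data structures here are illustrative choices of ours, not taken from Dasgupta and Hsu (2008):

```python
import random

class Node:
    def __init__(self, items, children=()):
        self.items = items        # data indices covered by this cluster node
        self.children = children  # the hierarchy itself is given and static
        self.counts = {}          # observed label -> count at this node

def purity(node):
    return max(node.counts.values()) / sum(node.counts.values())

def hierarchical_sampling(root, oracle, budget, threshold=0.9, min_samples=5):
    """Sample within active cluster nodes; split nodes that look impure."""
    active, labeled = [root], {}
    for _ in range(budget):
        node = random.choice(active)
        item = random.choice(node.items)  # may re-query an item in this sketch
        labeled[item] = oracle(item)
        node.counts[labeled[item]] = node.counts.get(labeled[item], 0) + 1
        # Replace a sufficiently sampled, impure node with its children.
        if (node.children and sum(node.counts.values()) >= min_samples
                and purity(node) < threshold):
            active.remove(node)
            active.extend(node.children)
    return labeled

leaves = [Node([0, 1]), Node([2, 3])]
root = Node([0, 1, 2, 3], children=leaves)
print(hierarchical_sampling(root, oracle=lambda i: i < 2, budget=20))
```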

References

  • Adesam Y (2012) The multilingual forest: investigating high-quality parallel corpus development. PhD thesis, Stockholm University, Stockholm

  • Allauzen A, Bonneau-Maynard H, Le HS, Max A, Wisniewski G, Yvon F, Adda G, Crego JM, Lardilleux A, Lavergne T, Sokolov A (2011) LIMSI @ WMT11. In: Proceedings of the sixth workshop on statistical machine translation, Edinburgh, pp 309–315

  • Ambati V (2011) Active learning and crowdsourcing for machine translation in low resource scenarios. Ph.D. thesis, Carnegie Mellon University, Pittsburgh

  • Ambati V, Vogel S, Carbonell J (2010) Active learning and crowd-sourcing for machine translation. In: Proceedings of the seventh conference on international language resources and evaluation (LREC’10), Valletta, vol 7, pp 2169–2174

  • Ananthakrishnan S, Prasad R, Stallard D, Natarajan P (2010a) Discriminative sample selection for statistical machine translation. In: Proceedings of the 2010 conference on empirical methods in natural language processing, Cambridge, pp 626–635

  • Ananthakrishnan S, Prasad R, Stallard D, Natarajan P (2010b) A semi-supervised batch-mode active learning strategy for improved statistical machine translation. In: Proceedings of the fourteenth conference on computational natural language learning, Uppsala, pp 126–134

  • Axelrod A, He X, Gao J (2011) Domain adaptation via pseudo in-domain data selection. In: Proceedings of the conference on empirical methods in natural language processing, Edinburgh, pp 355–362

  • Axelrod A, Li Q, Lewis W (2012) Applications of data selection via cross-entropy difference for real-world statistical machine translation. In: Proceedings of the international workshop on spoken language translation, Hong Kong, pp 201–108

  • Banerjee P, Naskar S, Roturier J, Way A, van Genabith J (2011) Domain adaptation in statistical machine translation of user-forum data using component level mixture modelling. In: Proceedings of machine translation summit XIII, Xiamen, pp 285–292

  • Biçici E, Yuret D (2011) Instance selection for machine translation using feature decay algorithms. In: Proceedings of the sixth workshop on statistical machine translation, Edinburgh, pp 272–283

  • Biçici E, Liu Q, Way A (2014) Parallel FDA5 for fast deployment of accurate statistical machine translation systems. In: Proceedings of the ninth workshop on statistical machine translation, Baltimore, pp 59–65

  • Biçici E, Liu Q, Way A (2015) ParFDA for fast deployment of accurate statistical machine translation systems, benchmarks, and statistics. In: Proceedings of the tenth workshop on statistical machine translation, Lisbon, pp 74–78

  • Bloodgood M, Callison-Burch C (2010) Bucking the trend: Large-scale cost-focused active learning for statistical machine translation. In: ACL 2010, The 48th annual meeting of the association for computational linguistics, conference proceedings, Uppsala, pp 854–864

  • Brown PF, Della Pietra VJ, Della Pietra SA, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19(2):263–311

  • Callison-Burch C, Dredze M (2010) Creating speech and language data with Amazon’s Mechanical Turk. In: Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk, Montreal, pp 1–12

  • Chao W, Li Z (2011a) A graph-based bilingual corpus selection approach for SMT. In: Proceedings of the 25th Pacific Asia conference on language, information and computation, Singapore, pp 120–129

  • Chao W, Li Z (2011b) Improved graph-based bilingual corpus selection with sentence pair ranking for statistical machine translation. In: Proceedings of the 23rd IEEE international conference on tools with artificial intelligence, Boca Raton, pp 446–451

  • Chen B, Kuhn R, Foster G (2014) A comparison of mixture and vector space techniques for translation model adaptation. In: AMTA 2014, Proceedings of the 11th conference of the association for machine translation in the Americas, Vol 1, MT Researchers Track, Vancouver, pp 124–138

  • Chiang D (2005) A hierarchical phrase-based model for statistical machine translation. In: 43rd annual meeting of the association for computational linguistics, Ann Arbor, pp 263–270

  • Cover TM, Thomas JA (1991) Elements of information theory. Wiley-Interscience, New York

  • Dasgupta S, Hsu D (2008) Hierarchical sampling for active learning. In: Proceedings of the 25th international conference on machine learning, Helsinki, pp 208–215

  • Denkowski M, Hanneman G, Lavie A (2012) The CMU-Avenue French–English translation system. In: Proceedings of the NAACL 2012 workshop on statistical machine translation, Montreal, pp 261–266

  • Dyer C, Cordova A, Mont A, Lin J (2008) Fast, easy, and cheap: construction of statistical machine translation models with MapReduce. In: Proceedings of the third workshop on statistical machine translation, Columbus, pp 199–207

  • Eck M, Vogel S, Waibel A (2005) Low cost portability for statistical machine translation based on n-gram frequency and TF-IDF. In: IWSLT 2005, Proceedings of the international workshop on spoken language translation: evaluation campaign on spoken language translation, Pittsburgh

  • Eetemadi S, Radha H (2010) Effects of parallel corpus selection on statistical machine translation quality. In: NW-NLP 2010: Proceedings of the Pacific Northwest regional NLP workshop, Redmond

  • Gangadharaiah R, Brown R, Carbonell J (2009) Active learning in example-based machine translation. In: Proceedings of the 17th Nordic conference of computational linguistics NODALIDA, Odense, pp 227–230

  • Goodman J, Gao J (2000) Language model size reduction by pruning and clustering. In: Proceedings of the international conference on spoken language processing, Beijing, pp 110–113

  • Goutte C, Carpuat M, Foster G (2012) The impact of sentence alignment errors on phrase-based machine translation performance. In: Proceedings of the 10th conference of association for machine translation in the Americas, San Diego

  • Haffari G (2009) Machine learning approaches for dealing with limited bilingual training data in statistical machine translation. Ph.D. thesis, Simon Fraser University, Burnaby

  • Haffari G, Roy M, Sarkar A (2009) Active learning for statistical phrase-based machine translation. In: Proceedings of human language technologies: the 2009 annual conference of the North American chapter of the association for computational linguistics, Boulder, pp 415–423

  • Han X, Li H, Zhao T (2009) Train the machine with what it can learn: corpus selection for SMT. In: BUCC-09, Proceedings of the 2nd workshop on building and using comparable corpora: from parallel to non-parallel corpora, Singapore, pp 27–33

  • Jiang J, Way A, Carson-Berndsen J (2010) Lattice score-based data cleaning for phrase-based statistical machine translation. In: Proceedings of the 14th annual conference of the European association for machine translation, Saint-Raphaël

  • Kauchak D (2006) Contributions to research on machine translation. Ph.D. thesis, UC San Diego, San Diego

  • Khadivi S, Ney H (2005) Automatic filtering of bilingual corpora for statistical machine translation. Nat Lang Process Inform Syst 3513:263–274

  • Kirchhoff K, Bilmes J (2014) Submodularity for data selection in machine translation. In: EMNLP 2014, The 2014 conference on empirical methods in natural language processing, Proceedings of the conference, Doha, pp 131–141

  • Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of MT Summit X, Phuket, pp 79–86

  • Koehn P (2009) Statistical machine translation. Cambridge University Press, Cambridge

  • Koehn P, Och F, Marcu D (2003) Statistical phrase-based translation. In: Proceedings of the joint human language technology conference and the annual meeting of the North American chapter of the association for computational linguistics (HLT-NAACL), Edmonton, pp 127–133

  • Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86

  • Lewis W, Eetemadi S (2013) Dramatically reducing training data size through vocabulary saturation. In: Proceedings of the eighth workshop on statistical machine translation, Sofia, pp 281–291

  • Liu P, Zhou Y, Zong C (2009) Approach to selecting best development set for phrase-based statistical machine translation. In: Proceedings of the 23rd Pacific Asia conference on language, information and computation, Hong Kong, pp 325–334

  • Liu P, Zhou Y, Zong C (2010a) Data selection for statistical machine translation. In: Proceedings of the international conference on natural language processing and knowledge engineering (NLP-KE), Beijing, pp 1–5

  • Liu P, Zhou Y, Zong CQ (2010b) Approaches to improving corpus quality for statistical machine translation. In: Proceedings of the international conference on machine learning and cybernetics, vol 6, Qingdao, pp 3293–3298

  • Lü Y, Huang J, Liu Q (2007) Improving statistical machine translation performance by training data selection and optimization. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), Prague, pp 343–350

  • Mandal A, Vergyri D, Wang W, Zheng J, Stolcke A, Tur G, Hakkani-Tur D, Ayan NF (2008) Efficient data selection for machine translation. In: Proceedings of the spoken language technology workshop, Goa, pp 261–264

  • Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT press, Cambridge

  • Moore RC, Lewis W (2010) Intelligent selection of language model training data. In: ACL 2010, The 48th annual meeting of the association for computational linguistics, conference proceedings, Uppsala, pp 220–224

  • Munteanu DS, Marcu D (2005) Improving machine translation performance by exploiting non-parallel corpora. Comput Linguist 31(4):477–504

  • Okita T (2009) Data cleaning for word alignment. In: Proceedings of the ACL-IJCNLP 2009 student research workshop, Singapore, pp 72–80

  • Okita T, Naskar SK, Way A (2009) Noise reduction experiments in machine translation. In: ECML-PKDD, Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases, Bled

  • Ozdowska S, Way A (2009) Optimal bilingual data for French–English PB-SMT. In: Proceedings of EAMT-09, the 13th annual conference of the European association for machine translation, Barcelona, pp 96–103

  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: ACL-2002: 40th annual meeting of the association for computational linguistics, proceedings of the conference, Philadelphia, pp 311–318

  • Pecina P, Toral A, Papavassiliou V, Prokopidis P, Tamchyna A, Way A, Van Genabith J (2014) Domain adaptation of statistical machine translation using web-crawled resources and model parameter tuning. Lang Resour Eval 49(1):147–193

  • Resnik P (1999) Mining the web for bilingual text. In: ACL-1999: 37th annual meeting of the association for computational linguistics: proceedings of the conference, College Park, pp 527–534

  • Schütze H, Velipasaoglu E, Pedersen JO (2006) Performance thresholding in practical text classification. In: Proceedings of the 15th ACM international conference on Information and knowledge management, Kansas City, pp 662–671

  • Settles B (2010) Active learning literature survey. Tech. rep., University of Wisconsin, Madison, WI. URL http://burrsettles.com/pub/settles.activelearning.pdf

  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: AMTA 2006: Proceedings of the 7th conference of the association for machine translation in the Americas, visions for the future of machine translation, Cambridge, Massachusetts, pp 223–231

  • Somers H (2005) Round trip translation: What is it good for? In: ALTW 2005: Proceedings of Australasian language technology workshop, Australia, pp 127–133

  • Specia L, Raj D, Turchi M (2010) Machine translation evaluation versus quality estimation. Mach Transl 24(1):39–50

  • Taghipour K, Afhami N, Khadivi S, Shiry S (2010) A discriminative approach to filter out noisy sentence pairs from bilingual corpora. In: Proceedings of the 5th international symposium on telecommunications, Tehran, pp 537–541

  • Ueffing N, Ney H (2007) Word-level confidence estimation for machine translation. Comput Linguist 33(1):9–40

  • Vogel S, Ney H, Tillmann C (1996) HMM-based word alignment in statistical translation. In: Proceedings of the 16th conference on computational linguistics, vol 2, Copenhagen, pp 836–841

  • Wei K, Liu Y, Kirchhoff K, Bilmes J (2013) Using document summarization techniques for speech data subset selection. In: Proceedings of the 2013 Conference of the North American chapter of the association for computational linguistics: human language technologies, Atlanta, pp 721–726

  • Wuebker J, Mauser A, Ney H (2010) Training phrase translation models with leaving-one-out. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, pp 475–484

  • Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd annual meeting of the association for computational linguistics, Cambridge, Massachusetts, pp 189–196

  • Yasuda K, Zhang R, Yamamoto H, Sumita E (2008) Method of selecting training data to build a compact and efficient translation model. In: Proceedings of the third international joint conference on natural language processing, vol 2, Hyderabad, pp 655–660

  • Zens R, Stanton D, Xu P (2012) A systematic comparison of phrase table pruning techniques. In: Proceedings of the 2012 Joint conference on empirical methods in natural language processing and computational natural language learning, Jeju, pp 972–983

Acknowledgments

We thank the reviewers, and in particular the editor, who went above and beyond in helping us make this a better paper.

Author information

Correspondence to Sauleh Eetemadi.

About this article

Cite this article

Eetemadi, S., Lewis, W., Toutanova, K. et al. Survey of data-selection methods in statistical machine translation. Machine Translation 29, 189–223 (2015). https://doi.org/10.1007/s10590-015-9176-1
