
Survey of data-selection methods in statistical machine translation

Machine Translation

Abstract

Statistical machine translation has seen significant improvements in quality over the past several years. The single biggest factor in this improvement has been the accumulation of ever larger stores of data. We now find ourselves, however, the victims of our own success, in that it has become increasingly difficult to train on such large sets of data, due to limitations in memory, processing power, and ultimately, speed (i.e. turning data into models takes an inordinate amount of time). Moreover, the training data varies widely in quality. A variety of methods for data cleaning and data selection have been developed to address these issues. Each of these methods employs a search or filtering algorithm to select a subset of the data, given a defined set of feature functions. In this paper we provide a comparative overview of research in this area based on application scenario, feature functions and search method.

Notes

  1. One point to acknowledge here is the extent to which the data supplied is optimally used. Ozdowska and Way (2009) provide comparative scores of systems built using different subsets of Europarl data (Koehn 2005). As might be expected, performance is highest for language pairs which use parallel data constructed with that particular language direction in mind, i.e. a French-to-English MT system works best when the original language data was French, and subsequently translated into English, as opposed to (i) where the original language data was English, and translated into French, or (ii) where the source language was neither French nor English. This is an important point, yet remains largely ignored by the wider community; system developers tend to value larger amounts of (say) French–English parallel data rather than select segments of a data set based on what the original language was.

  2. Low-resource languages are languages that have relatively little monolingual or parallel data available for training, tuning and evaluation.

  3. Free translation, or freely translated text, conveys the overall meaning of a sentence or phrase without necessarily a word-for-word correspondence between source and translated text, in contrast to literal, direct or word-for-word translation (Han et al. 2009).

  4. The terms “translation units” or “utterances” could also be used; they are functionally equivalent in this context.

  5. Size can also be measured in the total number of characters, tokens, token types, etc., depending on the scenario.

  6. The number of subsets of size m that can be drawn from a set of size n is \(\binom{n}{m} = \frac{n!}{m!\,(n-m)!}\). Since this value grows exponentially in n (for \(m = n/2\) it is on the order of \(2^n/\sqrt{n}\)), it is impractical to do an exhaustive search enumerating all subsets.
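
As a quick numerical illustration of this growth (a minimal sketch using Python's standard library; the corpus sizes are invented):

```python
from math import comb

# Number of size-m subsets of an n-sentence corpus: n! / (m! * (n - m)!)
for n in (10, 100, 1000):
    m = n // 2
    print(f"n={n:>4}, m={m:>3}: {comb(n, m):.3e} candidate subsets")
# n=  10, m=  5: 2.520e+02 candidate subsets
# n= 100, m= 50: 1.009e+29 candidate subsets
# n=1000, m=500: 2.703e+299 candidate subsets
```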

  7. We will use BLEU (Bilingual Evaluation Understudy) as the automated metric here, but other measures are certainly viable. BLEU is the most popular automatic evaluation metric and is based on n-gram matches between translation output and reference translations (Papineni et al. 2002; Koehn 2009).
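
For concreteness, a minimal single-reference, sentence-level BLEU sketch; the official metric is corpus-level and adds smoothing details, and the example sentences here are invented:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hyp, ref, max_n=4):
    """Single-reference, sentence-level BLEU without smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
        # Clip: each hypothesis n-gram is credited at most as often
        # as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(matched / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any zero n-gram precision zeroes the score
    brevity_penalty = min(1.0, exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * exp(sum(log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat".split(),
           "the cat sat on the mat today".split()))  # ≈ 0.85 (brevity penalty only)
```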

  8. In this paper, unless otherwise specified, word n-grams are referred to simply as n-grams for brevity. A word n-gram is a sequence of n words that appears in natural language text.
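
A small illustrative extractor (the helper name is ours):

```python
def word_ngrams(tokens, n):
    """All contiguous word n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("to be or not to be".split(), 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```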

  9. Batch-learning techniques are often used to make data-selection methods practical. See Sect. 9 for details.

  10. Term Frequency–Inverse Document Frequency.
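
A minimal sketch of one common TF-IDF variant (relative term frequency, unsmoothed logarithmic IDF); the toy documents are invented:

```python
from collections import Counter
from math import log

docs = [d.split() for d in ("the translation model is large",
                            "the language model",
                            "parallel corpus data")]

def tf_idf(term, doc, docs):
    tf = Counter(doc)[term] / len(doc)   # within-document frequency
    df = sum(term in d for d in docs)    # documents containing the term
    return tf * log(len(docs) / df)      # rare terms weigh more

print(tf_idf("translation", docs[0], docs))  # distinctive term: higher weight
print(tf_idf("the", docs[0], docs))          # frequent across documents: lower weight
```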

  11. By definition, the conditional entropy \(H(Y|X)\) is \(-\sum _{x\in X, y\in Y} p(x,y)\log p(y|x)\). However, the definition provided for PhrEnt here is missing a \(P_{{\mathcal {TM}}(S_1^{j})}(p_{\scriptscriptstyle {\textit{s}}})\) factor inside the summation but outside the log. We recognize this inaccuracy but keep formula (14) consistent with the referenced work (Ambati 2011).

  12. The perplexity of a random variable is defined as two (or whichever base the entropy is computed in) raised to the power of its entropy. In natural language processing this quantity is commonly used to measure how surprised a language model is when observing a sequence of words.
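
A minimal sketch of this relationship for a known discrete distribution; per-word perplexity of a language model over a sentence is computed analogously from per-word log probabilities:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

def perplexity(probs):
    return 2 ** entropy(probs)

print(perplexity([0.25] * 4))         # 4.0: as uncertain as a fair four-way choice
print(perplexity([0.9, 0.05, 0.05]))  # ≈ 1.48: a confident model is rarely surprised
```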

  13. For all practical purposes in their work (Liu et al. 2010a), a phrase is the same as an n-gram, since they take the phrases of a sentence to be all of its n-grams up to a certain length.

  14. Kullback–Leibler divergence is an information-theoretic measure of the difference between two probability distributions (Kullback and Leibler 1951); it is asymmetric, so it is not a distance in the metric sense.
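
A minimal sketch over discrete distributions, illustrating the asymmetry; the distributions are invented:

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(P || Q) in bits over a shared discrete support."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q))  # ≈ 0.737
print(kl_divergence(q, p))  # ≈ 0.531: asymmetric, hence not a metric distance
```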

  15. Translation Edit Rate (TER) measures the amount of editing a human would need to perform on an MT output to exactly match a reference translation (Snover et al. 2006).
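
A sketch of the edit-distance core only; full TER additionally allows block shifts at unit cost, and normalizes the edit count by reference length as the final line does here. The example sentences are invented:

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1]

hyp = "the cat sat on mat".split()
ref = "the cat sat on the mat".split()
print(word_edit_distance(hyp, ref) / len(ref))  # ≈ 0.17: one edit per six reference words
```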

  16. The seed corpus can be used for this purpose as well.

  17. Entropy, H(X), measures the amount of information or uncertainty in a random variable (Manning and Schütze 1999).

  18. The cross-entropy of a random variable X, with true probability distribution P(X) and estimated probability distribution Q(X), is formally defined as the entropy of X plus the KL-divergence between P(X) and Q(X): \(H(P(X),Q(X)) = H(P(X)) + D_{{\scriptscriptstyle {\textit{KL}}}}(P(X)||Q(X))\) (Cover and Thomas 1991).
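
A quick numerical check of this decomposition on an invented pair of distributions:

```python
from math import log2

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]  # true and estimated distributions

entropy_p = -sum(pi * log2(pi) for pi in p)
kl_pq = sum(pi * log2(pi / qi) for pi, qi in zip(p, q))
cross_entropy = -sum(pi * log2(qi) for pi, qi in zip(p, q))

# H(P, Q) = H(P) + D_KL(P || Q) holds exactly (up to float rounding).
assert abs(cross_entropy - (entropy_p + kl_pq)) < 1e-12
print(f"H(P)={entropy_p:.4f}  KL={kl_pq:.4f}  H(P,Q)={cross_entropy:.4f}")
```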

  19. A word-alignment model is a statistical model trained and used to align individual words in a sentence with their translations in the translated sentence. A word-aligned sentence pair contains alignment links connecting words in one sentence to words in the other (Brown et al. 1993).

  20. In their work, Khadivi and Ney use IBM model 1, HMM and IBM model 2 in succession to train the final model. However, their scoring function does not depend on any particular alignment model, so any alignment model can be used.

  21. We think a more natural derivation of alignment entropy for a sentence is to average the alignment entropies of its source or target words over all possible alignments. This would measure how uncertain an alignment model is about the word alignments in a sentence, which is a more natural use of entropy than the uncertainty about a word given its number of Viterbi alignment links.

  22. In word alignment, the fertility of a word is defined as the number of alignment links initiated from that word.
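
A minimal sketch over an invented set of alignment links, where each link is a (source index, target index) pair:

```python
from collections import Counter

# Invented alignment links for one sentence pair; source word 1 is linked
# to two target words, so it has fertility 2.
links = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 3)]

fertility = Counter(src for src, _ in links)  # links initiated per source word
print(fertility)  # Counter({1: 2, 0: 1, 2: 1, 3: 1})
```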

  23. The weights are trained using a manually analyzed development set of size 1000.

  24. A set function is modular if and only if its value over any set equals the sum of its values over the set’s individual elements.
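
A minimal sketch of this definition; the element weights are invented:

```python
# Per-element weights fully determine a modular function: f(S) = sum of w(e) for e in S.
w = {"a": 2.0, "b": 3.5, "c": 1.0}

def f(subset):
    return sum(w[e] for e in subset)

# Modularity: no interaction between elements, unlike submodular functions,
# whose marginal gains diminish as the set grows.
assert f({"a", "b"}) == f({"a"}) + f({"b"})
print(f({"a", "b", "c"}))  # 6.5
```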

  25. An oracle can be a human or a human-labeled data set that provides the true label for a query. In the context of MT, the oracle is a human translator or a parallel corpus where the source text has already been translated by humans.

  26. In this batch-mode learning strategy, the idea is to avoid retraining all SMT models after each sentence is selected. Instead, a less expensive update, in which only the scoring function is refreshed, is used to preserve the diversity of the batch. Only after the entire batch has been selected are all SMT models updated.

  27. Hierarchical sampling attempts to leverage the cluster structure of the data for sampling in an active-learning setting (Dasgupta and Hsu 2008). In this work, a static hierarchical clustering of the unlabeled data is given, and a set of cluster nodes available for sampling is maintained throughout the algorithm. Initially, the sampling set contains only the root node of the hierarchy, which covers all data points. Random samples are drawn from the sampling set and their labels are queried from the oracle. Based on these queries, each node in the hierarchy maintains statistics about its positive and negative labels. Cluster nodes in the sampling set that have mixed labels but purer child nodes are removed from the sampling set, and their child nodes are added in their place. These steps are repeated until all nodes reach a predefined level of purity. The method is motivated by the “sampling bias” problem (Schütze et al. 2006) in active learning and provides theoretical guarantees of better learning performance than random sampling.
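
A much-simplified sketch of the idea; the purity threshold, minimum sample count, and data structures here are illustrative choices of ours, not taken from Dasgupta and Hsu (2008):

```python
import random

class Node:
    def __init__(self, items, children=()):
        self.items = items        # data indices covered by this cluster node
        self.children = children  # the hierarchy itself is given and static
        self.counts = {}          # observed label -> count at this node

def purity(node):
    return max(node.counts.values()) / sum(node.counts.values())

def hierarchical_sampling(root, oracle, budget, threshold=0.9, min_samples=5):
    """Sample within active cluster nodes; split nodes that look impure."""
    active, labeled = [root], {}
    for _ in range(budget):
        node = random.choice(active)
        item = random.choice(node.items)  # may re-query an item in this sketch
        labeled[item] = oracle(item)
        node.counts[labeled[item]] = node.counts.get(labeled[item], 0) + 1
        # Replace a sufficiently sampled, impure node with its children.
        if (node.children and sum(node.counts.values()) >= min_samples
                and purity(node) < threshold):
            active.remove(node)
            active.extend(node.children)
    return labeled

leaves = [Node([0, 1]), Node([2, 3])]
root = Node([0, 1, 2, 3], children=leaves)
print(hierarchical_sampling(root, oracle=lambda i: i < 2, budget=20))
```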

References

  • Adesam Y (2012) The multilingual forest: investigating high-quality parallel corpus development. PhD thesis, Stockholm University, Stockholm

  • Allauzen A, Bonneau-Maynard H, Le HS, Max A, Wisniewski G, Yvon F, Adda G, Crego JM, Lardilleux A, Lavergne T, Sokolov A (2011) LIMSI @ WMT11. In: Proceedings of the sixth workshop on statistical machine translation, Edinburgh, pp 309–315

  • Ambati V (2011) Active learning and crowdsourcing for machine translation in low resource scenarios. Ph.D. thesis, Carnegie Mellon University, Pittsburgh

  • Ambati V, Vogel S, Carbonell J (2010) Active learning and crowd-sourcing for machine translation. In: Proceedings of the seventh conference on international language resources and evaluation (LREC’10), Valletta, vol 7, pp 2169–2174

  • Ananthakrishnan S, Prasad R, Stallard D, Natarajan P (2010a) Discriminative sample selection for statistical machine translation. In: Proceedings of the 2010 conference on empirical methods in natural language processing, Cambridge, pp 626–635

  • Ananthakrishnan S, Prasad R, Stallard D, Natarajan P (2010b) A semi-supervised batch-mode active learning strategy for improved statistical machine translation. In: Proceedings of the fourteenth conference on computational natural language learning, Uppsala, pp 126–134

  • Axelrod A, He X, Gao J (2011) Domain adaptation via pseudo in-domain data selection. In: Proceedings of the conference on empirical methods in natural language processing, Edinburgh, pp 355–362

  • Axelrod A, Li Q, Lewis W (2012) Applications of data selection via cross-entropy difference for real-world statistical machine translation. In: Proceedings of the international workshop on spoken language translation, Hong Kong, pp 201–108

  • Banerjee P, Naskar S, Roturier J, Way A, van Genabith J (2011) Domain adaptation in statistical machine translation of user-forum data using component level mixture modelling. In: Proceedings of machine translation summit XIII, Xiamen, pp 285–292

  • Biçici E, Yuret D (2011) Instance selection for machine translation using feature decay algorithms. In: Proceedings of the sixth workshop on statistical machine translation, Edinburgh, pp 272–283

  • Biçici E, Liu Q, Way A (2014) Parallel FDA5 for fast deployment of accurate statistical machine translation systems. In: Proceedings of the ninth workshop on statistical machine translation, Baltimore, pp 59–65

  • Biçici E, Liu Q, Way A (2015) ParFDA for fast deployment of accurate statistical machine translation systems, benchmarks, and statistics. In: Proceedings of the tenth workshop on statistical machine translation, Lisbon, pp 74–78

  • Bloodgood M, Callison-Burch C (2010) Bucking the trend: Large-scale cost-focused active learning for statistical machine translation. In: ACL 2010, The 48th annual meeting of the association for computational linguistics, conference proceedings, Uppsala, pp 854–864

  • Brown PF, Della Pietra VJ, Della Pietra SA, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19(2):263–311

  • Callison-Burch C, Dredze M (2010) Creating speech and language data with Amazon’s Mechanical Turk. In: Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk, Montreal, pp 1–12

  • Chao W, Li Z (2011a) A graph-based bilingual corpus selection approach for SMT. In: Proceedings of the 25th Pacific Asia conference on language, information and computation, Singapore, pp 120–129

  • Chao W, Li Z (2011b) Improved graph-based bilingual corpus selection with sentence pair ranking for statistical machine translation. In: Proceedings of the 23rd IEEE international conference on tools with artificial intelligence, Boca Raton, pp 446–451

  • Chen B, Kuhn R, Foster G (2014) A comparison of mixture and vector space techniques for translation model adaptation. In: AMTA 2014, Proceedings of the 11th conference of the association for machine translation in the Americas, Vol 1, MT Researchers Track, Vancouver, pp 124–138

  • Chiang D (2005) A hierarchical phrase-based model for statistical machine translation. In: 43rd annual meeting of the association for computational linguistics, Ann Arbor, pp 263–270

  • Cover TM, Thomas JA (1991) Elements of information theory. Wiley-Interscience, New York

  • Dasgupta S, Hsu D (2008) Hierarchical sampling for active learning. In: Proceedings of the 25th international conference on machine learning, Helsinki, pp 208–215

  • Denkowski M, Hanneman G, Lavie A (2012) The CMU-Avenue French–English translation system. In: Proceedings of the NAACL 2012 workshop on statistical machine translation, Montreal, pp 261–266

  • Dyer C, Cordova A, Mont A, Lin J (2008) Fast, easy, and cheap: construction of statistical machine translation models with MapReduce. In: Proceedings of the third workshop on statistical machine translation, Columbus, pp 199–207

  • Eck M, Vogel S, Waibel A (2005) Low cost portability for statistical machine translation based on n-gram frequency and TF-IDF. In: IWSLT 2005, Proceedings of the international workshop on spoken language translation: evaluation campaign on spoken language translation, Pittsburgh

  • Eetemadi S, Radha H (2010) Effects of parallel corpus selection on statistical machine translation quality. In: NW-NLP 2010: Proceedings of the Pacific Northwest regional NLP workshop, Redmond

  • Gangadharaiah R, Brown R, Carbonell J (2009) Active learning in example-based machine translation. In: Proceedings of the 17th Nordic conference of computational linguistics NODALIDA, Odense, pp 227–230

  • Goodman J, Gao J (2000) Language model size reduction by pruning and clustering. In: Proceedings of the international conference on spoken language processing, Beijing, pp 110–113

  • Goutte C, Carpuat M, Foster G (2012) The impact of sentence alignment errors on phrase-based machine translation performance. In: Proceedings of the 10th conference of association for machine translation in the Americas, San Diego

  • Haffari G (2009) Machine learning approaches for dealing with limited bilingual training data in statistical machine translation. Ph.D. thesis, Simon Fraser University, Burnaby

  • Haffari G, Roy M, Sarkar A (2009) Active learning for statistical phrase-based machine translation. In: Proceedings of human language technologies: the 2009 annual conference of the North American chapter of the association for computational linguistics, Boulder, pp 415–423

  • Han X, Li H, Zhao T (2009) Train the machine with what it can learn: corpus selection for SMT. In: BUCC-09, Proceedings of the 2nd workshop on building and using comparable corpora: from parallel to non-parallel corpora, Singapore, pp 27–33

  • Jiang J, Way A, Carson-Berndsen J (2010) Lattice score-based data cleaning for phrase-based statistical machine translation. In: Proceedings of the 14th annual conference of the European association for machine translation, Saint-Raphaël

  • Kauchak D (2006) Contributions to research on machine translation. Ph.D. thesis, UC San Diego, San Diego

  • Khadivi S, Ney H (2005) Automatic filtering of bilingual corpora for statistical machine translation. Nat Lang Process Inform Syst 3513:263–274

  • Kirchhoff K, Bilmes J (2014) Submodularity for data selection in machine translation. In: EMNLP 2014, The 2014 conference on empirical methods in natural language processing, Proceedings of the conference, Doha, pp 131–141

  • Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of MT Summit X, Phuket, pp 79–86

  • Koehn P (2009) Statistical machine translation. Cambridge University Press, Cambridge

  • Koehn P, Och F, Marcu D (2003) Statistical phrase-based translation. In: Proceedings of the joint human language technology conference and the annual meeting of the North American chapter of the association for computational linguistics (HLT-NAACL), Edmonton, pp 127–133

  • Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86

  • Lewis W, Eetemadi S (2013) Dramatically reducing training data size through vocabulary saturation. In: Proceedings of the eighth workshop on statistical machine translation, Sofia, pp 281–291

  • Liu P, Zhou Y, Zong C (2009) Approach to selecting best development set for phrase-based statistical machine translation. In: Proceedings of the 23rd Pacific Asia conference on language, information and computation, Hong Kong, pp 325–334

  • Liu P, Zhou Y, Zong C (2010a) Data selection for statistical machine translation. In: Proceedings of the international conference on natural language processing and knowledge engineering (NLP-KE), Beijing, pp 1–5

  • Liu P, Zhou Y, Zong CQ (2010b) Approaches to improving corpus quality for statistical machine translation. In: Proceedings of the international conference on machine learning and cybernetics, vol 6, Qingdao, pp 3293–3298

  • Lü Y, Huang J, Liu Q (2007) Improving statistical machine translation performance by training data selection and optimization. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), Prague, pp 343–350

  • Mandal A, Vergyri D, Wang W, Zheng J, Stolcke A, Tur G, Hakkani-Tur D, Ayan NF (2008) Efficient data selection for machine translation. In: Proceedings of the spoken language technology workshop, Goa, pp 261–264

  • Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT press, Cambridge

  • Moore RC, Lewis W (2010) Intelligent selection of language model training data. In: ACL 2010, The 48th annual meeting of the association for computational linguistics, conference proceedings, Uppsala, pp 220–224

  • Munteanu DS, Marcu D (2005) Improving machine translation performance by exploiting non-parallel corpora. Comput Linguist 31(4):477–504

  • Okita T (2009) Data cleaning for word alignment. In: Proceedings of the ACL-IJCNLP 2009 student research workshop, Singapore, pp 72–80

  • Okita T, Naskar SK, Way A (2009) Noise reduction experiments in machine translation. In: ECML-PKDD, Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases, Bled

  • Ozdowska S, Way A (2009) Optimal bilingual data for French–English PB-SMT. In: Proceedings of EAMT-09, the 13th annual conference of the European association for machine translation, Barcelona, pp 96–103

  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: ACL-2002: 40th annual meeting of the association for computational linguistics, proceedings of the conference, Philadelphia, pp 311–318

  • Pecina P, Toral A, Papavassiliou V, Prokopidis P, Tamchyna A, Way A, Van Genabith J (2014) Domain adaptation of statistical machine translation using web-crawled resources and model parameter tuning. Lang Resour Eval 49(1):147–193

  • Resnik P (1999) Mining the web for bilingual text. In: ACL-1999: 37th annual meeting of the association for computational linguistics: proceedings of the conference, College Park, pp 527–534

  • Schütze H, Velipasaoglu E, Pedersen JO (2006) Performance thresholding in practical text classification. In: Proceedings of the 15th ACM international conference on Information and knowledge management, Kansas City, pp 662–671

  • Settles B (2010) Active learning literature survey. Tech. rep., University of Wisconsin, Madison, WI. URL http://burrsettles.com/pub/settles.activelearning.pdf

  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: AMTA 2006: Proceedings of the 7th conference of the association for machine translation in the Americas, visions for the future of machine translation, Cambridge, Massachusetts, pp 223–231

  • Somers H (2005) Round trip translation: What is it good for? In: ALTW 2005: Proceedings of Australasian language technology workshop, Australia, pp 127–133

  • Specia L, Raj D, Turchi M (2010) Machine translation evaluation versus quality estimation. Mach Transl 24(1):39–50

  • Taghipour K, Afhami N, Khadivi S, Shiry S (2010) A discriminative approach to filter out noisy sentence pairs from bilingual corpora. In: Proceedings of the 5th international symposium on telecommunications, Tehran, pp 537–541

  • Ueffing N, Ney H (2007) Word-level confidence estimation for machine translation. Comput Linguist 33(1):9–40

  • Vogel S, Ney H, Tillmann C (1996) HMM-based word alignment in statistical translation. In: Proceedings of the 16th conference on computational linguistics, vol 2, Copenhagen, pp 836–841

  • Wei K, Liu Y, Kirchhoff K, Bilmes J (2013) Using document summarization techniques for speech data subset selection. In: Proceedings of the 2013 Conference of the North American chapter of the association for computational linguistics: human language technologies, Atlanta, pp 721–726

  • Wuebker J, Mauser A, Ney H (2010) Training phrase translation models with leaving-one-out. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, pp 475–484

  • Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd annual meeting of the association for computational linguistics, Cambridge, Massachusetts, pp 189–196

  • Yasuda K, Zhang R, Yamamoto H, Sumita E (2008) Method of selecting training data to build a compact and efficient translation model. In: Proceedings of the third international joint conference on natural language processing, vol 2, Hyderabad, pp 655–660

  • Zens R, Stanton D, Xu P (2012) A systematic comparison of phrase table pruning techniques. In: Proceedings of the 2012 Joint conference on empirical methods in natural language processing and computational natural language learning, Jeju, pp 972–983

Acknowledgments

We thank the reviewers, and in particular the editor, who went above and beyond in helping us make this a better paper.

Author information

Correspondence to Sauleh Eetemadi.

About this article

Cite this article

Eetemadi, S., Lewis, W., Toutanova, K. et al. Survey of data-selection methods in statistical machine translation. Machine Translation 29, 189–223 (2015). https://doi.org/10.1007/s10590-015-9176-1
