Abstract
In Minimum Error Rate Training (MERT), Bleu is often used as the error function, despite the fact that it has been shown to have a lower correlation with human judgment than other metrics such as Meteor and Ter. In this paper, we present empirical results showing that parameters tuned on Bleu can lead to sub-optimal Bleu scores under certain data conditions. Such scores can be improved significantly by tuning on a different metric, e.g. Meteor: on the WMT08 English–French data, this yields a gain of 0.0082 Bleu, a 3.38% relative improvement. We analyze the influence of the number of references and the choice of metric on the result of MERT, and experiment on different data sets. We show the problems of tuning on a metric that is not designed for the single-reference scenario and point out some possible solutions.
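The core idea the abstract relies on can be illustrated with a minimal sketch: MERT searches for the feature weights that maximize a chosen evaluation metric over n-best translation lists, so swapping the metric (e.g. Bleu for Meteor) changes which weights are selected. The n-best lists, feature values, and the toy unigram-precision metric below are all hypothetical stand-ins, and the exhaustive grid search stands in for Och's line-search procedure; this is not the authors' implementation.

```python
from collections import Counter

# Hypothetical n-best lists: for each source sentence, candidate
# translations paired with feature scores (e.g. LM and TM scores).
nbest = [
    [("the cat sat", [0.9, 0.2]), ("a cat sat", [0.5, 0.8])],
    [("dogs bark loud", [0.4, 0.9]), ("dog barks loudly", [0.8, 0.3])],
]
references = ["the cat sat", "dog barks loudly"]

def unigram_precision(hyp, ref):
    """Toy stand-in for an error function such as Bleu or Meteor."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum(min(c, r[w]) for w, c in h.items())
    return overlap / max(1, sum(h.values()))

def corpus_score(weights, metric):
    """Pick each sentence's best candidate under the weighted model,
    then score the resulting outputs with the given metric."""
    total = 0.0
    for cands, ref in zip(nbest, references):
        best = max(cands, key=lambda c: sum(w * f for w, f in zip(weights, c[1])))
        total += metric(best[0], ref)
    return total / len(references)

# Grid search over weight vectors, standing in for MERT's line search:
# the tuned weights depend on which metric serves as the error function.
grid = [(w1, 1.0 - w1) for w1 in [i / 10 for i in range(11)]]
best_weights = max(grid, key=lambda w: corpus_score(w, unigram_precision))
print(best_weights, corpus_score(best_weights, unigram_precision))
```

Replacing `unigram_precision` with a different metric can move the optimum to a different weight vector, which is the effect the paper studies empirically.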
References
Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. Ann Arbor, MI, pp 65–72
Callison-Burch C, Osborne M, Koehn P (2006) Re-evaluating the role of Bleu in machine translation research. In: EACL-2006, Proceedings of the 11th conference of the european chapter of the association for computational linguistics. Trento, Italy, pp 249–256
Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2008) Further meta-evaluation of machine translation. In: Proceedings of the third workshop on statistical machine translation. Columbus, OH, pp 70–106
Cer D, Jurafsky D, Manning C (2008) Regularization and search for minimum error rate training. In: Proceedings of the third workshop on statistical machine translation. Columbus, OH, pp 26–34
Chiang D, DeNeefe S, Chan YS, Ng HT (2008) Decomposability of translation metrics for improved evaluation and efficient algorithms. In: Proceedings of the 2008 conference on empirical methods in natural language processing. Honolulu, HI, pp 610–619
Dyer C, Setiawan H, Marton Y, Resnik P (2009) The University of Maryland statistical machine translation system for the fourth workshop on machine translation. In: Proceedings of the fourth workshop on statistical machine translation. Athens, Greece, pp 145–149
He Y, Way A (2009) Improving the objective function in minimum error rate training. In: Proceedings of the twelfth machine translation summit. Ottawa, ON, Canada, pp 238–245
Koehn P (2004) Statistical significance tests for machine translation evaluation. In: Proceedings of the 2004 conference on empirical methods in natural language processing (EMNLP-2004). Barcelona, Spain, pp 388–395
Lambert P, Giménez J, Costa-jussà MR, Amigó E, Banchs RE, Màrquez L, Fonollosa JAR (2006) Machine Translation system development based on human likeness. In: Proceedings of the IEEE/ACL workshop on spoken language technology. Palm Beach, Aruba, pp 246–249
Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8): 707–710
Macherey W, Och F, Thayer I, Uszkoreit J (2008) Lattice-based minimum error rate training for statistical machine translation. In: Proceedings of the 2008 conference on empirical methods in natural language processing. Honolulu, HI, pp 725–734
Moore RC, Quirk C (2008) Random restarts in minimum error rate training for statistical machine translation. In: Proceedings of the 22nd international conference on computational linguistics (Coling 2008). Manchester, UK, pp 585–592
Och FJ (2003) Minimum error rate training in statistical machine translation. In: 41st annual meeting of the association for computational linguistics. Sapporo, Japan, pp 160–167
Och FJ, Ney H (2002) Discriminative training and maximum entropy models for statistical machine translation. In: 40th annual meeting of the association for computational linguistics. Philadelphia, PA, pp 295–302
Owczarzak K, van Genabith J, Way A (2007) Labelled dependencies in machine translation evaluation. In: Proceedings of the second workshop on statistical machine translation. Prague, Czech Republic, pp 104–111
Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: 40th annual meeting of the association for computational linguistics. Philadelphia, PA, pp 311–318
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: AMTA 2006, Proceedings of the 7th conference of the association for machine translation in the Americas, visions for the future of machine translation. Cambridge, MA, pp 223–231
He, Y., Way, A. Metric and reference factors in minimum error rate training. Machine Translation 24, 27–38 (2010). https://doi.org/10.1007/s10590-010-9072-7