Abstract
This paper presents an evaluation of automated metrics developed to assess machine translation (MT) technology. We first discuss the general usefulness of automated metrics, then describe the NIST MetricsMATR evaluation of MT metrology, including its objectives, protocols, participants, and test data. The methodology used to evaluate the submitted metrics is reviewed, and the general classes of evaluated metrics are summarized. Overall results are presented primarily as correlation statistics that quantify the degree of agreement between automated metric scores and human judgments. Metrics are analyzed at the sentence, document, and system levels, with results conditioned on various properties of the test data. The paper concludes with some perspective on the improvements that should be incorporated into future evaluations of metrics for MT evaluation.
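The core of the meta-evaluation the abstract describes is correlating automated metric scores against human judgments. As an illustrative sketch only (the data values and function names below are invented, not drawn from the paper), the following stdlib-only Python computes Pearson and Spearman rank correlations between hypothetical system-level metric scores and mean human adequacy judgments:

```python
# Illustrative sketch of correlation-based meta-evaluation of MT metrics.
# All scores below are invented for demonstration; they are not MetricsMATR data.
import math

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def ranks(v):
    """Ranks of v's values (1-based), with ties given their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied values
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical system-level scores: one automated metric score and one
# mean human-judgment score per MT system.
metric = [0.42, 0.35, 0.51, 0.28, 0.46]
human = [4.1, 3.6, 4.5, 3.0, 4.2]
print(round(pearson(metric, human), 3))
print(round(spearman(metric, human), 3))
```

In practice a library such as `scipy.stats` (with `pearsonr`, `spearmanr`, and `kendalltau`) would be used rather than hand-rolled functions; the sketch only makes the underlying computation explicit.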
Przybocki, M., Peterson, K., Bronsart, S. et al. The NIST 2008 Metrics for machine translation challenge—overview, methodology, metrics, and results. Machine Translation 23, 71–103 (2009). https://doi.org/10.1007/s10590-009-9065-6