
Regression for machine translation evaluation at the sentence level

Published in: Machine Translation

Abstract

Machine learning offers a systematic framework for developing metrics that use multiple criteria to assess the quality of machine translation (MT). However, learning introduces additional complexities that may affect the resulting metric’s effectiveness. First, a learned metric is more reliable for translations that are similar to its training examples; this calls into question whether it is as effective in evaluating translations from systems that are not its contemporaries. Second, metrics trained from different sets of training examples may exhibit variations in their evaluations. Third, expensive developmental resources (such as translations that have been evaluated by humans) may be needed as training examples. This paper investigates these concerns in the context of using regression to develop metrics for evaluating machine-translated sentences. We track a learned metric’s reliability across a five-year period to measure the extent to which it can evaluate sentences produced by other systems. We compare metrics trained under different conditions to measure their variations. Finally, we present an alternative formulation of metric training in which the features are based on comparisons against pseudo-references, reducing the demand on human-produced resources. Our results confirm that regression is a useful approach for developing new metrics for sentence-level MT evaluation.
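The regression formulation summarized above can be illustrated with a minimal sketch. This is not the paper's actual feature set or learner configuration: the n-gram precision features and toy data below are simplified stand-ins for the richer reference-comparison features described in the paper, and scikit-learn's SVR is assumed as the regression learner (the general approach trains a regressor to map such features to human quality judgments).

```python
# Minimal sketch of regression-based sentence-level MT evaluation.
# Feature functions and data are illustrative, not the paper's.
from sklearn.svm import SVR

def ngram_precision(hyp, ref, n):
    """Fraction of hypothesis n-grams that also appear in the reference."""
    hyp_ngrams = [tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1)]
    ref_ngrams = {tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)}
    if not hyp_ngrams:
        return 0.0
    return sum(g in ref_ngrams for g in hyp_ngrams) / len(hyp_ngrams)

def features(hyp, refs):
    """Compare a hypothesis against one or more (pseudo-)references."""
    toks = hyp.split()
    return [max(ngram_precision(toks, r.split(), n) for r in refs)
            for n in (1, 2)]

# Toy training data: (hypothesis, references, human quality score).
train = [
    ("the cat sat on the mat", ["the cat sat on the mat"], 5.0),
    ("cat the mat on", ["the cat sat on the mat"], 2.0),
    ("a dog ran quickly home", ["the dog ran home quickly"], 4.0),
]
X = [features(h, rs) for h, rs, _ in train]
y = [s for _, _, s in train]

# Train a support vector regressor to map features to human scores,
# then score a new hypothesis sentence.
model = SVR(kernel="linear").fit(X, y)
score = model.predict([features("the cat sat on a mat",
                                ["the cat sat on the mat"])])[0]
```

In the pseudo-reference variant, `refs` would hold output from other MT systems instead of human references; the regression step then learns how much evidence each comparison contributes to the predicted quality score.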



Author information

Correspondence to Rebecca Hwa.


About this article

Cite this article

Albrecht, J.S., Hwa, R. Regression for machine translation evaluation at the sentence level. Machine Translation 22, 1–27 (2008). https://doi.org/10.1007/s10590-008-9046-1

