
How to evaluate machine translation: A review of automated and human metrics

Published online by Cambridge University Press: 11 September 2019

Eirini Chatzikoumi
Affiliation: Instituto de Literatura y Ciencias del Lenguaje, Pontificia Universidad Católica de Valparaíso, Av. El Bosque 1290, Viña del Mar, Chile

Abstract

This article presents the most up-to-date and influential automated, semiautomated and human metrics used to evaluate the quality of machine translation (MT) output, and provides the necessary background for MT evaluation projects. Evaluation is, as is widely acknowledged, highly relevant for the improvement of MT. The article is divided into three parts: the first is dedicated to automated metrics; the second, to human metrics; and the last, to the challenges posed by neural machine translation (NMT) regarding its evaluation. The first part covers reference translation–based metrics; confidence or quality estimation (QE) metrics, which are used as alternatives for quality assessment; and diagnostic evaluation based on linguistic checkpoints. Human evaluation metrics are classified according to whether human judges directly express a so-called subjective evaluation judgment, such as ‘good’ or ‘better than’, or not, as is the case in error classification. The former methods are based on directly expressed judgment (DEJ) and are therefore called ‘DEJ-based evaluation methods’, while the latter are called ‘non-DEJ-based evaluation methods’. In the DEJ-based evaluation section, tasks such as fluency and adequacy annotation, ranking and direct assessment (DA) are presented, whereas in the non-DEJ-based evaluation section, tasks such as error classification and postediting are detailed, with definitions and guidelines, making this article a useful guide for evaluation projects. Following the detailed presentation of these metrics, the specificities of NMT are set out along with suggestions for its evaluation according to the latest studies. Since human translators are the best-qualified judges of translation quality, emphasis is placed on the human metrics seen from a translator-judge perspective, so as to provide useful methodological tools for interdisciplinary research groups that evaluate MT systems.
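Of the automated metrics surveyed, the reference translation–based family is the most widely used: the MT output is scored against one or more human reference translations, typically via n-gram overlap (e.g. BLEU, METEOR, chrF) or edit distance (e.g. WER, TER). As a rough illustration only, the Python sketch below computes a word-level edit-distance score in that spirit, assuming a single reference and simple whitespace tokenisation; the function names are hypothetical and this is not the official TER tool or any specific metric reviewed in the article.

```python
# Illustrative toy implementation of a reference-based, edit-distance-style
# MT metric (lower is better). Not an official WER/TER implementation.

def word_edit_distance(hyp_tokens, ref_tokens):
    """Levenshtein distance over word tokens (insertions, deletions, substitutions)."""
    m, n = len(hyp_tokens), len(ref_tokens)
    # dp[i][j] = edit distance between hyp_tokens[:i] and ref_tokens[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

def word_error_rate(hypothesis, reference):
    """Edit distance normalised by reference length; 0.0 means an exact match."""
    hyp, ref = hypothesis.split(), reference.split()
    return word_edit_distance(hyp, ref) / max(len(ref), 1)

if __name__ == "__main__":
    mt_output = "the cat sat in the mat"
    reference = "the cat sat on the mat"
    print(f"WER = {word_error_rate(mt_output, reference):.3f}")  # 1 edit / 6 ref words ≈ 0.167
```

A usage note: metrics of this kind are cheap and reproducible, which is why they dominate system development, but, as the article stresses, they only approximate the judgments that human evaluators (via fluency/adequacy annotation, ranking, DA, error classification or postediting) would make.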

Type: Survey Paper
Copyright: © Cambridge University Press 2019
