A Comprehensive Survey on Various Fully Automatic Machine Translation Evaluation Metrics

Neural Processing Letters

Abstract

Rapid advances in machine translation models call for accurate evaluation metrics that allow researchers to track progress in translating text across languages. Evaluating machine translation models is crucial because the results are used to improve those models. Human evaluation, however, is expensive, time-consuming, and not reproducible, and fully automatic evaluation remains a major challenge in its own right. This paper presents a detailed classification and comprehensive survey of the fully automatic evaluation metrics used to assess the quality of machine-translated output. For clarity, the metrics are classified into five categories: lexical, character, semantic, syntactic, and semantic & syntactic. The survey takes into account the challenges that Statistical Machine Translation and Neural Machine Translation pose for machine translation evaluation, and discusses the advantages, disadvantages, and gaps of each fully automatic metric. The study will help machine translation researchers quickly identify the automatic evaluation metrics most appropriate for improving or developing their models, and will give researchers a general understanding of how research on automatic machine translation evaluation has evolved.

Author information

Corresponding author

Correspondence to Shweta Chauhan.

Ethics declarations

Conflict of interests

The authors do not have any conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Chauhan, S., Daniel, P. A Comprehensive Survey on Various Fully Automatic Machine Translation Evaluation Metrics. Neural Process Lett 55, 12663–12717 (2023). https://doi.org/10.1007/s11063-022-10835-4

  • DOI: https://doi.org/10.1007/s11063-022-10835-4
