Abstract
The rapid advancement of machine translation models necessitates accurate evaluation metrics that allow researchers to track progress in the field. Evaluation is crucial because its results drive the improvement of translation models. However, fully automatic evaluation of machine translation output is itself a major challenge, since human evaluation is expensive, time-consuming, and difficult to reproduce. This paper presents a detailed classification of, and a comprehensive survey on, the fully automatic evaluation metrics used to assess the quality of machine-translated output. For ease of understanding, these metrics are grouped into five categories: lexical, character-based, semantic, syntactic, and combined semantic and syntactic metrics. The challenges that Statistical Machine Translation and Neural Machine Translation pose for evaluation are examined, and the advantages, disadvantages, and gaps of each metric are discussed. This survey will help machine translation researchers quickly identify the evaluation metrics most appropriate for developing or improving their models, and will give readers a general understanding of how research on automatic machine translation evaluation has evolved.
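To make the lexical category concrete, the sketch below shows, in simplified form, what the best-known lexical metric, BLEU, computes: clipped n-gram precision combined with a brevity penalty. This is a minimal single-reference illustration written for this summary, not code from the survey or the reference BLEU implementation; the whitespace tokenization and the sentence-level geometric mean are simplifying assumptions.

```python
# Minimal sketch of a BLEU-style lexical metric: clipped n-gram precision
# plus a brevity penalty, for a single hypothesis/reference pair.
# Simplified for illustration; real BLEU is computed at corpus level.
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU against one reference (whitespace-tokenized)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped ("modified") precision: each hypothesis n-gram count is
        # capped by its count in the reference, so repeated words are not
        # over-rewarded.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # a zero precision makes the geometric mean zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


print(bleu("the cat sat on the mat", "the cat sat on a mat"))  # ≈ 0.54
```

Character-based metrics such as chrF replace the word n-grams above with character n-grams, while semantic metrics compare learned representations rather than surface strings.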




Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Chauhan, S., Daniel, P. A Comprehensive Survey on Various Fully Automatic Machine Translation Evaluation Metrics. Neural Process Lett 55, 12663–12717 (2023). https://doi.org/10.1007/s11063-022-10835-4