
VERTa: a linguistic approach to automatic machine translation evaluation

Original Paper

Language Resources and Evaluation

Abstract

Machine translation (MT) is closely tied to its evaluation, both to compare the output of different MT systems and to analyse system errors so that they can be addressed and corrected. As a consequence, MT evaluation has become increasingly important over the last decade, leading to the development of metrics that automatically assess MT output. Most of these metrics compare system output against reference translations, and the best-known and most widely used ones operate at the lexical level. In this study we present a linguistically motivated metric, VERTa, which combines a wide variety of linguistic features at the lexical, morphological, syntactic and semantic levels. Before designing and developing VERTa, a qualitative linguistic analysis of the data was performed to identify the linguistic phenomena that an MT metric must consider (Comelles et al. 2017). Here we introduce VERTa’s design and architecture and report the experiments carried out to develop the metric and to check the suitability and interaction of the linguistic information used. These experiments go beyond traditional correlation scores and move towards a more qualitative approach based on linguistic analysis. Finally, to check the validity of the metric, we compare its performance with that of other well-known state-of-the-art MT metrics.
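To make the idea of combining linguistic levels concrete, the following minimal sketch computes a weighted combination of per-module scores. The module names, weights and the simple F-measure over matched items are illustrative assumptions only; they do not reproduce VERTa's actual matching rules or tuned weights (the sources are linked in note 1).

```python
# Minimal sketch of a VERTa-style combination of module scores.
# Module names, weights and the bag-of-items F-measure are illustrative
# assumptions; they do not reproduce the metric's actual matching rules.
from collections import Counter

def f_measure(hyp_items, ref_items):
    """Harmonic mean of precision and recall over multisets of matched items."""
    hyp, ref = Counter(hyp_items), Counter(ref_items)
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def combined_score(module_scores, weights):
    """Weighted average of per-module scores (weights assumed to sum to 1)."""
    return sum(weights[m] * module_scores[m] for m in module_scores)

# Toy example: each module yields its own F-measure over the items it matches
# (lemmas, PoS tags, dependency triples, ...); placeholder values are used for
# all modules except the lexical one.
hyp = "the cat sat on a mat".split()
ref = "the cat sat on the mat".split()
scores = {"lexical": f_measure(hyp, ref),
          "morphological": 0.9,
          "syntactic": 0.8,
          "semantic": 0.85}
weights = {"lexical": 0.4, "morphological": 0.2, "syntactic": 0.2, "semantic": 0.2}
print(round(combined_score(scores, weights), 3))
```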


Notes

  1. Sources available at https://github.com/jatserias/VERTa.

  2. http://www.statmt.org/wmt10/evaluation-task.html.

  3. http://www.itl.nist.gov/iad/mig/tests/mt/2006/.

  4. The lexical semantic relations used are obtained from WordNet 3.0 (Fellbaum 1998); an illustrative lookup sketch follows these notes.

  5. The data has been annotated by the Stanford Log-Linear Part of Speech Tagger (Toutanova et al. 2003), included in the Stanford CoreNLP suite.

  6. This LM was used as a baseline feature in the WMT13 Quality Estimation Task (http://www.statmt.org/wmt13/quality-estimation-task.html); an illustrative scoring sketch follows these notes.

  7. det stands for determiner, num stands for numeral, and _ refers to the intermediate categories that help in moving from standard dependencies to collapsed dependencies; an illustrative collapsing sketch follows these notes.

  8. Lexical module matches in bold and N-gram module matches underlined.

  9. https://www.nist.gov/.

  10. https://www.ldc.upenn.edu/.

  11. https://catalog.ldc.upenn.edu/LDC2010T14.

  12. Missing subject.

  13. http://asiya.lsi.upc.edu/.

  14. Although both SP and CP metrics use the Penn Treebank PoS tagset, SP metrics use different tools to automatically annotate sentences [SVM tool (Giménez and Márquez 2004) and BIOS (Surdeanu and Turmo 2005)], hence their different performance.

  15. The original combination of metrics included two metrics, DP-HWCM_c and DP-HWCM_r, that are no longer available in the Asiya framework; they have been replaced by the variants DP-HWCM_c-4 and DP-HWCM_r-4.

  16. The one obtaining the better results of the two.

  17. http://www.quest.dcs.shef.ac.uk/quest_files/lm.europarl-nc.en.
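As a complement to note 4, the sketch below shows one way to look up lexical semantic relations in WordNet 3.0 using NLTK. The helper is hypothetical and for illustration only; it is not the lookup used inside VERTa.

```python
# Hedged sketch for note 4: checking whether two words stand in a WordNet
# relation (synonymy or direct hypernymy) with NLTK's WordNet interface.
# Illustration only; VERTa's own lexical matching rules are not reproduced here.
# Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def wordnet_relation(word_a, word_b):
    """Return 'synonym', 'hypernym' or None for the pair (word_a, word_b)."""
    synsets_b = set(wn.synsets(word_b))
    for synset_a in wn.synsets(word_a):
        if synset_a in synsets_b:                    # shared synset
            return "synonym"
        if synsets_b & set(synset_a.hypernyms()):    # word_b generalises word_a
            return "hypernym"
    return None

print(wordnet_relation("car", "automobile"))  # -> synonym
print(wordnet_relation("dog", "canine"))      # -> hypernym
```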
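Notes 6 and 17 point to the n-gram language model used as a fluency feature. A minimal way to score a hypothesis with such a model is sketched below using the kenlm Python bindings; kenlm, the model path and the file format are assumptions (the file linked in note 17 may first need converting to an ARPA or kenlm binary model), not necessarily how the feature was computed in the paper.

```python
# Hedged sketch for notes 6 and 17: scoring MT output fluency with an n-gram
# language model through the kenlm Python bindings. Path and format are
# placeholders; this does not reproduce the paper's exact feature extraction.
import kenlm

model = kenlm.Model("lm.europarl-nc.en.arpa")              # placeholder path
hypothesis = "the cat sat on the mat"

log10_prob = model.score(hypothesis, bos=True, eos=True)   # total log10 probability
per_word = log10_prob / (len(hypothesis.split()) + 1)      # normalise (+1 for </s>)
print(f"log10 prob: {log10_prob:.2f}, per-word: {per_word:.2f}")
```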
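Finally, the intermediate categories mentioned in note 7 arise when standard Stanford dependencies are collapsed (de Marneffe et al. 2006). The toy transformation below only illustrates the idea of folding a preposition node into the relation label; it is not VERTa code.

```python
# Hedged sketch for note 7: collapsing prep(head, prep) + pobj(prep, noun)
# into a single prep_<word>(head, noun) relation, in the spirit of the
# collapsed Stanford dependencies (de Marneffe et al. 2006).
def collapse_preps(deps):
    """deps: list of (relation, head, dependent) triples."""
    prep_head = {dep: head for rel, head, dep in deps if rel == "prep"}
    collapsed = []
    for rel, head, dep in deps:
        if rel == "prep":
            continue                                  # node folded into the label
        if rel == "pobj" and head in prep_head:
            collapsed.append((f"prep_{head}", prep_head[head], dep))
        else:
            collapsed.append((rel, head, dep))
    return collapsed

# "The cat sat on the mat": sat -prep-> on, on -pobj-> mat  ==>  prep_on(sat, mat)
deps = [("nsubj", "sat", "cat"), ("prep", "sat", "on"), ("pobj", "on", "mat")]
print(collapse_preps(deps))  # [('nsubj', 'sat', 'cat'), ('prep_on', 'sat', 'mat')]
```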

References

  • Agarwal, A., & Lavie, A. (2008). METEOR, M-BLEU and M-TER: Flexible matching and parameter tuning for high-correlation with human judgments of machine translation quality. In Proceedings of the ACL 2008 workshop on statistical machine translation, Columbus, Ohio, USA.

  • Albrecht, J. S., & Hwa, R. (2007a). A re-examination of machine learning approaches for sentence-level MT evaluation. In Proceedings of the 45th annual meeting of the association for computational linguistics (ACL), Prague, Czech Republic (pp. 880–887).

  • Albrecht, J. S., & Hwa, R. (2007b) Regression for sentence-level MT evaluation with pseudo references. In Proceedings of the 45th annual meeting of the association for computational linguistics, Prague, Czech Republic (pp. 296–303).

  • Atserias, J., Blanco, R., Chenlo, J. M., & Rodriguez, C. (2012). FBM-Yahoo at RepLab 2012. CLEF (Online Working Notes/Labs/Workshop).

  • Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of ACL workshop on intrinsic and extrinsic evaluation measures for MT and/or summarization, Michigan, USA.

  • Callison-Burch, Ch., Koehn, P., Monz, Ch., Post, M., Soricut, R., & Specia, L. (2012). Findings of the 2012 workshop on statistical machine translation. In Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada (pp. 10–51).

  • Castillo, J., & Estrella, P. (2012). Semantic textual similarity for MT evaluation. In Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada (pp. 52–58).

  • Chang, A. X., & Manning, Ch. D. (2012). SUTIME: A library for recognizing and normalizing time expressions. In Proceedings of the 8th international conference on language resources and evaluation, Istanbul, Turkey.

  • Chang, Y. S., & Ng, H. T. (2008). MAXSIM: A maximum similarity metric for machine translation evaluation. In Proceedings of the ACL-08: HLT, Columbus, Ohio, USA (pp. 55–62).

  • Charniak, E., & Johnson, M. (2005). Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL), Michigan, USA.

  • Chen, B., Kuhn, R., & Foster, G. (2012). Improving AMBER, an MT evaluation metric. In Proceedings of the 7th workshop on statistical machine translation, Montréal, Canada (pp. 59–63).

  • Ciaramita, M., & Altun, Y. (2006). Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the conference on empirical methods in natural language processing (EMNLP).

  • Comelles, E., Arranz, V., & Castellon, I. (2010). Constituency and dependency parsers evaluation. Revista de la Sociedad Española para el Procesamiento del Lenguaje Natural, 45, 59–66.

  • Comelles, E., Arranz, V., & Castellon, I. (2017). Guiding automatic MT evaluation by means of linguistic features. Digital Scholarship in the Humanities, 32(4), 761–778.

  • Comelles, E., & Atserias, J. (2014). VERTa participation in the WMT14 metrics task. In Proceedings of the ninth workshop on statistical machine translation, Baltimore, USA.

  • Comelles, E., & Atserias, J. (2015). VERTa: A linguistically-motivated metric at the WMT15 metrics task. In Proceedings of the tenth workshop on statistical machine translation, Lisbon, Portugal.

  • Comelles, E., & Atserias, J. (2016). Through the eyes of VERTa. Revista de la Sociedad Española para el Procesamiento del Lenguaje Natural, 57, 181–184.

  • De Marneffe, M.C., MacCartney, B., & Manning, C. D. (2006). Generating typed dependency parses from phrase structure parses. In Proceedings of the 5th edition of the international conference on language resources and evaluation (LREC-2006), Genoa, Italy.

  • Denkowski, M., & Lavie, A. (2011). Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the 6th workshop on statistical machine translation, Edinburgh, Scotland, UK (pp. 85–91).

  • Denkowski, M., & Lavie, A. (2014). Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, Baltimore, Maryland, USA (pp. 376–380).

  • Doddington, G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the 2nd international conference on human language technology, San Diego, California (pp. 138–145).

  • Farrús, M., Costa-Jussà, M. R., Mariño, J. B., & Fonollosa, J. A. R. (2010). Linguistic-based evaluation criteria to identify statistical machine translation errors. In Proceedings of the 14th annual conference of the European association for machine translation, Saint Raphael, France.

  • Fellbaum, Ch. (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.

  • Gautam, S., & Bhattacharyya, P. (2014). Layered: Metric for machine translation evaluation. In Proceedings of the ninth workshop on statistical machine translation, Baltimore, Maryland, USA (pp. 387–393).

  • Giménez, J. (2008). Empirical machine translation and its evaluation. PhD thesis, Universitat Politècnica de Catalunya, Spain.

  • Giménez, J., & Márquez, L. (2004). SVM tool: A general POS tagger generator based on support vector machines. In Proceedings of the 4th international conference on language resources and evaluation (LREC’04), Lisbon, Portugal (pp. 43–46).

  • Giménez, J., & Márquez, Ll. (2008). Discriminative phrase selection for statistical machine translation. In C. Goutte, N. Cancedda, M. Dymetman, & G. Foster (Eds.), Learning machine translation. NIPS workshop series. Cambridge: MIT Press.

  • Giménez, J., & Márquez, Ll. (2010a). Asiya: An open toolkit for automatic machine translation (meta-) evaluation. The Prague Bulletin of Mathematical Linguistics, 94, 77–86.

  • Giménez, J., & Márquez, Ll. (2010b). Linguistic measures for automatic machine translation evaluation. Machine Translation, 24(3–4), 77–86.

  • González, M., Barrón-Cedeño, A., & Márquez, L. L. (2014). IPA and STOUT: Leveraging linguistic and source-based features for machine translation evaluation. In Proceedings of the ninth workshop on statistical machine translation, Baltimore, Maryland, USA (pp. 394–401).

  • González, M., & Giménez, J. (2014). Asiya: An open toolkit for automatic machine translation (meta-) evaluation. Technical manual 3.0. Barcelona: Universitat Politècnica de Catalunya.

  • Gupta, R., Orăsan, C., & van Genabith, J. (2015). ReVal: A simple and effective machine translation evaluation metric based on recurrent neural networks. In Proceedings of the 2015 conference on empirical methods in natural language processing, Lisbon, Portugal (pp. 1066–1072).

  • Hachey, B., Radford, W., & Curran, J. R. (2011). Graph-based named entity linking with Wikipedia. In Proceedings of the 12th international conference on web information system engineering (pp. 213–226).

  • He, Y., Du, J., Way, A., & van Genabith, J. (2010). The DCU dependency-based metric in WMT-MetricsMATR 2010. In Proceedings of the 5th workshop on statistical machine translation, Uppsala, Sweden (pp. 349–353).

  • Hutchins, J. W., & Somers, H. L. (1992). An introduction to machine translation. London: Academic Press.

  • Joty, S., Guzmán, F., Márquez, L., & Nakov, P. (2014). DiscoTK: Using discourse structure for machine translation evaluation. In Proceedings of the ninth workshop on statistical machine translation, Baltimore, Maryland, USA (pp. 402–408).

  • Liu, D., & Gildea, D. (2005). Syntactic features for evaluation of machine translation. In Proceedings of ACL workshop on intrinsic and extrinsic evaluation measures for MT and/or summarization (pp. 25–32).

  • Lo, Ch. (2017). Meant 2.0: Accurate semantic MT evaluation for any output language. In Proceedings of the second conference on machine translation, volume 2: Shared tasks papers, Copenhagen, Denmark (pp. 589–597).

  • Lommel, A. (2016). Blues for BLEU: Reconsidering the validity of reference-based MT evaluation. In Proceedings of the LREC 2016 workshop “Translation evaluation: From fragmented tools and data sets to an integrated ecosystem”, Portorož, Slovenia (pp. 63–70).

  • Lo, Ch., & Wu, D. (2010). Semantic vs. syntactic vs. N-gram structure for machine translation evaluation. In Proceedings of the fourth workshop on syntax and structure in statistical translation, Beijing (pp. 52–60).

  • Lo, Ch., & Wu, D. (2011). Meant: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, Portland, Oregon, USA (pp. 220–229).

  • Lo, Ch., & Wu, D. (2012). Unsupervised vs. supervised weight estimation for semantic MT evaluation metrics. In The sixth workshop on syntax, semantics and structure in statistical translation (SSST-6), Jeju Island, South Korea.

  • Lo, Ch., & Wu, D. (2013). MEANT at WMT2013: A tunable, accurate yet inexpensive semantic frame based MT evaluation metric. In Proceedings of the eighth workshop on statistical machine translation, Sofia, Bulgaria (pp. 422–428).

  • Ma, Q., Graham, Y., Wang, S., & Liu, Q. (2017). Blend: A novel combined MT metric based on direct assessment CASICT-DCU submission to WMT17 metrics task. In Proceedings of the second conference on machine translation, volume 2: Shared tasks papers, Copenhagen, Denmark (pp. 598–603).

  • Macketanz, V., Avramidis, E., Burchardt, A., Helcl, J., & Srivastava, A. (2017). Machine translation: Phrase-based, rule-based and neural approaches with linguistic evaluation. Cybernetics and Information Technologies, 17(2), 28–43.

  • Owczarzak, K., van Genabith, J., & Way, A. (2007a). Dependency-based automatic evaluation for machine translation. In Proceedings of SSST, NAACL-HLT/AMTA workshop on syntax and structure in statistical translation (pp. 80–87).

  • Owczarzak, K., van Genabith, J., & Way, A. (2007b). Labelled dependencies in machine translation evaluation. In Proceedings of the ACL workshop on statistical machine translation, Czech Republic (pp. 104–111).

  • Padó, S., Galley, M., Jurafsky, D., & Manning, Ch D. (2009). Measuring machine translation quality as semantic equivalence: A metric based on entailment features. Machine Translation, 23(2–3), 181–193.

  • Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2001). Bleu: A method for automatic evaluation of machine translation. RC22176 (technical report). IBM T.J. Watson Research Center.

  • Pearson, K. (1914, 1924, 1930). The life, letters and labours of Francis Galton (3 volumes).

  • Popović, M. (2012). Class error rates for evaluation of machine translation output. In Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada (pp. 71–75).

  • Popović, M. (2015). chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the tenth workshop on statistical machine translation, Lisbon, Portugal (pp. 392–395).

  • Popović, M. (2017). chrF++: Words helping character n-grams. In Proceedings of the second conference on machine translation, volume 2: Shared tasks papers, Copenhagen, Denmark (pp. 612–618).

  • Reeder, F., Miller, K., Doyon, J., & White, J. (2001). The naming of things and the confusion of tongues: An MT metric. In Proceedings of the workshop on MT evaluation “who did what to whom?” at machine translation summit VIII (pp. 55–59).

  • Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the 7th conference of the association for machine translation in the Americas (AMTA) (pp. 223–231).

  • Snover, M., Madnani, N., Dorr, B., & Schwartz, R. (2009). Fluency, adequacy or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the 4th workshop on statistical machine translation at the 12th meeting of the european chapter of the association for computational linguistics (EACL-2009), Athens, Greece.

  • Specia, L., Cancedda, N., Dymetman, M., Turchi, M., & Cristianini, N. (2009). Estimating the sentence-level quality of machine translation systems. In Proceedings of the 13th annual conference of the EAMT, Barcelona, Spain (pp. 28–35).

  • Specia, L., Hajlaoui, N., Hallett, C., & Aziz, W. (2011). Predicting machine translation adequacy. In Proceedings of the 13th machine translation summit, Xiamen, China (pp. 513–520).

  • Specia, L., Raj, D., & Turchi, M. (2010). Machine translation evaluation versus quality estimation. Machine Translation, 24, 39–50.

  • Stanojević, M., & Sima’an, K. (2015). BEER 1.1: ILLC UvA submission to metrics and tuning task. In Proceedings of the tenth workshop on statistical machine translation, Lisbon, Portugal (pp. 396–401).

  • Surdeanu, M., & Turmo, J. (2005). Semantic role labeling using complete syntactic analysis. In Proceedings of CoNLL shared task.

  • Toutanova, K., Klein, D., Manning, Ch., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of HLT-NAACL (pp. 252–259).

  • Turian, J. P., Shen, L., & Melamed, I. D. (2003). Evaluation of machine translation and its evaluation. In Proceedings of MT SUMMIT IX.

  • Wang, M., & Manning, Ch. (2012). SPEDE: Probabilistic edit distance metrics for MT evaluation. In Proceedings of the 7th workshop on statistical machine translation, Montréal, Canada.

  • Wang, W., Peter, J., Rosendahl, H., & Ney, H. (2016). CharacTer: Translation edit rate on character level. In Proceedings of the first conference on machine translation, Berlin, Germany (pp. 505–510).

  • White, J. S., O’Connell, T., & O’Mara, F. (1994). The ARPA MT evaluation methodologies: Evolution, lessons, and future approaches. In Proceedings of the 1st conference of the association for machine translation in the Americas (AMTA) (pp. 193–205).

  • Wu, X., Yu, H., & Liu, Q. (2013). DCU participation in WMT13 metrics task. In Proceedings of the eighth workshop on statistical machine translation, Sofia, Bulgaria (pp. 435–439).

  • Yang, M. Y., Sun, S. Q., Zhu, J. G., Li, S., Zhao, T. J., & Zhu, X. N. (2011). Improvement of machine translation evaluation by simple linguistically motivated features. Journal of Computer Science and Technology, 26(1), 57–67.

  • Ye, Y., Zhou, M., & Lin, Ch. (2007). Sentence level machine translation evaluation as a ranking problem: One step aside from BLEU. In Proceedings of the second workshop on statistical machine translation, Prague, Czech Republic (pp. 240–247).

  • Yu, H., Ma, Q., Wu, X., & Liu, Q. (2015). CASICT-DCU participation in WMT15 metrics task. In Proceedings of the tenth workshop on statistical machine translation, Lisbon, Portugal (pp. 417–421).

  • Zhang, L., Weng, Z., Xiao, W., Wan, J., Chen, Z., Tan, Y., Li, M., & Wang, M. (2016). Extract domain-specific paraphrase from monolingual corpus for automatic evaluation of machine translation. In Proceedings of the first conference on machine translation, volume 2: Shared task papers, Berlin, Germany (pp. 511–517).


Acknowledgements

We would like to thank LDC and NIST for kindly providing the data used in this study. This work has been funded by the Spanish Government (Project TUNER, TIN2015-65308- C5-1-R, MINECO/FEDER, UE).

Author information

Corresponding author

Correspondence to Elisabet Comelles.


About this article

Cite this article

Comelles, E., Atserias, J. VERTa: a linguistic approach to automatic machine translation evaluation. Lang Resources & Evaluation 53, 57–86 (2019). https://doi.org/10.1007/s10579-018-9430-2
