
Data-driven annotation of binary MT quality estimation corpora based on human post-editions

Machine Translation

Abstract

Advanced computer-assisted translation (CAT) tools include automatic quality estimation (QE) mechanisms to support post-editors in identifying and selecting useful suggestions. Based on supervised learning techniques, QE relies on high-quality data annotations obtained through expensive manual procedures. However, since the notion of MT quality is inherently subjective, such procedures may result in unreliable or uninformative annotations. To overcome these issues, we propose an automatic method to obtain binary annotated data that explicitly discriminate between useful (suitable for post-editing) and useless suggestions. Our approach is fully data-driven and bypasses the need for explicit human labelling. Experiments with different language pairs and domains demonstrate that it yields better models than those obtained by adapting the available QE corpora into binary datasets. Furthermore, our analysis suggests that the learned thresholds separating useful from useless translations are significantly lower than those suggested in the existing guidelines for human annotators. Finally, a verification experiment with several translators operating with a CAT tool confirms our empirical findings.


Notes

  1. Henceforth we will use the term target to indicate the output of an MT system.

  2. http://www.statmt.org/wmt14/.

  3. Possible editing operations include the insertion, deletion, and substitution of single words, as well as shifts of word sequences (a toy sketch of the single-word operations follows these notes).

  4. http://www.statmt.org/wmt11/translation-task.html.

  5. Such biases support the idea that labelling translations with quality scores is per se a highly subjective task.

  6. http://www.matecat.com/.

  7. Partitions with thresholds below 2 were also considered, including the most intuitive partition with the cut-off set to 1. However, the resulting number of negative instances, if any, was too small, and the overall dataset too unbalanced, for standard supervised learning methods to be effective.

  8. The partition most closely related to our task (i.e. 1-1-1) was impossible to produce, since none of the examples was labelled with 1 by all the annotators. Even for 1-1-X, the negative class contains only one example. Moreover, the human scores did not allow us to create balanced datasets for comparison.

  9. Note that, since for the CAT dataset only HTER labels are available, only HTER-based partitions could be performed.

  10. This assumption is supported by the fact that reference sentences are, by definition, free translations produced manually, independently of and without any influence from the target.

  11. Monolingual stem-to-stem exact matches between TGT and correct_translation are inferred by computing the HTER, as in Blain et al. (2012).

  12. All ROUGE scores, described in Lin (2004), were calculated using the software available at http://www.berouge.com (a toy restatement of the measure follows these notes).

  13. Such partitions are: average effort score = 3, human scores = 3-3-3, HTER = 0.45 for WMT-12, and HTER = 0.3 for CAT.

  14. Each threshold corresponds to the HTER value t that maximizes the number of rewritings above t plus the number of post-editions below t. To set t, we computed these counts for 0 < HTER < 1 with a step of 0.001 (a short sketch of this scan follows these notes).

  15. Hence, independently of their HTER, some instances previously marked as positive examples are now considered negative, and vice versa.

  16. PET is the time required to transform the target into a publishable sentence.

  17. FPR = FP/(FP + TN), where FP and TN stand for the number of false positives and true negatives, respectively.

  18. FDR = 1 − precision = FP/(TP + FP), where TP stands for the number of true positives (Benjamini and Hochberg 1995). Both formulas are restated in a small helper after these notes.

  19. For the WMT dataset, twenty classifiers are trained on partitions based on average effort scores (AES), human scores (HS) and HTER, while one is trained on data resulting from our automatic annotation method. For the CAT dataset, ten classifiers are trained on HTER-based partitions, while one is trained on automatically labelled data.

  20. As regards the WMT-12 training data (see Table 1), the distribution of positive/negative instances in the training sets is: 1194/638 for classifier 3 AES, 1095/737 for classifier 0.35 HTER, 1418/414 for classifier AA. For the CAT data, the distribution is: 470/476 for classifier 0.30 HTER and 494/452 for classifier AA.

  21. ExpertRating: http://www.expertrating.com.

  22. Nothing guarantees that the translations obtained from System-100 are actually good and those obtained from System-30 are bad. However, the large difference in BLEU score between the two systems will likely lead to suggestions requiring different amounts of correction.

  23. Even when measured in a controlled lab environment, post-editing time has a high variability due to a myriad of factors that are impossible to control. For this reason, the Forward Search algorithm (Atkinson and Riani 2000; Atkinson et al. 2004) has been run to remove possible outliers (e.g. four rewritings and five post-editions in total for Translator 1). For our experiments, we used the FSDA Matlab toolbox (Riani et al. 2012).
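Notes 3 and 11 describe the edit operations underlying HTER and the stem-to-stem matching between TGT and correct_translation. The Python sketch below is an illustration only, not the implementation used in the paper: it stems both sides with a Snowball stemmer (cf. Porter 2001) and counts word-level insertions, deletions, and substitutions, omitting the sequence-shift operation of a full TER/HTER tool (Snover et al. 2006); all function and variable names are ours.

```python
# A minimal sketch (not the paper's implementation): word-level edit counting
# over Snowball-stemmed tokens, approximating HTER without the shift operation.
from nltk.stem.snowball import SnowballStemmer

def stemmed_edit_rate(target, post_edition, lang="english"):
    stemmer = SnowballStemmer(lang)
    hyp = [stemmer.stem(w) for w in target.lower().split()]
    ref = [stemmer.stem(w) for w in post_edition.lower().split()]
    # Word-level Levenshtein distance: insertions, deletions, substitutions.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    # HTER-style normalisation: edits divided by the post-edition length.
    return d[len(hyp)][len(ref)] / max(len(ref), 1)
```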
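Note 12 reports that ROUGE scores (Lin 2004) were computed with the berouge.com package. As a rough restatement of the underlying measure rather than of that software, ROUGE-N recall is the number of overlapping n-grams divided by the number of reference n-grams; the hypothetical function below illustrates this for arbitrary n.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Toy ROUGE-N recall (not the berouge.com implementation):
    overlapping n-grams divided by the reference n-gram count."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.lower().split())
    ref = ngrams(reference.lower().split())
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0
```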
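Note 14 sets each HTER cut-off t as the value that jointly maximizes the number of rewritings above t and the number of post-editions below t, scanned in steps of 0.001. A minimal sketch of that scan, with names of our own choosing:

```python
def best_hter_threshold(rewriting_hters, post_edition_hters, step=0.001):
    """Return the t in (0, 1) maximizing the count of rewritings with
    HTER > t plus the count of post-editions with HTER < t."""
    best_t, best_count = step, -1
    for i in range(1, int(round(1 / step))):
        t = i * step
        count = sum(h > t for h in rewriting_hters) + \
                sum(h < t for h in post_edition_hters)
        if count > best_count:
            best_t, best_count = t, count
    return best_t
```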
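Notes 17 and 18 define the false positive rate and the false discovery rate from confusion-matrix counts; the small helper below simply restates those two formulas.

```python
def fpr_fdr(tp, fp, tn):
    """FPR = FP / (FP + TN); FDR = 1 - precision = FP / (TP + FP)."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return fpr, fdr
```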

References

  • Atkinson AC, Riani M (2000) Robust diagnostic regression analysis. Springer Series in Statistics. Springer, New York

  • Atkinson AC, Riani M, Cerioli A (2004) Exploring multivariate data with the forward search. Springer Series in Statistics. Springer, New York

  • Bach N, Huang F, Al-Onaizan Y (2011) Goodness: a method for measuring machine translation confidence. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, Oregon, USA, pp 211–219

  • Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodological) 57:289–300


  • Blain F, Schwenk H, Senellart J (2012) Incremental adaptation using translation information and post-editing analysis. In: Proceedings of the international workshop on spoken language translation, Hong-Kong, China, pp 234–241

  • Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2004) Confidence estimation for machine translation. In: Proceedings of the 20th international conference on computational linguistics, Geneva, Switzerland, pp 315–321

  • Bojar O, Buck C, Callison-Burch C, Federmann C, Haddow B, Koehn P, Monz C, Post M, Soricut R, Specia L (2013) Findings of the 2013 workshop on statistical machine translation. In: Proceedings of the eighth workshop on statistical machine translation, Sofia, Bulgaria, pp 1–44

  • Callison-Burch C, Koehn P, Monz C, Post M, Soricut R, Specia L (2012) Findings of the 2012 Workshop on statistical machine translation. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, WMT-2012, pp 10–51

  • Carl M, Dragsted B, Elming J, Hardt D, Jakobsen AL (2011) The process of post-editing: a pilot study. In: Proceedings of the 8th international NLPSC workshop. Special theme: Human-machine interaction in translation, Copenhagen, Denmark, pp 131–142

  • Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):27:1–27:27. doi:10.1145/1961189.1961199


  • Chen CY, Yeh JY, Ke HR (2010) Plagiarism detection using ROUGE and WordNet. J Comput 2(3):34–44


  • Cohn T, Specia L (2013) Modelling annotator bias with multi-task gaussian processes: an application to machine translation quality estimation. In: Proceedings of the 51st annual meeting of the association for computational linguistics, Sofia, Bulgaria, pp 32–42

  • Camargo de Souza JG, Turchi M, Negri M (2014) Machine translation quality estimation across domains. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: technical papers, Dublin City University and Association for Computational Linguistics. Dublin, Ireland, pp 409–420, http://www.aclweb.org/anthology/C14-1040

  • Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7(1):1–26


  • Federico M, Cattelan A, Trombetti M (2012) Measuring user productivity in machine translation enhanced computer assisted translation. In: Proceedings of the Tenth conference of the association for machine translation in the Americas, San Diego, California

  • Federico M, Bertoldi N, Cettolo M, Negri M, Turchi M, Trombetti M, Cattelan A, Farina A, Lupinetti D, Martines A, Massidda A, Schwenk H, Barrault L, Blain F, Koehn P, Buck C, Germann U (2014) The MateCat tool. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: system demonstrations, Dublin City University and Association for Computational Linguistics, Dublin, Ireland, pp 129–132. http://www.aclweb.org/anthology/C14-2028

  • Garcia I (2011) Translating by post-editing: is it the way forward? Mach Transl 25(3):217–237


  • Graham Y, Baldwin T, Moffat A, Zobel J (2013) Continuous measurement scales in human evaluation of machine translation. In: Proceedings of the 7th linguistic annotation workshop and interoperability with discourse, Sofia, Bulgaria, pp 33–41

  • Green S, Heer J, Manning CD (2013) The efficacy of human post-editing for language translation. In: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, Paris, France, pp 439–448

  • Guerberof A (2009) Productivity and quality in MT post-editing. In: Proceedings of Machine Translation Summit XII—Workshop: Beyond translation memories: new tools for translators MT, Ottawa, Ontario, Canada

  • Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422


  • Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of Machine Translation Summit X, Phuket, Thailand, pp 79–86

  • Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R et al (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, Prague, Czech Republic, pp 177–180

  • Koponen M (2012) Comparing human perceptions of post-editing effort with post-editing operations. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 181–190

  • Koponen M, Aziz W, Ramos L, Specia L (2012) Post-editing time as a measure of cognitive effort. In: Proceedings of the AMTA 2012 workshop on post-editing technology and practice, San Diego, CA, USA

  • Läubli S, Fishel M, Massey G, Ehrensberger-Dow M, Volk M (2013) Assessing post-editing efficiency in a realistic translation environment. In: Proceedings of Machine Translation Summit XIV Workshop on Post-editing Technology and Practice, Nice, France, pp 83–91

  • Lesk M (1986) Automated sense disambiguation using machine-readable dictionaries: How to tell a pine cone from an ice cream cone. In: Proceedings of the 5th annual international conference on systems documentation (SIGDOC '86), Toronto, Canada, pp 24–26

  • Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. In: Proceedings of the ACL workshop on text summarization branches out, Barcelona, Spain, pp 74–81

  • Mehdad Y, Negri M, Federico M (2012) Match without a referee: Evaluating MT adequacy without reference translations. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 171–180

  • O’Brien S (2011) Towards predicting post-editing productivity. Mach Transl 25(3):197–215


  • Papadopoulos H, Proedrou K, Vovk V, Gammerman A (2002) Inductive confidence machines for regression. In: Proceedings of the 13th European conference on machine learning, Helsinki, Finland, pp 345–356

  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, Philadelphia, Pennsylvania, pp 311–318

  • Porter M (2001) Snowball: a language for stemming algorithms. http://snowball.tartarus.org/texts/introduction.html, Accessed 01 Aug 2014

  • Potet M, Esperança-Rodier E, Besacier L, Blanchon H (2012) Collection of a large database of French-English SMT output corrections. In: Proceedings of the eighth international conference on language resources and evaluation, Istanbul, Turkey, pp 4043–4048

  • Potthast M, Barrón-Cedeño A, Eiselt A, Stein B, Rosso P (2010) Overview of the 2nd international competition on plagiarism detection. In: Notebook papers of CLEF 2010 LABs and workshops, Padua, Italy

  • Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007) Numerical recipes: the art of scientific computing, 3rd edn. Cambridge University Press, New York


  • Quirk CB (2004) Training a sentence-level machine translation confidence measure. In: Proceedings of the fourth international conference on language resources and evaluation, pp 825–828

  • Riani M, Perrotta D, Torti F (2012) FSDA: a MATLAB toolbox for robust analysis and interactive data exploration. Chemom Intell Lab Syst 116:17–32


  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of the association for machine translation in the Americas, Cambridge, Massachusetts, USA, pp 223–231

  • Soricut R, Echihabi A (2010) TrustRank: inducing trust in automatic translations via ranking. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, Sweden, pp 612–621

  • Specia L (2011) Exploiting objective annotations for measuring translation post-editing effort. In: Proceedings of the 15th conference of the European association for machine translation, Leuven, Belgium, pp 73–80

  • Specia L, Cancedda N, Dymetman M, Turchi M, Cristianini N (2009a) Estimating the sentence-level quality of machine translation systems. In: Proceedings of the 13th annual conference of the European Association for machine translation, Barcelona, Spain, pp 28–35

  • Specia L, Turchi M, Wang Z, Shawe-Taylor J, Saunders C (2009b) Improving the confidence of machine translation quality estimates. In: Proceedings of machine translation Summit XII, Ottawa, Ontario, Canada

  • Specia L, Raj D, Turchi M (2010) Machine translation evaluation versus quality estimation. Mach Transl 24(1):39–50


  • Specia L, Shah K, C de Souza JG, Cohn T (2013) QuEst - a translation quality estimation framework. In: Proceedings of the 51st annual meeting of the association for computational linguistics: system demonstrations, Sofia, Bulgaria, pp 79–84

  • Steinberger R, Pouliquen B, Widiger A, Ignat C, Erjavec T, Tufis D, Varga D (2006) The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058

  • Turchi M, Negri M, Federico M (2013) Coping with the subjectivity of human judgements in MT quality estimation. In: Proceedings of the eighth workshop on statistical machine translation, Sofia, Bulgaria, pp 240–251

  • Turchi M, Anastasopoulos A, C de Souza JG, Negri M (2014) Adaptive quality estimation for machine translation. In: Proceedings of the 52nd annual meeting of the association for computational linguistics (Volume 1: Long Papers), Association for Computational Linguistics. Baltimore, Maryland, pp 710–720. http://www.aclweb.org/anthology/P14-1067

  • Zhechev V (2012) Machine translation infrastructure and post-editing performance at Autodesk. In: AMTA 2012 workshop on post-editing technology and practice, San Diego, USA, pp 87–96


Acknowledgments

This work has been partially supported by the EC-funded project MateCat (ICT-2011.4.2-287688).

Author information

Correspondence to Marco Turchi.


Cite this article

Turchi, M., Negri, M. & Federico, M. Data-driven annotation of binary MT quality estimation corpora based on human post-editions. Machine Translation 28, 281–308 (2014). https://doi.org/10.1007/s10590-014-9162-z

