
Data-driven annotation of binary MT quality estimation corpora based on human post-editions

Machine Translation

Abstract

Advanced computer-assisted translation (CAT) tools include automatic quality estimation (QE) mechanisms to support post-editors in identifying and selecting useful suggestions. Based on supervised learning techniques, QE relies on high-quality data annotations obtained through expensive manual procedures. However, since the notion of MT quality is inherently subjective, such procedures may result in unreliable or uninformative annotations. To overcome these issues, we propose an automatic method to obtain binary annotated data that explicitly discriminate between useful (suitable for post-editing) and useless suggestions. Our approach is fully data-driven and bypasses the need for explicit human labelling. Experiments with different language pairs and domains demonstrate that it yields better models than those obtained by adapting the available QE corpora into binary datasets. Furthermore, our analysis suggests that the learned thresholds separating useful from useless translations are significantly lower than those suggested in the existing guidelines for human annotators. Finally, a verification experiment with several translators operating with a CAT tool confirms our empirical findings.


Notes

  1. Henceforth we will use the term target to indicate the output of an MT system.

  2. http://www.statmt.org/wmt14/.

  3. Possible editing operations include the insertion, deletion, and substitution of single words, as well as shifts of word sequences (a toy sketch of the single-word operations follows these notes).

  4. http://www.statmt.org/wmt11/translation-task.html.

  5. Such biases support the idea that labelling translations with quality scores is per se a highly subjective task.

  6. http://www.matecat.com/.

  7. Partitions with thresholds below 2 were also considered, including the most intuitive partition with the cut-off set to 1. However, the resulting number of negative instances, if any, was too small, and the overall dataset too unbalanced, for standard supervised learning methods to be effective.

  8. The partition most closely related to our task (i.e. 1-1-1) was impossible to produce, since none of the examples was labelled with 1 by all the annotators. Even for 1-1-X, the negative class contains only one example. Moreover, the human scores did not allow us to create balanced datasets for comparison.

  9. Note that, since for the CAT dataset only HTER labels are available, only HTER-based partitions could be performed.

  10. This assumption is supported by the fact that reference sentences are, by definition, free translations produced manually, independently of and without any influence from the target.

  11. Monolingual stem-to-stem exact matches between TGT and correct_translation are inferred by computing the HTER, as in Blain et al. (2012).

  12. All ROUGE scores, described in Lin (2004), were calculated using the software available at http://www.berouge.com (a toy restatement of the measure follows these notes).

  13. Such partitions are: average effort score = 3, human scores = 3-3-3, HTER = 0.45 for WMT-12, and HTER = 0.3 for CAT.

  14. Each threshold corresponds to the HTER value t that maximizes the number of rewritings above t plus the number of post-editions below t. To set t, we computed these counts for 0 < HTER < 1 with a step of 0.001 (a short sketch of this scan follows these notes).

  15. Hence, independently of their HTER, some instances previously marked as positive examples are now considered negative, and vice versa.

  16. PET is the time required to transform the target into a publishable sentence.

  17. FPR = FP/(FP + TN), where FP and TN stand for the number of false positives and true negatives, respectively.

  18. FDR = 1 − precision = FP/(TP + FP), where TP stands for the number of true positives (Benjamini and Hochberg 1995). Both formulas are restated in a small helper after these notes.

  19. For the WMT dataset, twenty classifiers are trained on partitions based on average effort scores (AES), human scores (HS) and HTER, while one is trained on data resulting from our automatic annotation method. For the CAT dataset, ten classifiers are trained on HTER-based partitions, while one is trained on automatically labelled data.

  20. As regards the WMT-12 training data (see Table 1), the distribution of positive/negative instances in the training sets is: 1194/638 for classifier 3 AES, 1095/737 for classifier 0.35 HTER, 1418/414 for classifier AA. For the CAT data, the distribution is: 470/476 for classifier 0.30 HTER and 494/452 for classifier AA.

  21. ExpertRating: http://www.expertrating.com.

  22. Nothing guarantees that the translations obtained from System-100 are actually good and those obtained from System-30 are bad. However, the large difference in BLEU score between the two systems will likely lead to suggestions requiring different amounts of correction.

  23. Even when measured in a controlled lab environment, post-editing time has a high variability due to a myriad of factors that are impossible to control. For this reason, the Forward Search algorithm (Atkinson and Riani 2000; Atkinson et al. 2004) has been run to remove possible outliers (e.g. four rewritings and five post-editions in total for Translator 1). For our experiments, we used the FSDA Matlab toolbox (Riani et al. 2012).
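Notes 3 and 11 describe the edit operations underlying HTER and the stem-to-stem matching between TGT and correct_translation. The Python sketch below is an illustration only, not the implementation used in the paper: it stems both sides with a Snowball stemmer (cf. Porter 2001) and counts word-level insertions, deletions, and substitutions, omitting the sequence-shift operation of a full TER/HTER tool (Snover et al. 2006); all function and variable names are ours.

```python
# A minimal sketch (not the paper's implementation): word-level edit counting
# over Snowball-stemmed tokens, approximating HTER without the shift operation.
from nltk.stem.snowball import SnowballStemmer

def stemmed_edit_rate(target, post_edition, lang="english"):
    stemmer = SnowballStemmer(lang)
    hyp = [stemmer.stem(w) for w in target.lower().split()]
    ref = [stemmer.stem(w) for w in post_edition.lower().split()]
    # Word-level Levenshtein distance: insertions, deletions, substitutions.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    # HTER-style normalisation: edits divided by the post-edition length.
    return d[len(hyp)][len(ref)] / max(len(ref), 1)
```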
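Note 12 reports that ROUGE scores (Lin 2004) were computed with the berouge.com package. As a rough restatement of the underlying measure rather than of that software, ROUGE-N recall is the number of overlapping n-grams divided by the number of reference n-grams; the hypothetical function below illustrates this for arbitrary n.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Toy ROUGE-N recall (not the berouge.com implementation):
    overlapping n-grams divided by the reference n-gram count."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.lower().split())
    ref = ngrams(reference.lower().split())
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0
```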
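Note 14 sets each HTER cut-off t as the value that jointly maximizes the number of rewritings above t and the number of post-editions below t, scanned in steps of 0.001. A minimal sketch of that scan, with names of our own choosing:

```python
def best_hter_threshold(rewriting_hters, post_edition_hters, step=0.001):
    """Return the t in (0, 1) maximizing the count of rewritings with
    HTER > t plus the count of post-editions with HTER < t."""
    best_t, best_count = step, -1
    for i in range(1, int(round(1 / step))):
        t = i * step
        count = sum(h > t for h in rewriting_hters) + \
                sum(h < t for h in post_edition_hters)
        if count > best_count:
            best_t, best_count = t, count
    return best_t
```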
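Notes 17 and 18 define the false positive rate and the false discovery rate from confusion-matrix counts; the small helper below simply restates those two formulas.

```python
def fpr_fdr(tp, fp, tn):
    """FPR = FP / (FP + TN); FDR = 1 - precision = FP / (TP + FP)."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return fpr, fdr
```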

References

  • Atkinson AC, Riani M (2000) Robust diagnostic regression analysis. Springer Series in Statistics. Springer, New York

  • Atkinson AC, Riani M, Cerioli A (2004) Exploring multivariate data with the forward search. Springer Series in Statistics. Springer, New York

  • Bach N, Huang F, Al-Onaizan Y (2011) Goodness: a method for measuring machine translation confidence. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, Oregon, USA, pp 211–219

  • Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodological) 57:289–300


  • Blain F, Schwenk H, Senellart J (2012) Incremental adaptation using translation information and post-editing analysis. In: Proceedings of the international workshop on spoken language translation, Hong-Kong, China, pp 234–241

  • Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2004) Confidence estimation for machine translation. In: Proceedings of the 20th international conference on computational linguistics, Geneva, Switzerland, pp 315–321

  • Bojar O, Buck C, Callison-Burch C, Federmann C, Haddow B, Koehn P, Monz C, Post M, Soricut R, Specia L (2013) Findings of the 2013 workshop on statistical machine translation. In: Proceedings of the eighth workshop on statistical machine translation, Sofia, Bulgaria, pp 1–44

  • Callison-Burch C, Koehn P, Monz C, Post M, Soricut R, Specia L (2012) Findings of the 2012 Workshop on statistical machine translation. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, WMT-2012, pp 10–51

  • Carl M, Dragsted B, Elming J, Hardt D, Jakobsen AL (2011) The process of post-editing: a pilot study. In: Proceedings of the 8th international NLPSC workshop. Special theme: Human-machine interaction in translation, Copenhagen, Denmark, pp 131–142

  • Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):27:1–27:27. doi:10.1145/1961189.1961199


  • Chen CY, Yeh JY, Ke HR (2010) Plagiarism detection using ROUGE and WordNet. J Comput 2(3):34–44


  • Cohn T, Specia L (2013) Modelling annotator bias with multi-task gaussian processes: an application to machine translation quality estimation. In: Proceedings of the 51st annual meeting of the association for computational linguistics, Sofia, Bulgaria, pp 32–42

  • Camargo de Souza JG, Turchi M, Negri M (2014) Machine translation quality estimation across domains. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: technical papers, Dublin City University and Association for Computational Linguistics. Dublin, Ireland, pp 409–420, http://www.aclweb.org/anthology/C14-1040

  • Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7(1):1–26


  • Federico M, Cattelan A, Trombetti M (2012) Measuring user productivity in machine translation enhanced computer assisted translation. In: Proceedings of the Tenth conference of the association for machine translation in the Americas, San Diego, California

  • Federico M, Bertoldi N, Cettolo M, Negri M, Turchi M, Trombetti M, Cattelan A, Farina A, Lupinetti D, Martines A, Massidda A, Schwenk H, Barrault L, Blain F, Koehn P, Buck C, Germann U (2014) The MateCat tool. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: system demonstrations, Dublin City University and Association for Computational Linguistics, Dublin, Ireland, pp 129–132. http://www.aclweb.org/anthology/C14-2028

  • Garcia I (2011) Translating by post-editing: is it the way forward? Mach Transl 25(3):217–237


  • Graham Y, Baldwin T, Moffat A, Zobel J (2013) Continuous measurement scales in human evaluation of machine translation. In: Proceedings of the 7th linguistic annotation workshop and interoperability with discourse, Sofia, Bulgaria, pp 33–41

  • Green S, Heer J, Manning CD (2013) The efficacy of human post-editing for language translation. In: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, Paris, France, pp 439–448

  • Guerberof A (2009) Productivity and quality in MT post-editing. In: Proceedings of Machine Translation Summit XII—Workshop: Beyond translation memories: new tools for translators MT, Ottawa, Ontario, Canada

  • Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422


  • Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of Machine Translation Summit X, Phuket, Thailand, pp 79–86

  • Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R et al (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, Prague, Czech Republic, pp 177–180

  • Koponen M (2012) Comparing human perceptions of post-editing effort with post-editing operations. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 181–190

  • Koponen M, Aziz W, Ramos L, Specia L (2012) Post-editing time as a measure of cognitive effort. In: Proceedings of the AMTA 2012 workshop on post-editing technology and practice, San Diego, CA, USA

  • Läubli S, Fishel M, Massey G, Ehrensberger-Dow M, Volk M (2013) Assessing post-editing efficiency in a realistic translation environment. In: Proceedings of Machine Translation Summit XIV Workshop on Post-editing Technology and Practice, Nice, France, pp 83–91

  • Lesk M (1986) Automated sense disambiguation using machine-readable dictionaries: How to tell a pine cone from an ice cream cone. In: Proceedings of the 5th annual international conference on systems documentation (SIGDOC '86), Toronto, Canada, pp 24–26

  • Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. In: Proceedings of the ACL workshop on text summarization branches out, Barcelona, Spain, pp 74–81

  • Mehdad Y, Negri M, Federico M (2012) Match without a referee: Evaluating MT adequacy without reference translations. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 171–180

  • O’Brien S (2011) Towards predicting post-editing productivity. Mach Transl 25(3):197–215


  • Papadopoulos H, Proedrou K, Vovk V, Gammerman A (2002) Inductive confidence machines for regression. In: Proceedings of the 13th European conference on machine learning, Helsinki, Finland, pp 345–356

  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, Philadelphia, Pennsylvania, pp 311–318

  • Porter M (2001) Snowball: a language for stemming algorithms. http://snowball.tartarus.org/texts/introduction.html, Accessed 01 Aug 2014

  • Potet M, Esperança-Rodier E, Besacier L, Blanchon H (2012) Collection of a large database of French-English SMT output corrections. In: Proceedings of the eighth international conference on language resources and evaluation, Istanbul, Turkey, pp 4043–4048

  • Potthast M, Barrón-Cedeño A, Eiselt A, Stein B, Rosso P (2010) Overview of the 2nd international competition on plagiarism detection. In: Notebook papers of CLEF 2010 LABs and workshops, Padua, Italy

  • Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007) Numerical recipes: the art of scientific computing, 3rd edn. Cambridge University Press, New York


  • Quirk CB (2004) Training a sentence-level machine translation confidence measure. In: Proceedings of the fourth international conference on language resources and evaluation, pp 825–828

  • Riani M, Perrotta D, Torti F (2012) FSDA: a MATLAB toolbox for robust analysis and interactive data exploration. Chemom Intell Lab Syst 116:17–32


  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of the association for machine translation in the Americas, Cambridge, Massachusetts, USA, pp 223–231

  • Soricut R, Echihabi A (2010) TrustRank: inducing trust in automatic translations via ranking. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, Sweden, pp 612–621

  • Specia L (2011) Exploiting objective annotations for measuring translation post-editing effort. In: Proceedings of the 15th conference of the European association for machine translation, Leuven, Belgium, pp 73–80

  • Specia L, Cancedda N, Dymetman M, Turchi M, Cristianini N (2009a) Estimating the sentence-level quality of machine translation systems. In: Proceedings of the 13th annual conference of the European Association for machine translation, Barcelona, Spain, pp 28–35

  • Specia L, Turchi M, Wang Z, Shawe-Taylor J, Saunders C (2009b) Improving the confidence of machine translation quality estimates. In: Proceedings of machine translation Summit XII, Ottawa, Ontario, Canada

  • Specia L, Raj D, Turchi M (2010) Machine translation evaluation versus quality estimation. Mach Transl 24(1):39–50


  • Specia L, Shah K, C de Souza JG, Cohn T (2013) QuEst - a translation quality estimation framework. In: Proceedings of the 51st annual meeting of the association for computational linguistics: system demonstrations, Sofia, Bulgaria, pp 79–84

  • Steinberger R, Pouliquen B, Widiger A, Ignat C, Erjavec T, Tufis D, Varga D (2006) The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058

  • Turchi M, Negri M, Federico M (2013) Coping with the subjectivity of human judgements in MT quality estimation. In: Proceedings of the eighth workshop on statistical machine translation, Sofia, Bulgaria, pp 240–251

  • Turchi M, Anastasopoulos A, C de Souza JG, Negri M (2014) Adaptive quality estimation for machine translation. In: Proceedings of the 52nd annual meeting of the association for computational linguistics (Volume 1: Long Papers), Association for Computational Linguistics. Baltimore, Maryland, pp 710–720. http://www.aclweb.org/anthology/P14-1067

  • Zhechev V (2012) Machine translation infrastructure and post-editing performance at Autodesk. In: AMTA 2012 workshop on post-editing technology and practice, San Diego, USA, pp 87–96


Acknowledgments

This work has been partially supported by the EC-funded project MateCat (ICT-2011.4.2-287688).

Author information

Correspondence to Marco Turchi.


Cite this article

Turchi, M., Negri, M. & Federico, M. Data-driven annotation of binary MT quality estimation corpora based on human post-editions. Machine Translation 28, 281–308 (2014). https://doi.org/10.1007/s10590-014-9162-z

