Quality estimation for machine translation: some lessons learned

Machine Translation

Abstract

The adoption of statistical machine translation (SMT) systems in the professional translation industry is still limited by the unreliability of SMT output, the quality of which varies to a great extent. It would therefore be valuable for MT systems to assess their own output translations using automatically derived quality measures. Predicting such quality measures was the goal of a shared task at the Workshop on SMT in 2012. In this contribution, we first report our results for this shared task, detailing the features that we found to be the most predictive of quality. In the second part, we reexamine the shared task data and protocol, show that several factors contributed to the difficulty of the task, and discuss alternative evaluation designs.

Notes

  1. It can even be argued that the issue is more difficult since MT evaluation is about comparing a target hypothesis and a target reference, whereas quality estimation is about comparing a target hypothesis and the original source sentence.

  2. “Sentences for which the difference between the maximum score and minimum score assigned by the three judges was greater than 1 were eliminated” (Callison-Burch et al. 2012, p. 25)

  3. The weighted kappa is a generalization of the well-known Cohen kappa to ordinal data; a linear weighting scheme was considered here.

  4. The Fleiss coefficient (Fleiss 1971) is a generalization of Cohen kappa to multiple raters.

  5. We thank one of the reviewers for this suggestion.

  6. The different feature sets used in our experiments can be downloaded from http://perso.limsi.fr/Individu/wisniews/.

  7. “Dataset shift is a challenging situation where the joint distribution of inputs and outputs differs between the training and test stages” (Quiñonero-Candela et al. 2009).

  8. Notwithstanding the discrepancy between perceived and actual post-editing difficulty pointed out by Koponen (2012).

  9. Several participants in the shared task reported rather inconclusive results, in the shared task setting, regarding the effect of sophisticated features (e.g. IBM1 features in Popović (2012), or linguistic features in Felice and Specia (2012) and Rubino et al. (2012)). Again, this might be due to the train/test discrepancy mentioned above.

  10. Notwithstanding overfitting effects, which do not seem to play such an important role here (cf. discussion in Sect. 4.1).

  11. Out of the 26 possible combinations.
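
The two agreement coefficients mentioned in notes 3 and 4 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the argument layout, and the toy ratings in the usage note are hypothetical and only follow the standard textbook definitions (linear disagreement weights |i - j| / (k - 1) for the weighted kappa; Fleiss 1971 for the multi-rater coefficient).

```python
from collections import Counter


def linear_weighted_kappa(rater_a, rater_b, categories):
    """Cohen's kappa with linear disagreement weights, for two raters.

    `categories` lists the ordinal labels in increasing order (assumed layout).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # Joint counts of (label from A, label from B) and per-rater marginals.
    joint = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)
    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            observed += weight * joint[(i, j)] / n
            expected += weight * marg_a[i] * marg_b[j] / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one label")
    return 1.0 - observed / expected


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa; `ratings` holds one list of labels per item,
    with the same number of raters for every item."""
    n_items = len(ratings)
    n_raters = len(ratings[0]) if ratings else 0
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    totals = Counter()
    agreement = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        pairs = sum(c * c for c in counts.values()) - n_raters
        agreement += pairs / (n_raters * (n_raters - 1))
    p_bar = agreement / n_items
    p_e = sum((totals[c] / (n_items * n_raters)) ** 2 for c in categories)
    if p_e == 1:
        raise ValueError("kappa is undefined when only one label is used")
    return (p_bar - p_e) / (1 - p_e)
```

For instance, `linear_weighted_kappa([1, 2, 3], [3, 2, 1], [1, 2, 3])` scores two raters on a three-point ordinal scale, and `fleiss_kappa([[1, 1, 2], [1, 2, 2]], [1, 2])` scores three raters on two items. Expressing the weighted kappa through disagreement weights keeps the ordinal penalty explicit: a one-point gap costs less than a two-point gap.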

References

  • Albrecht J, Hwa R (2007) A re-examination of machine learning approaches for sentence-level MT evaluation. In: Proceedings of the 45th annual meeting of the association of computational linguistics, Association for Computational Linguistics, Prague, Czech Republic, pp 880–887

  • Artstein R, Poesio M (2008) Inter-coder agreement for computational linguistics. Comput Linguist 34(4):555–596. doi:10.1162/coli.07-034-R2

  • Bach N, Huang F, Al-Onaizan Y (2011) Goodness: a method for measuring machine translation confidence. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Association for Computational Linguistics, Portland, Oregon, USA, pp 211–219

  • Bartko J (1966) The intraclass correlation coefficient as a measure of reliability. Psychol Rep 19:3–11

  • Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2004) Confidence estimation for machine translation. In: Proceedings of COLING 2004, Geneva, Switzerland, pp 315–321

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  • Brown PF, Cocke J, Pietra SD, Pietra VJD, Jelinek F, Lafferty JD, Mercer RL, Roossin PS (1990) A statistical approach to machine translation. Comput Linguist 16(2):79–85

  • Burman P, Nolan D (1995) A general Akaike-type criterion for model selection in robust regression. Biometrika 82(4):877–886

  • Burstein J, Kukich K, Wolff S, Lu C, Chodorow M, Braden-Harder L, Harris MD (1998) Automated scoring using a hybrid feature identification technique. In: Proceedings of the 36th annual meeting of the Association for Computational Linguistics and 17th international conference on computational linguistics, vol 1. Association for Computational Linguistics, Montreal, Quebec, Canada, pp 206–210. doi:10.3115/980845.980879

  • Callison-Burch C, Koehn P, Monz C, Post M, Soricut R, Specia L (2012) Findings of the 2012 workshop on statistical machine translation. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 10–51

  • Drucker H, Burges CJC, Kaufman L, Smola AJ, Vapnik V (1996) Support vector regression machines. In: Mozer M, Jordan MI, Petsche T (eds) NIPS. MIT Press, Cambridge, MA, pp 155–161

  • Efron B, Tibshirani R (1993) An introduction to the Bootstrap. Chapman and Hall/CRC monographs on statistics and applied probability series, Chapman & Hall, New York

  • Felice M, Specia L (2012) Linguistic features for quality estimation. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 96–103

  • Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76(5):378–382

  • de Gispert A, Blackwood G, Iglesias G, Byrne W (2012) N-gram posterior probability confidence measures for statistical machine translation: an empirical study. Mach Transl 27(2):85–114

  • Gretton A, Smola A, Huang J, Schmittfull M, Borgwardt K, Schölkopf B (2009) Covariate shift and local learning by distribution matching. MIT Press, Cambridge, MA

  • Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182

  • Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor 11:10–18

  • Hastie T, Tibshirani R, Friedman JH (2003) The elements of statistical learning. Springer, New York

  • Kanungo T, Orr D (2009) Predicting the readability of short web summaries. In: Proceedings of the second ACM international conference on web search and data mining, ACM, New York, NY, USA, pp 202–211

  • Koehn P (2010) Statistical machine translation. Cambridge University Press, Cambridge

  • Koehn P, Monz C (2006) Manual and automatic evaluation of machine translation between European languages. In: Proceedings on the workshop on statistical machine translation, Association for Computational Linguistics, New York, pp 102–121

  • Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, Association for Computational Linguistics, Prague, Czech Republic, pp 177–180

  • Koponen M (2012) Comparing human perceptions of post-editing effort with post-editing operations. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 181–190

  • Le HS, Oparin I, Allauzen A, Gauvain JL, Yvon F (2011) Structured output layer neural network language model. In: Proceedings of IEEE international conference on acoustic, speech and signal processing, Prague, Czech Republic, pp 5524–5527

  • Le HS, Lavergne T, Allauzen A, Apidianaki M, Gong L, Max A, Sokolov A, Wisniewski G, Yvon F (2012) LIMSI @ WMT12. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 330–337

  • Narendra PM, Fukunaga K (1977) A branch and bound algorithm for feature subset selection. IEEE Trans Comput 26(9):917–922

  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830

  • Popović M (2011) Morphemes and POS tags for n-gram based evaluation metrics. In: Proceedings of the sixth workshop on statistical machine translation, Association for Computational Linguistics, Edinburgh, Scotland, pp 104–107

  • Popović M (2012) Morpheme- and POS-based IBM1 and language model scores for translation quality estimation. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 133–137

  • Popović M, Vilar D, Avramidis E, Burchardt A (2011) Evaluation without references: IBM1 scores as evaluation metrics. In: Proceedings of the sixth workshop on statistical machine translation, Association for Computational Linguistics, Edinburgh, Scotland, pp 99–103

  • Quinlan RJ (1992) Learning with continuous classes. In: 5th Australian joint conference on artificial intelligence, pp 343–348

  • Quiñonero-Candela J, Sugiyama M, Schwaighofer A, Lawrence ND (2009) Dataset shift in machine learning. MIT Press, Cambridge

  • Quirk C (2004) Training a sentence-level machine translation confidence metric. In: Proceedings of the 4th international conference on language resources and evaluation, pp 825–828

  • Rifkin RM, Lippert RA (2007) Notes on regularized least squares. Tech Rep MIT-CSAIL-TR-2007-025, MIT-CSAIL

  • Rubino R, Foster J, Wagner J, Roturier J, Samad Zadeh Kaljahi R, Hollowood F (2012) DCU-Symantec submission for the WMT 2012 quality estimation task. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 138–144

  • Schmid H (1995) Improvements in part-of-speech tagging with an application to German. In: Proceedings of the ACL SIGDAT-Workshop, ACL, Dublin, Ireland

  • Shrout P, Fleiss J (1979) Intraclass correlation: uses in assessing rater reliability. Psychol Bull 86:420–428

  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of AMTA, pp 223–231

  • Somers H (2003) Computers and translation: a translator’s guide. John Benjamins Publishing Company, Amsterdam

  • Soricut R, Echihabi A (2010) TrustRank: inducing trust in automatic translations via ranking. In: Proceedings of the 48th annual meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp 612–621

  • Soricut R, Bach N, Wang Z (2012) The SDL Language Weaver systems in the WMT12 quality estimation shared task. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 145–151

  • Specia L (2011) Exploiting objective annotations for measuring translation post-editing effort. In: Proceedings of the 15th conference of EAMT, Leuven, Belgium, pp 73–80

  • Specia L, Raj D, Turchi M (2010) Machine translation evaluation versus quality estimation. Mach Transl 24(1):39–50

  • Wang Y, Witten IH (1997) Induction of model trees for predicting continuous classes. In: 9th European conference on machine learning, Springer, pp 128–137

  • Xiong D, Zhang M, Li H (2010) Error detection for statistical machine translation using linguistic features. In: Proceedings of the 48th annual meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp 604–611

  • Zhuang Y, Wisniewski G, Yvon F (2012) Non-linear models for confidence estimation. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 157–162

Acknowledgments

The authors wish to thank LE Hai Son for helping us with the SOUL language model. This work was partially funded by the French National Research Agency under project ANR-CONTINT-TRACE.

Author information

Corresponding author

Correspondence to Guillaume Wisniewski.

Appendices

Appendix I: Detailed features list

Table 8 gives the complete list of features that we have been working with and from which we have drawn our different sets for experiments. All the language models were trained on WMT’12 monolingual data. For POS tag based features, the POS tags were obtained with the TreeTagger.

Table 8 Complete list of features that we started with (the All set)

Note that there are a few redundant features in the above list. For example, the sentence length features are present as part of the baseline features category as well as separately (Table 9).

Table 9 List of features used in the 4 sets used in our experiments (as indexed in Table 8)

Appendix II: Mapping of POS Tags

See Table 10.

Table 10 Mapping of POS tags to syntactic categories as used for poscount features

About this article

Cite this article

Wisniewski, G., Singh, A.K. & Yvon, F. Quality estimation for machine translation: some lessons learned. Machine Translation 27, 213–238 (2013). https://doi.org/10.1007/s10590-013-9141-9

  • DOI: https://doi.org/10.1007/s10590-013-9141-9
