Abstract
The dissemination of statistical machine translation (SMT) systems in the professional translation industry is still limited by the unreliability of SMT output, whose quality varies considerably. It would therefore be valuable for MT systems to assess their own translations with automatically derived quality measures. Predicting such quality measures was the goal of a shared task at the 2012 Workshop on Statistical Machine Translation. In this contribution, we first report our results for this shared task, detailing the features that we found to be the most predictive of quality. In the second part, we reexamine the shared task data and protocol, show that several factors contributed to the difficulty of the task, and discuss alternative evaluation designs.
Notes
It can even be argued that the issue is more difficult since MT evaluation is about comparing a target hypothesis and a target reference, whereas quality estimation is about comparing a target hypothesis and the original source sentence.
“Sentences for which the difference between the maximum score and minimum score assigned by the three judges was greater than 1 were eliminated” (Callison-Burch et al. 2012, p. 25)
The weighted kappa is a generalization of the well-known Cohen kappa to ordinal data; a linear weighting scheme was used here.
The Fleiss coefficient (Fleiss 1971) is a generalization of Cohen kappa to multiple raters.
We thank one of the reviewers for this suggestion.
The different feature sets used in our experiments can be downloaded from http://perso.limsi.fr/Individu/wisniews/.
“Dataset shift is a challenging situation where the joint distribution of inputs and outputs differs between the training and test stages” (Quiñonero-Candela et al. 2009).
Notwithstanding the discrepancy between perceived and actual post-editing difficulty pointed out by Koponen (2012).
Several participants in the shared task reported rather inconclusive results regarding the effect of sophisticated features in the shared task setting, e.g. IBM1 features (Popović 2012) or linguistic features (Felice and Specia 2012; Rubino et al. 2012). Again, this might be due to the train/test discrepancy mentioned above.
Notwithstanding overfitting effects, which do not seem to play such an important role here (cf. discussion in Sect. 4.1).
Out of the 26 possible combinations.
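The agreement measures mentioned in the notes above can be illustrated with a minimal sketch: the filtering rule quoted from Callison-Burch et al. (2012), Cohen's kappa with linear weights, and Fleiss' kappa. This is not the authors' code. The function names, the 1 to 5 score scale and the toy ratings are assumptions made for illustration only, and the formulas are the standard textbook definitions of these coefficients.

```python
from collections import Counter


def keep_sentence(scores, max_spread=1):
    """Filtering rule quoted in the notes: a sentence is eliminated when the
    gap between the highest and lowest judge score exceeds max_spread."""
    return max(scores) - min(scores) <= max_spread


def linear_weighted_kappa(rater_a, rater_b, categories):
    """Cohen's kappa for two raters with linear disagreement weights
    |i - j| / (k - 1), where i and j index the ordered categories.
    Undefined (division by zero) if both raters only ever use the same
    single category."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # Observed joint proportions of (rater_a label, rater_b label).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n
    marg_a = [sum(row) for row in observed]
    marg_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)
            num += w * observed[i][j]          # observed weighted disagreement
            den += w * marg_a[i] * marg_b[j]   # disagreement expected by chance
    return 1.0 - num / den


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa; ratings[i] lists the labels given to item i, with the
    same number of raters for every item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    p_bar = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar /= n_items
    # Chance agreement from the overall category proportions.
    p_e = sum((totals[c] / (n_items * n_raters)) ** 2 for c in categories)
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    scale = [1, 2, 3, 4, 5]  # assumed ordinal score scale
    toy = [[3, 4, 4], [2, 4, 3], [5, 5, 4]]  # hypothetical scores from 3 judges
    kept = [s for s in toy if keep_sentence(s)]
    print(len(kept), "of", len(toy), "sentences kept")
    print(linear_weighted_kappa([s[0] for s in kept], [s[1] for s in kept], scale))
    print(fleiss_kappa(kept, scale))
```

Linear weights penalise a disagreement in proportion to the distance between the two scores, which is why they suit ordinal judgments better than the unweighted coefficient.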
References
Albrecht J, Hwa R (2007) A re-examination of machine learning approaches for sentence-level MT evaluation. In: Proceedings of the 45th annual meeting of the association of computational linguistics, Association for Computational Linguistics, Prague, Czech Republic, pp 880–887
Artstein R, Poesio M (2008) Inter-coder agreement for computational linguistics. Comput Linguist 34(4):555–596. doi:10.1162/coli.07-034-R2
Bach N, Huang F, Al-Onaizan Y (2011) Goodness: a method for measuring machine translation confidence. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Association for Computational Linguistics, Portland, Oregon, USA, pp 211–219
Bartko J (1966) The intraclass correlation coefficient as a measure of reliability. Psychol Rep 19:3–11
Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2004) Confidence estimation for machine translation. In: Proceedings of COLING 2004, Geneva, Switzerland, pp 315–321
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Brown PF, Cocke J, Pietra SD, Pietra VJD, Jelinek F, Lafferty JD, Mercer RL, Roossin PS (1990) A statistical approach to machine translation. Comput Linguist 16(2):79–85
Burman P, Nolan D (1995) A general Akaike-type criterion for model selection in robust regression. Biometrika 82(4):877–886
Burstein J, Kukich K, Wolff S, Lu C, Chodorow M, Braden-Harder L, Harris MD (1998) Automated scoring using a hybrid feature identification technique. In: Proceedings of the 36th annual meeting of the Association for Computational Linguistics and 17th international conference on computational linguistics, vol 1. Association for Computational Linguistics, Montreal, Quebec, Canada, pp 206–210. doi:10.3115/980845.980879
Callison-Burch C, Koehn P, Monz C, Post M, Soricut R, Specia L (2012) Findings of the 2012 workshop on statistical machine translation. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 10–51
Drucker H, Burges CJC, Kaufman L, Smola AJ, Vapnik V (1996) Support vector regression machines. In: Mozer M, Jordan MI, Petsche T (eds) NIPS. MIT Press, Cambridge, MA, pp 155–161
Efron B, Tibshirani R (1993) An introduction to the Bootstrap. Chapman and Hall/CRC monographs on statistics and applied probability series, Chapman & Hall, New York
Felice M, Specia L (2012) Linguistic features for quality estimation. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 96–103
Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76(5):378–382
de Gispert A, Blackwood G, Iglesias G, Byrne W (2012) N-gram posterior probability confidence measures for statistical machine translation: an empirical study. Mach Transl 27(2):85–114
Gretton A, Smola A, Huang J, Schmittfull M, Borgwardt K, Schölkopf B (2009) Covariate shift and local learning by distribution matching. MIT Press, Cambridge, MA
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor 11:10–18
Hastie T, Tibshirani R, Friedman JH (2003) The elements of statistical learning. Springer, New York
Kanungo T, Orr D (2009) Predicting the readability of short web summaries. In: Proceedings of the second ACM international conference on web search and data mining, ACM, New York, NY, USA, pp 202–211
Koehn P (2010) Statistical machine translation. Cambridge University Press, Cambridge
Koehn P, Monz C (2006) Manual and automatic evaluation of machine translation between European languages. In: Proceedings on the workshop on statistical machine translation, Association for Computational Linguistics, New York, pp 102–121
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, Association for Computational Linguistics, Prague, Czech Republic, pp 177–180
Koponen M (2012) Comparing human perceptions of post-editing effort with post-editing operations. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 181–190
Le HS, Oparin I, Allauzen A, Gauvain JL, Yvon F (2011) Structured output layer neural network language model. In: Proceedings of IEEE international conference on acoustic, speech and signal processing, Prague, Czech Republic, pp 5524–5527
Le HS, Lavergne T, Allauzen A, Apidianaki M, Gong L, Max A, Sokolov A, Wisniewski G, Yvon F (2012) LIMSI @ WMT12. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 330–337
Narendra PM, Fukunaga K (1977) A branch and bound algorithm for feature subset selection. IEEE Trans Comput 26(9):917–922
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Popović M (2011) Morphemes and POS tags for n-gram based evaluation metrics. In: Proceedings of the sixth workshop on statistical machine translation, Association for Computational Linguistics, Edinburgh, Scotland, pp 104–107
Popović M (2012) Morpheme- and POS-based IBM1 and language model scores for translation quality estimation. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 133–137
Popović M, Vilar D, Avramidis E, Burchardt A (2011) Evaluation without references: IBM1 scores as evaluation metrics. In: Proceedings of the sixth workshop on statistical machine translation, Association for Computational Linguistics, Edinburgh, Scotland, pp 99–103
Quinlan RJ (1992) Learning with continuous classes. In: 5th Australian joint conference on artificial intelligence, pp 343–348
Quiñonero-Candela J, Sugiyama M, Schwaighofer A, Lawrence ND (2009) Dataset shift in machine learning. MIT Press, Cambridge
Quirk C (2004) Training a sentence-level machine translation confidence metric. In: Proceedings of the 4th international conference on language resources and evaluation, pp 825–828
Rifkin RM, Lippert RA (2007) Notes on regularized least squares. Tech Rep MIT-CSAIL-TR-2007-025, MIT-CSAIL
Rubino R, Foster J, Wagner J, Roturier J, Samad Zadeh Kaljahi R, Hollowood F (2012) DCU-Symantec submission for the WMT 2012 quality estimation task. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 138–144
Schmid H (1995) Improvements in part-of-speech tagging with an application to German. In: Proceedings of the ACL SIGDAT-Workshop, ACL, Dublin, Ireland
Shrout P, Fleiss J (1979) Intraclass correlation: uses in assessing rater reliability. Psychol Bull 86:420–428
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of AMTA, pp 223–231
Somers H (2003) Computers and translation: a translator’s guide. John Benjamins Publishing Company, Amsterdam
Soricut R, Echihabi A (2010) TrustRank: inducing trust in automatic translations via ranking. In: Proceedings of the 48th annual meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp 612–621
Soricut R, Bach N, Wang Z (2012) The SDL Language Weaver systems in the WMT12 quality estimation shared task. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 145–151
Specia L (2011) Exploiting objective annotations for measuring translation post-editing effort. In: Proceedings of the 15th conference of EAMT, Leuven, Belgium, pp 73–80
Specia L, Raj D, Turchi M (2010) Machine translation evaluation versus quality estimation. Mach Transl 24(1):39–50
Wang Y, Witten IH (1997) Induction of model trees for predicting continuous classes. In: 9th European conference on machine learning, Springer, pp 128–137
Xiong D, Zhang M, Li H (2010) Error detection for statistical machine translation using linguistic features. In: Proceedings of the 48th annual meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp 604–611
Zhuang Y, Wisniewski G, Yvon F (2012) Non-linear models for confidence estimation. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 157–162
Acknowledgments
The authors wish to thank LE Hai Son for helping us with the SOUL language model. This work was partially funded by the French National Research Agency under project ANR-CONTINT-TRACE.
Appendices
Appendix I: Detailed features list
Table 8 gives the complete list of features that we have been working with and from which we have drawn our different sets for experiments. All the language models were trained on WMT’12 monolingual data. For POS tag based features, the POS tags were obtained with the TreeTagger.
Note that there are a few redundant features in the above list. For example, the sentence length features are present as part of the baseline features category as well as separately (Table 9).
Appendix II: Mapping of POS Tags
See Table 10.
Cite this article
Wisniewski, G., Singh, A.K. & Yvon, F. Quality estimation for machine translation: some lessons learned. Machine Translation 27, 213–238 (2013). https://doi.org/10.1007/s10590-013-9141-9