Quality estimation for machine translation: some lessons learned

Wisniewski, Guillaume; Singh, Anil Kumar; Yvon, François

doi:10.1007/s10590-013-9141-9

Published: 30 August 2013

Volume 27, pages 213–238 (2013)

Machine Translation

Guillaume Wisniewski¹,
Anil Kumar Singh² &
François Yvon¹

Abstract

The dissemination of statistical machine translation (SMT) systems in the professional translation industry is still limited by the lack of reliability of SMT outputs, the quality of which varies to a great extent. It would therefore be valuable for MT systems to assess their own output translations with automatically derived quality measures. Predicting such quality measures was the goal of a shared task at the Workshop on SMT in 2012. In this contribution, we first report our results for this shared task, detailing the features that we found to be the most predictive of quality. In the latter part, we reexamine the shared task data and protocol, show that several factors contributed to the difficulty of the task, and discuss alternative evaluation designs.

Notes

It can even be argued that the issue is more difficult since MT evaluation is about comparing a target hypothesis and a target reference, whereas quality estimation is about comparing a target hypothesis and the original source sentence.
“Sentences for which the difference between the maximum score and minimum score assigned by the three judges was greater than 1 were eliminated” (Callison-Burch et al. 2012, p. 25)
The weighted kappa is a generalization of the well-known Cohen kappa to ordinal data; a linear weighting schema was considered here.
The Fleiss coefficient (Fleiss 1971) is a generalization of Cohen kappa to multi-raters.
We thank one of the reviewers for this suggestion.
The different feature sets used in our experiments can be downloaded from http://perso.limsi.fr/Individu/wisniews/.
“Dataset shift is a challenging situation where the joint distribution of inputs and outputs differs between the training and test stages” (Quiñonero-Candela et al. 2009).
Notwithstanding the discrepancy between perceived and actual post-editing difficulty pointed out by Koponen (2012).
Several participants in the shared task reported rather inconclusive results regarding the effect of sophisticated features (e.g. IBM1 features in Popović (2012), or linguistic features in Felice and Specia (2012) and Rubino et al. (2012)). Again, this might be due to the train/test discrepancy mentioned above.
Notwithstanding overfitting effects, which do not seem to play such an important role here (cf. discussion in Sect. 4.1).
Out of the 26 possible combinations.
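
The two agreement coefficients mentioned in the notes above (the linearly weighted kappa for two raters and the Fleiss coefficient for several raters) can be illustrated with a short sketch. This is not the authors' code: the function names, the input layout (one label per rater and sentence) and the 1 to 5 score scale in the usage note are assumptions made for illustration only, and the formulas are the standard textbook definitions.

```python
from collections import Counter


def linear_weighted_kappa(rater_a, rater_b, categories):
    """Cohen's kappa with linear disagreement weights for two raters.

    `categories` must list the ordinal labels in increasing order.
    Undefined (division by zero) if both raters use a single category.
    """
    index = {c: i for i, c in enumerate(categories)}
    k, n = len(categories), len(rater_a)
    # Observed joint proportions and per-rater marginals.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n
    marg_a = [sum(row) for row in observed]
    marg_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Linear weights: disagreement grows with the distance between scores.
    obs_dis = exp_dis = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)
            obs_dis += w * observed[i][j]
            exp_dis += w * marg_a[i] * marg_b[j]
    return 1.0 - obs_dis / exp_dis


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa; `ratings` holds one list of labels per item,
    with the same number of raters for every item."""
    n_items, n_raters = len(ratings), len(ratings[0])
    counts = [Counter(item) for item in ratings]
    # Per-item agreement: fraction of rater pairs that agree.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the pooled category proportions.
    p_cat = [
        sum(c[cat] for c in counts) / (n_items * n_raters)
        for cat in categories
    ]
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1.0 - p_exp)
```

For instance, `linear_weighted_kappa(scores_judge_1, scores_judge_2, [1, 2, 3, 4, 5])` would compare two judges on a hypothetical 1 to 5 scale, while `fleiss_kappa` takes one list of three labels per sentence when three judges are available. With linear weights, a disagreement of one point is penalised half as much as a disagreement of two points on a three-point scale, which is why this variant suits ordinal scores.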

References

Albrecht J, Hwa R (2007) A re-examination of machine learning approaches for sentence-level MT evaluation. In: Proceedings of the 45th annual meeting of the Association of Computational Linguistics, Association for Computational Linguistics, Prague, Czech Republic, pp 880–887
Artstein R, Poesio M (2008) Inter-coder agreement for computational linguistics. Comput Linguist 34(4):555–596. doi:10.1162/coli.07-034-R2
Bach N, Huang F, Al-Onaizan Y (2011) Goodness: a method for measuring machine translation confidence. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Association for Computational Linguistics, Portland, Oregon, USA, pp 211–219
Bartko J (1966) The intraclass correlation coefficient as a measure of reliability. Psychol Rep 19:3–11
Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2004) Confidence estimation for machine translation. In: Proceedings of COLING 2004, Geneva, Switzerland, pp 315–321
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Brown PF, Cocke J, Pietra SD, Pietra VJD, Jelinek F, Lafferty JD, Mercer RL, Roossin PS (1990) A statistical approach to machine translation. Comput Linguist 16(2):79–85
Burman P, Nolan D (1995) A general Akaike-type criterion for model selection in robust regression. Biometrika 82(4):877–886
Burstein J, Kukich K, Wolff S, Lu C, Chodorow M, Braden-Harder L, Harris MD (1998) Automated scoring using a hybrid feature identification technique. In: Proceedings of the 36th annual meeting of the Association for Computational Linguistics and 17th international conference on computational linguistics, vol 1. Association for Computational Linguistics, Montreal, Quebec, Canada, pp 206–210. doi:10.3115/980845.980879
Callison-Burch C, Koehn P, Monz C, Post M, Soricut R, Specia L (2012) Findings of the 2012 workshop on statistical machine translation. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 10–51
Drucker H, Burges CJC, Kaufman L, Smola AJ, Vapnik V (1996) Support vector regression machines. In: Mozer M, Jordan MI, Petsche T (eds) NIPS. MIT Press, Cambridge, MA, pp 155–161
Efron B, Tibshirani R (1993) An introduction to the Bootstrap. Chapman and Hall/CRC monographs on statistics and applied probability series, Chapman & Hall, New York
Felice M, Specia L (2012) Linguistic features for quality estimation. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 96–103
Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76(5):378–382
de Gispert A, Blackwood G, Iglesias G, Byrne W (2012) N-gram posterior probability confidence measures for statistical machine translation: an empirical study. Mach Transl 27(2):85–114
Gretton A, Smola A, Huang J, Schmittfull M, Borgwardt K, Schölkopf B (2009) Covariate shift and local learning by distribution matching. MIT Press, Cambridge, MA
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor 11:10–18
Hastie T, Tibshirani R, Friedman JH (2003) The elements of statistical learning. Springer, New York
Kanungo T, Orr D (2009) Predicting the readability of short web summaries. In: Proceedings of the second ACM international conference on web search and data mining, ACM, New York, NY, USA, pp 202–211
Koehn P (2010) Statistical machine translation. Cambridge University Press, Cambridge
Koehn P, Monz C (2006) Manual and automatic evaluation of machine translation between European languages. In: Proceedings on the workshop on statistical machine translation, Association for Computational Linguistics, New York, pp 102–121
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, Association for Computational Linguistics, Prague, Czech Republic, pp 177–180
Koponen M (2012) Comparing human perceptions of post-editing effort with post-editing operations. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 181–190
Le HS, Oparin I, Allauzen A, Gauvain JL, Yvon F (2011) Structured output layer neural network language model. In: Proceedings of IEEE international conference on acoustic, speech and signal processing, Prague, Czech Republic, pp 5524–5527
Le HS, Lavergne T, Allauzen A, Apidianaki M, Gong L, Max A, Sokolov A, Wisniewski G, Yvon F (2012) LIMSI @ WMT12. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 330–337
Narendra PM, Fukunaga K (1977) A branch and bound algorithm for feature subset selection. IEEE Trans Comput 26(9):917–922
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Popović M (2011) Morphemes and POS tags for n-gram based evaluation metrics. In: Proceedings of the sixth workshop on statistical machine translation, Association for Computational Linguistics, Edinburgh, Scotland, pp 104–107
Popović M (2012) Morpheme- and POS-based IBM1 and language model scores for translation quality estimation. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 133–137
Popović M, Vilar D, Avramidis E, Burchardt A (2011) Evaluation without references: IBM1 scores as evaluation metrics. In: Proceedings of the sixth workshop on statistical machine translation, Association for Computational Linguistics, Edinburgh, Scotland, pp 99–103
Quinlan RJ (1992) Learning with continuous classes. In: 5th Australian joint conference on artificial intelligence, pp 343–348
Quiñonero-Candela J, Sugiyama M, Schwaighofer A, Lawrence ND (2009) Dataset shift in machine learning. MIT Press, Cambridge
Quirk C (2004) Training a sentence-level machine translation confidence metric. In: Proceedings of the 4th international conference on language resources and evaluation, pp 825–828
Rifkin RM, Lippert RA (2007) Notes on regularized least squares. Tech Rep MIT-CSAIL-TR-2007-025, MIT-CSAIL
Rubino R, Foster J, Wagner J, Roturier J, Samad Zadeh Kaljahi R, Hollowood F (2012) DCU-Symantec submission for the WMT 2012 quality estimation task. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 138–144
Schmid H (1995) Improvements in part-of-speech tagging with an application to German. In: Proceedings of the ACL SIGDAT-Workshop, ACL, Dublin, Ireland
Shrout P, Fleiss J (1979) Intraclass correlation: uses in assessing rater reliability. Psychol Bull 86:420–428
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of AMTA, pp 223–231
Somers H (2003) Computers and translation: a translator’s guide. John Benjamins Publishing Company, Amsterdam
Soricut R, Echihabi A (2010) Trustrank: inducing trust in automatic translations via ranking. In: Proceedings of the 48th annual meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp 612–621
Soricut R, Bach N, Wang Z (2012) The SDL Language Weaver systems in the WMT12 quality estimation shared task. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 145–151
Specia L (2011) Exploiting objective annotations for measuring translation post-editing effort. In: Proceedings of the 15th conference of EAMT, Leuven, Belgium, pp 73–80
Specia L, Raj D, Turchi M (2010) Machine translation evaluation versus quality estimation. Mach Transl 24(1):39–50
Wang Y, Witten IH (1997) Induction of model trees for predicting continuous classes. In: 9th European conference on machine learning, Springer, pp 128–137
Xiong D, Zhang M, Li H (2010) Error detection for statistical machine translation using linguistic features. In: Proceedings of the 48th annual meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp 604–611
Zhuang Y, Wisniewski G, Yvon F (2012) Non-linear models for confidence estimation. In: Proceedings of the seventh workshop on statistical machine translation, Association for Computational Linguistics, Montréal, Canada, pp 157–162

Acknowledgments

The authors wish to thank Le Hai Son for his help with the SOUL language model. This work was partially funded by the French National Research Agency under project ANR-CONTINT-TRACE.

Author information

Authors and Affiliations

LIMSI—Université Paris Sud, Orsay, France
Guillaume Wisniewski & François Yvon
LIMSI—CNRS, Orsay, France
Anil Kumar Singh

Corresponding author

Correspondence to Guillaume Wisniewski.

Appendices

Appendix I: Detailed features list

Table 8 gives the complete list of features that we worked with and from which the feature sets used in our experiments were drawn. All language models were trained on the WMT’12 monolingual data. For POS-based features, the POS tags were obtained with TreeTagger.

Table 8 Complete list of features that we started with (the All set)

Note that there are a few redundant features in the above list. For example, the sentence length features are present as part of the baseline features category as well as separately (Table 9).

Table 9 List of features used in the 4 sets used in our experiments (as indexed in Table 8)

Appendix II: Mapping of POS Tags

See Table 10.

Table 10 Mapping of POS tags to syntactic categories as used for poscount features
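
As an illustration of how such a mapping can feed count features, the sketch below collapses fine-grained POS tags into coarse categories and counts them per sentence. The mapping, the category names, the `other` fallback and the function name are all hypothetical: they do not reproduce the contents of Table 10, and the tags shown are ordinary Penn Treebank style tags chosen only as an example.

```python
from collections import Counter

# Hypothetical coarse mapping for illustration; NOT the contents of Table 10.
TAG_TO_CATEGORY = {
    "NN": "noun", "NNS": "noun",
    "VB": "verb", "VBD": "verb", "VBZ": "verb",
    "JJ": "adjective",
    "RB": "adverb",
}


def poscount_features(tags, normalize=True):
    """Count coarse syntactic categories in one tagged sentence.

    Tags missing from the mapping are grouped under "other".
    With normalize=True, counts are divided by the sentence length.
    """
    counts = Counter(TAG_TO_CATEGORY.get(tag, "other") for tag in tags)
    if normalize and tags:
        return {cat: n / len(tags) for cat, n in counts.items()}
    return dict(counts)
```

Calling `poscount_features(["NN", "VBZ", "JJ", "NN"], normalize=False)` gives two nouns, one verb and one adjective. Normalising by sentence length is one possible design choice, made here so that the counts are comparable across sentences of different lengths; the source does not state whether the original features were normalised.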

About this article

Cite this article

Wisniewski, G., Singh, A.K. & Yvon, F. Quality estimation for machine translation: some lessons learned. Machine Translation 27, 213–238 (2013). https://doi.org/10.1007/s10590-013-9141-9

Received: 08 October 2012
Accepted: 07 May 2013
Published: 30 August 2013
Issue Date: December 2013
DOI: https://doi.org/10.1007/s10590-013-9141-9
