Sentence-level ranking with quality estimation

Machine Translation

Abstract

Starting from human annotations, we present a machine-learning strategy that performs sentence-level preference ranking of alternative machine translations of the same source. Rankings are decomposed into pairwise comparisons so that they can be learned by binary classifiers, using black-box features derived from linguistic analysis. To recompose a ranking from the classifier’s pairwise decisions, the decisions are weighted by their classification probabilities, which increases the correlation coefficient by 80%. We also demonstrate several configurations of successful automatic ranking models. The best configurations achieve a correlation with human judgments of 0.27, measured by Kendall’s tau. Although the method does not use reference translations, this correlation is comparable to that achieved by state-of-the-art reference-aware automatic evaluation metrics such as smoothed BLEU, METEOR and Levenshtein distance.
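The decomposition and recomposition steps described in the abstract can be illustrated with a minimal Python sketch. It is not the article's implementation: the function names `decompose` and `recompose`, the toy probabilities, the skipping of tied pairs and the simple score-summing scheme are all assumptions made for illustration, and the binary classifier is replaced by hand-written probabilities.

```python
from itertools import combinations


def decompose(ranks):
    """Split one ranking (system -> rank, 1 = best) into pairwise labels.

    Returns (a, b, label) triples: label is 1 if a is ranked better than b,
    -1 if it is ranked worse. Tied pairs are skipped (an assumption).
    """
    pairs = []
    for a, b in combinations(sorted(ranks), 2):
        if ranks[a] != ranks[b]:
            pairs.append((a, b, 1 if ranks[a] < ranks[b] else -1))
    return pairs


def recompose(systems, pairwise_probs, weighted=True):
    """Rebuild a ranking from pairwise decisions (hypothetical scheme).

    pairwise_probs maps (a, b) -> P(a is better than b), standing in for
    the output of some binary classifier.
    weighted=False: every pair gives one full win to the predicted winner.
    weighted=True: every pair contributes its probability to both sides.
    """
    score = {s: 0.0 for s in systems}
    for (a, b), p in pairwise_probs.items():
        if weighted:
            score[a] += p
            score[b] += 1.0 - p
        else:
            score[a if p >= 0.5 else b] += 1.0
    # Higher score means better rank; equal scores share a rank.
    ranks, rank, previous = {}, 0, None
    for s in sorted(systems, key=lambda s: -score[s]):
        if score[s] != previous:
            rank += 1
            previous = score[s]
        ranks[s] = rank
    return ranks


# Toy probabilities (made up): the hard decisions form a cycle A > B > C > A.
probs = {("A", "B"): 0.9, ("A", "C"): 0.45, ("B", "C"): 0.55}
hard = recompose(["A", "B", "C"], probs, weighted=False)
soft = recompose(["A", "B", "C"], probs, weighted=True)
```

In this toy case the unweighted votes give every system one win, so all three share rank 1, whereas the probability-weighted scores separate them into A, C, B. This only illustrates one way in which weighting can help; it does not reproduce the reported improvement.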

Notes

  1. Decomposing the recomposed ranking again, rather than reusing the initially decomposed pairs (Sect. 3.2), allows tau to compare the success of the recomposition methods (Sects. 3.3.1, 3.3.2).

  2. \(\tau _{\mu }\) is the tau calculation that appears in WMT results.

  3. In all of the experiments we exclude the crowdsourced sentences of 2010.

  4. Classification accuracy in Table 2 is calculated with cross-validation on the training set.

  5. Nevertheless, annotator disagreement is a factor that could increase the data noise.

  6. Reducing the multiple ranking spans into one ranking has recently been a matter of discussion, as recent criticism advocates treating it as a tournament (Lopez 2012). For now we follow the standard procedure used by WMT until 2012.

  7. The processed data sets can be found at http://www.dfki.de/~elav01/download/mtj12.

  8. http://www.acrolinx.com (proprietary).

  9. We tried to come as close as possible to the original feature sets when not all features were technically available.
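Notes 1 and 2 refer to computing tau over pairwise comparisons. A minimal Python sketch of such a calculation follows, using the common form (concordant - discordant) / (concordant + discordant). The function name `tau_from_pairs` and the tie handling (pairs tied in the human ranking are skipped, pairs tied only in the predicted ranking count as discordant) are assumptions for illustration and may differ from the variant used in the article or in WMT.

```python
from itertools import combinations


def tau_from_pairs(predicted, human):
    """Kendall-style tau between two rankings (system -> rank, 1 = best).

    Each ranking is decomposed into pairs and the pairs are compared.
    Assumptions: pairs tied in the human ranking are skipped; pairs tied
    only in the predicted ranking are counted as discordant.
    """
    concordant = discordant = 0
    for a, b in combinations(sorted(human), 2):
        h = human[a] - human[b]
        if h == 0:
            continue  # no human preference for this pair
        p = predicted[a] - predicted[b]
        if p * h > 0:
            concordant += 1  # both rankings order the pair the same way
        else:
            discordant += 1  # opposite order, or a predicted tie
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


human = {"A": 1, "B": 2, "C": 3}
predicted = {"A": 1, "C": 2, "B": 3}
tau = tau_from_pairs(predicted, human)  # two concordant pairs, one discordant
```

Here two of the three pairs agree with the human ranking and one disagrees, so tau is (2 - 1) / 3.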

References

  • Avramidis E (2011) DFKI system combination with sentence ranking at ML4HMT-2011. In: Proceedings of the international workshop on using linguistic information for hybrid machine translation and of the shared task on applying machine learning techniques to optimising the division of labour in hybrid machine translation, Barcelona, Spain, pp 99–103

  • Avramidis E, Popović M, Vilar D, Burchardt A (2011) Evaluate with confidence estimation: machine ranking of translation outputs using grammatical features. In: Proceedings of the sixth workshop on statistical machine translation, Edinburgh, UK, pp 65–70

  • Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2004) Confidence estimation for machine translation. In: Proceedings of the 20th international conference on Computational Linguistics, Stroudsburg, PA, USA

  • Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2007) (Meta-) Evaluation of machine translation. In: Proceedings of the second workshop on statistical machine translation, Prague, Czech Republic, pp 136–158

  • Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2008) Further meta-evaluation of machine translation. In: Proceedings of the third workshop on statistical machine translation, Columbus, Ohio, pp 70–106

  • Callison-Burch C, Koehn P, Monz C, Schroeder J (2009) Findings of the 2009 workshop on statistical machine translation. In: Proceedings of the fourth workshop on statistical machine translation, Athens, Greece, pp 1–28

  • Callison-Burch C, Koehn P, Monz C, Peterson K, Przybocki M, Zaidan O (2010) Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In: Proceedings of the joint fifth workshop on statistical machine translation and metricsMATR, Uppsala, Sweden, pp 17–53

  • Callison-Burch C, Koehn P, Monz C, Zaidan O (2011) Findings of the 2011 workshop on statistical machine translation. In: Proceedings of the sixth workshop on statistical machine translation, Edinburgh, UK, pp 22–64

  • Callison-Burch C, Koehn P, Monz C, Post M, Soricut R, Specia L (2012) Findings of the 2012 workshop on statistical machine translation. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 10–51

  • Cameron A (1998) Regression analysis of count data. Cambridge University Press, Cambridge

  • Cleveland WS (1979) Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 74(368):829–836

  • Coomans D, Massart D (1982) Alternative k-nearest neighbour rules in supervised pattern recognition. Anal Chimica Acta 138:15–27

  • Demšar J, Zupan B, Leban G, Curk T (2004) Orange: from experimental machine learning to interactive data mining. In: Principles of data mining and knowledge discovery, pp 537–539

  • Duh K (2008) Ranking vs. regression in machine translation evaluation. In: Proceedings of the third workshop on statistical machine translation, Columbus, Ohio, pp 191–194

  • Federmann C, Avramidis E, Costa-jussà MR, van Genabith J, Melero M, Pecina P (2012) The ML4HMT workshop on optimising the division of labour in hybrid machine translation. In: Proceedings of the 8th ELRA conference on language resources and evaluation, Istanbul, Turkey

  • Goodstadt L (2010) Ruffus: a lightweight Python library for computational pipelines. Bioinformatics 26(21):2778–2779

  • He Y, Ma Y, van Genabith J, Way A (2010) Bridging SMT and TM with translation recommendation. In: Proceedings of the 48th annual meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp 622–630

  • Herbrich R, Graepel T, Obermayer K (1999) Support vector learning for ordinal regression. In: International conference on artificial neural networks, pp 97–102

  • Hopkins M, May J (2011) Tuning as ranking. In: Proceedings of the conference on empirical methods in natural language processing, Edinburgh, UK, pp 1352–1362

  • Hosmer D (1989) Applied logistic regression, 8th edn. Wiley, New York

  • Hüllermeier E, Fürnkranz J, Cheng W, Brinker K (2008) Label ranking by learning pairwise preferences. Artif Intell 172(16–17):1897–1916

  • Kendall MG (1938) A new measure of rank correlation. Biometrika 30(1–2):81–93

  • Khedr AM (2008) Learning k-nearest neighbors classifier from distributed data. Comput Inform 27(3):355–376

  • Knight WR (1966) A computer method for calculating Kendall’s tau with ungrouped data. J Am Stat Assoc 61(314):436–439

  • Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Conference proceedings: the tenth machine translation summit, AAMT, Phuket, Thailand, pp 79–86

  • Lavie A, Agarwal A (2007) METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of the second workshop on statistical machine translation, Prague, Czech Republic, pp 228–231

  • Levenshtein V (1966) Binary codes capable of correcting deletions, insertions and reversals. Sov Phys Doklady 10(8):707–710

  • Lopez A (2012) Putting human assessments of machine translation systems in order. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 1–9

  • Miller A (2002) Subset selection in regression, 2nd edn. Chapman & Hall, London

  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, pp 311–318

  • Parton K, Tetreault J, Madnani N, Chodorow M (2011) E-rating machine translation. In: Proceedings of the sixth workshop on statistical machine translation, Edinburgh, UK, pp 108–115

  • Petrov S, Klein D (2007) Improved inference for unlexicalized parsing. In: Proceedings of the conference of the North American chapter of the Association for Computational Linguistics, Rochester, NY, pp 404–411

  • Petrov S, Barrett L, Thibaux R, Klein D (2006) Learning accurate, compact, and interpretable tree annotation. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the Association for Computational Linguistics, Sydney, Australia, pp 433–440

  • Raybaud S, Lavecchia C, Langlois D, Kamel S (2009a) Word- and sentence-level confidence measures for machine translation. In: 13th Annual meeting of the European Association for Machine Translation, European Association of Machine Translation, Barcelona, Spain

  • Raybaud S, Lavecchia C, Langlois D, Kamel S (2009b) New confidence measures for statistical machine translation. In: Proceedings of the international conference on agents, pp 394–401

  • Rosti AV, Ayan NF, Xiang B, Matsoukas S, Schwartz R, Dorr BJ (2007) Combining outputs from multiple machine translation systems. In: Proceedings of the North American chapter of the Association for Computational Linguistics Human Language Technologies, Rochester, NY, pp 228–235

  • Sánchez-Martínez F (2011) Choosing the best machine translation system to translate a sentence by using only source-language information. In: Proceedings of the 15th annual conference of the European Association for Machine Translation, Leuven, Belgium, pp 97–104

  • Siegel M (2011) Autorenunterstützung für die Maschinelle Übersetzung. In: Multilingual resources and multilingual applications: proceedings of the conference of the German Society for computational linguistics and language technology (GSCL), Hamburg

  • Soricut R, Narsale S (2012) Combining quality prediction and system selection for improved automatic translation output. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 163–170

  • Soricut R, Wang Z, Bach N (2012) The SDL language weaver systems in the WMT12 quality estimation shared task. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 145–151

  • Specia L, Turchi M, Cancedda N, Dymetman M, Cristianini N (2009) Estimating the sentence-level quality of machine translation systems. In: 13th annual meeting of the European Association for Machine Translation, Barcelona, Spain, pp 28–35

  • Specia L, Raj D, Turchi M (2010) Machine translation evaluation versus quality estimation. Mach Transl 24(1):39–50

  • Specia L, Felice M (2012) Linguistic features for quality estimation. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 96–103

  • Stolcke A (2002) SRILM—an extensible language modeling toolkit. In: Proceedings of the seventh international conference on spoken language processing, pp 901–904

  • Ueffing N, Ney H (2005) Word-level confidence estimation for machine translation using phrase-based translation models. Comput Linguist, pp 763–770

  • Vilar D, Avramidis E, Popović M, Hunsicker S (2011) DFKI’s SC and MT submissions to IWSLT 2011. In: Proceedings of the international workshop on spoken language translation 2011, San Francisco, CA, USA, pp 98–105

  • Wagner J, Foster J (2009) The effect of correcting grammatical errors on parse probabilities. In: Proceedings of the 11th international conference on parsing technologies, Stroudsburg, PA, USA, pp 176–179

  • Ye Y, Zhou M, Lin CY (2007) Sentence level machine translation evaluation as a ranking problem: one step aside from BLEU. In: Proceedings of the second workshop on statistical machine translation, Association for Computational Linguistics, Prague, Czech Republic, pp 240–247

Acknowledgments

This work has been developed within the TaraXÜ project, financed by TSB Technologiestiftung Berlin – Zukunftsfonds Berlin, co-financed by the European Union – European fund for regional development. Many thanks to Prof. Hans Uszkoreit for his supervision, to Dr. Aljoscha Burchardt, Dr. Maja Popović and Dr. David Vilar for their useful feedback, to Prof. Melanie Siegel for her support concerning the language checking tool and to Lukas Poustka for his technical help on feature acquisition.

Author information

Corresponding author

Correspondence to Eleftherios Avramidis.

About this article

Cite this article

Avramidis, E. Sentence-level ranking with quality estimation. Machine Translation 27, 239–256 (2013). https://doi.org/10.1007/s10590-013-9144-6

  • DOI: https://doi.org/10.1007/s10590-013-9144-6
