Sentence-level ranking with quality estimation

Avramidis, Eleftherios

doi:10.1007/s10590-013-9144-6

Published: 30 August 2013

Machine Translation, volume 27, pages 239–256 (2013)

Abstract

Starting from human annotations, we present a machine-learning strategy that performs sentence-level preference ranking of alternative machine translations of the same source. Rankings are decomposed into pairwise comparisons so that they can be learned by binary classifiers, using black-box features derived from linguistic analysis. To recompose a ranking from the classifier’s pairwise decisions, the decisions are weighted by their classification probabilities, which increases the correlation coefficient by 80 %. We also demonstrate several configurations of successful automatic ranking models. The best configurations achieve a correlation with human judgments of 0.27, measured by Kendall’s tau. Although the method does not use reference translations, this correlation is comparable to that achieved by state-of-the-art reference-aware automatic evaluation metrics such as smoothed BLEU, METEOR and Levenshtein distance.
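The decomposition and recomposition steps described in the abstract can be illustrated with a minimal Python sketch. It is not the article's implementation: the function names, the choice to skip tied pairs, and the scoring rule (summing each system's win probabilities as soft votes) are assumptions made for illustration, and the classifier probabilities in the usage note are invented.

```python
from itertools import combinations


def decompose(ranks):
    """Turn one ranking (system -> rank, 1 = best) into pairwise labels.

    Returns (a, b, label) triples, where label is 1 if a is ranked better
    than b and 0 if it is ranked worse. Tied pairs are skipped, which is
    an assumption of this sketch.
    """
    pairs = []
    for a, b in combinations(sorted(ranks), 2):
        if ranks[a] == ranks[b]:
            continue
        pairs.append((a, b, 1 if ranks[a] < ranks[b] else 0))
    return pairs


def recompose(systems, prob_a_better, weighted=True):
    """Rebuild a ranking from pairwise classifier decisions.

    prob_a_better maps (a, b) to the (hypothetical) classifier probability
    that a is better than b. With weighted=False every pair casts one hard
    vote for its winner; with weighted=True the vote is the probability.
    """
    score = {s: 0.0 for s in systems}
    for (a, b), p in prob_a_better.items():
        if weighted:
            score[a] += p
            score[b] += 1.0 - p
        else:
            score[a if p >= 0.5 else b] += 1.0
    # Highest score first; equal scores are broken alphabetically.
    ordered = sorted(systems, key=lambda s: (-score[s], s))
    return {s: position + 1 for position, s in enumerate(ordered)}
```

With the invented probabilities {("A", "B"): 0.9, ("B", "C"): 0.6, ("A", "C"): 0.45}, hard voting gives every system exactly one win, so the order is decided only by the alphabetical tie-break. The weighted variant separates the three systems and ranks them A, C, B. This is the kind of information that hard pairwise votes discard.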

Notes

Decomposing the previously recomposed ranking again, rather than using the initially decomposed pairs (Sect. 3.2), allows tau to compare the success of the recomposition methods (Sects. 3.3.1, 3.3.2).
\(\tau_{\mu}\) is the tau calculation that appears in the WMT results.
In all of the experiments we exclude the crowdsourced sentences of 2010.
Classification accuracy in Table 2 is calculated with cross-validation on the training set.
Nevertheless, annotator disagreement is a factor that could increase the data noise.
Reducing the multiple ranking spans into one ranking has lately been a subject of discussion, as recent criticism advocates treating the problem as a tournament (Lopez 2012). For now we still follow the standard procedure used by WMT until 2012.
The processed data sets can be found at http://www.dfki.de/~elav01/download/mtj12.
http://www.acrolinx.com (proprietary).
We tried to come as close as possible to the original feature sets when not all features were technically available.
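The notes above refer to Kendall's tau as the measure of agreement between a predicted ranking and the human ranking. The following Python sketch shows one common pairwise formulation, (concordant - discordant) / (concordant + discordant). How ties are treated is an assumption here (pairs tied in either ranking are skipped), so it may not match the \(\tau_{\mu}\) variant used in the article or in the WMT results.

```python
from itertools import combinations


def kendall_tau(predicted, human):
    """Pairwise Kendall's tau between two rankings of the same systems.

    Both arguments map system name -> rank (1 = best). Pairs that are
    tied in either ranking are skipped; this tie handling is an
    assumption of the sketch, not taken from the article.
    """
    concordant = discordant = 0
    for a, b in combinations(sorted(human), 2):
        h = human[a] - human[b]
        p = predicted[a] - predicted[b]
        if h == 0 or p == 0:
            continue
        if (h > 0) == (p > 0):
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    # Return 0.0 when no comparable pair exists, to avoid dividing by zero.
    return (concordant - discordant) / total if total else 0.0
```

For a human ranking A, B, C and a predicted ranking A, C, B, two of the three pairs agree and one disagrees, giving tau = (2 - 1) / 3. Identical rankings give 1.0 and fully reversed rankings give -1.0.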

References

Avramidis E (2011) DFKI system combination with sentence ranking at ML4HMT-2011. In: Proceedings of the international workshop on using linguistic information for hybrid machine translation and of the shared task on applying machine learning techniques to optimising the division of labour in hybrid machine translation, Barcelona, Spain, pp 99–103
Avramidis E, Popović M, Vilar D, Burchardt A (2011) Evaluate with confidence estimation: machine ranking of translation outputs using grammatical features. In: Proceedings of the sixth workshop on statistical machine translation, Edinburgh, UK, pp 65–70
Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2004) Confidence estimation for machine translation. In: Proceedings of the 20th international conference on Computational Linguistics, Stroudsburg, PA, USA
Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2007) (Meta-) Evaluation of machine translation. In: Proceedings of the second workshop on statistical machine translation, Prague, Czech Republic, pp 136–158
Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2008) Further meta-evaluation of machine translation. In: Proceedings of the third workshop on statistical machine translation, Columbus, Ohio, pp 70–106
Callison-Burch C, Koehn P, Monz C, Schroeder J (2009) Findings of the 2009 workshop on statistical machine translation. In: Proceedings of the fourth workshop on statistical machine translation, Athens, Greece, pp 1–28
Callison-Burch C, Koehn P, Monz C, Peterson K, Przybocki M, Zaidan O (2010) Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In: Proceedings of the joint fifth workshop on statistical machine translation and MetricsMATR, Uppsala, Sweden, pp 17–53
Callison-Burch C, Koehn P, Monz C, Zaidan O (2011) Findings of the 2011 workshop on statistical machine translation. In: Proceedings of the sixth workshop on statistical machine translation, Edinburgh, UK, pp 22–64
Callison-Burch C, Koehn P, Monz C, Post M, Soricut R, Specia L (2012) Findings of the 2012 workshop on statistical machine translation. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 10–51
Cameron A (1998) Regression analysis of count data. Cambridge University Press, Cambridge
Cleveland WS (1979) Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 74(368):829–836
Coomans D, Massart D (1982) Alternative k-nearest neighbour rules in supervised pattern recognition. Anal Chimica Acta 138:15–27
Demšar J, Zupan B, Leban G, Curk T (2004) Orange: from experimental machine learning to interactive data mining. In: Principles of data mining and knowledge discovery, pp 537–539
Duh K (2008) Ranking vs. regression in machine translation evaluation. In: Proceedings of the third workshop on statistical machine translation, Columbus, Ohio, pp 191–194
Federmann C, Avramidis E, Ruiz MCj, van Genabith J, Melero M, Pecina P (2012) The ML4HMT workshop on optimising the division of labour in hybrid machine translation. In: Proceedings of the 8th ELRA conference on language resources and evaluation, Istanbul, Turkey
Goodstadt L (2010) Ruffus: a lightweight Python library for computational pipelines. Bioinformatics 26(21):2778–2779
He Y, Ma Y, van Genabith J, Way A (2010) Bridging SMT and TM with translation recommendation. In: Proceedings of the 48th annual meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp 622–630
Herbrich R, Graepel T, Obermayer K (1999) Support vector learning for ordinal regression. In: International conference on artificial neural networks, pp 97–102
Hopkins M, May J (2011) Tuning as ranking. In: Proceedings of the conference on empirical methods in natural language processing, Edinburgh, UK, pp 1352–1362
Hosmer D (1989) Applied logistic regression, 8th edn. Wiley, New York
Hüllermeier E, Fürnkranz J, Cheng W, Brinker K (2008) Label ranking by learning pairwise preferences. Artif Intell 172(16–17):1897–1916
Kendall MG (1938) A new measure of rank correlation. Biometrika 30(1–2):81–93
Khedr AM (2008) Learning k-nearest neighbors classifier from distributed data. Comput Inform 27(3):355–376
Knight WR (1966) A computer method for calculating Kendall’s tau with ungrouped data. J Am Stat Assoc 61(314):436–439
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Conference proceedings: the tenth machine translation summit, AAMT, Phuket, Thailand, pp 79–86
Lavie A, Agarwal A (2007) METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of the second workshop on statistical machine translation, Prague, Czech Republic, pp 228–231
Levenshtein V (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Doklady 10(8):707–710
Lopez A (2012) Putting human assessments of machine translation systems in order. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 1–9
Miller A (2002) Subset selection in regression, 2nd edn. Chapman & Hall, London
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, pp 311–318
Parton K, Tetreault J, Madnani N, Chodorow M (2011) E-rating machine translation. In: Proceedings of the sixth workshop on statistical machine translation, Edinburgh, UK, pp 108–115
Petrov S, Klein D (2007) Improved inference for unlexicalized parsing. In: Proceedings of the conference of the North American chapter of the Association for Computational Linguistics, Rochester, NY, pp 404–411
Petrov S, Barrett L, Thibaux R, Klein D (2006) Learning accurate, compact, and interpretable tree annotation. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the Association for Computational Linguistics, Sydney, Australia, pp 433–440
Raybaud S, Lavecchia C, David L, Kamel S (2009a) Word- and sentence-level confidence measures for machine translation. In: 13th Annual meeting of the European Association for Machine Translation, European Association of Machine Translation, Barcelona, Spain
Raybaud S, Lavecchia C, Langlois D, Kamel S (2009b) New confidence measures for statistical machine translation. In: Proceedings of the international conference on agents, pp 394–401
Rosti AV, Ayan NF, Xiang B, Matsoukas S, Schwartz R, Dorr BJ (2007) Combining outputs from multiple machine translation systems. In: Proceedings of the North American chapter of the Association for Computational Linguistics Human Language Technologies, Rochester, NY, pp 228–235
Sánchez-Martínez F (2011) Choosing the best machine translation system to translate a sentence by using only source-language information. In: Proceedings of the 15th annual conference of the European Association for Machine Translation, Leuven, Belgium, pp 97–104
Siegel M (2011) Autorenunterstützung für die Maschinelle Übersetzung. In: Multilingual resources and multilingual applications: proceedings of the conference of the German Society for computational linguistics and language technology (GSCL), Hamburg
Soricut R, Narsale S (2012) Combining quality prediction and system selection for improved automatic translation output. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 163–170
Soricut R, Wang Z, Bach N (2012) The SDL language weaver systems in the WMT12 quality estimation shared task. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 145–151
Specia L, Turchi M, Cancedda N, Dymetman M, Cristianini N (2009) Estimating the sentence-level quality of machine translation systems. In: 13th annual meeting of the European Association for Machine Translation, Barcelona, Spain, pp 28–35
Specia L, Raj D, Turchi M (2010) Machine translation evaluation versus quality estimation. Mach Transl 24(1):39–50
Specia L, Felice M (2012) Linguistic features for quality estimation. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 96–103
Stolcke A (2002) SRILM—an extensible language modeling toolkit. In: Proceedings of the seventh international conference on spoken language processing, pp 901–904
Ueffing N, Ney H (2005) Word-level confidence estimation for machine translation using phrase-based translation models. Comput Linguist, pp 763–770
Vilar D, Avramidis E, Popović M, Hunsicker S (2011) DFKI’s SC and MT submissions to IWSLT 2011. In: Proceedings of the international workshop on spoken language translation 2011. San Francisco, CA, USA, pp 98–105
Wagner J, Foster J (2009) The effect of correcting grammatical errors on parse probabilities. In: Proceedings of the 11th international conference on parsing technologies, Stroudsburg, PA, USA, pp 176–179
Ye Y, Zhou M, Lin CY (2007) Sentence level machine translation evaluation as a ranking problem: one step aside from BLEU. In: Proceedings of the second workshop on statistical machine translation, Association for Computational Linguistics, Prague, Czech Republic, pp 240–247

Acknowledgments

This work has been developed within the TaraXÜ project, financed by TSB Technologiestiftung Berlin – Zukunftsfonds Berlin, co-financed by the European Union – European fund for regional development. Many thanks to Prof. Hans Uszkoreit for the supervision, to Dr. Aljoscha Burchardt, Dr. Maja Popović and Dr. David Vilar for their useful feedback, to Prof. Melanie Siegel for her support concerning the language checking tool, and to Lukas Poustka for his technical help on feature acquisition.

Author information

Authors and Affiliations

Language Technology Lab, German Research Center for Artificial Intelligence (DFKI GmbH), Alt Moabit 91c, Berlin, Germany
Eleftherios Avramidis

Corresponding author

Correspondence to Eleftherios Avramidis.

About this article

Cite this article

Avramidis, E. Sentence-level ranking with quality estimation. Machine Translation 27, 239–256 (2013). https://doi.org/10.1007/s10590-013-9144-6

Received: 06 October 2012
Accepted: 17 May 2013
Published: 30 August 2013
Issue Date: December 2013
DOI: https://doi.org/10.1007/s10590-013-9144-6
