Abstract
Significant breakthroughs in machine translation (MT) seem possible only if human translators are brought into the loop. While automatic evaluation and scoring mechanisms such as BLEU have enabled fast system development, it remains unclear how systems can meet real-world quality requirements in industrial translation scenarios today. The taraXÜ project has paved the way for wider use of multiple MT outputs through various feedback loops in system development, integrating human translators into the development process and thereby collecting feedback for possible improvements. This paper describes the results of a detailed human evaluation in which the performance of different types of translation systems was compared and analysed via ranking, error analysis and post-editing.
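For readers unfamiliar with the automatic metric the abstract contrasts with human evaluation, a minimal sketch of sentence-level BLEU (clipped n-gram precision combined with a brevity penalty, in the spirit of Papineni et al. 2001) might look as follows. The function name and the smoothing constant are illustrative assumptions, not part of the project's tooling, and real toolkits compute BLEU at the corpus level with more careful smoothing.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Tiny floor avoids log(0) for short sentences (illustrative smoothing).
        log_prec_sum += math.log(max(clipped, 1e-9) / total)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / max_n)
```

A perfect match scores 1.0; any divergence in word choice, order or length lowers the score, which is precisely why the paper complements BLEU with human ranking, error analysis and post-editing.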
Notes
For reasons of required anonymisation.
Note that this must be seen as an experiment: it was done in order to simulate the use of translation memories (TMs), although it does not mirror their exact use in the translation industry.
More publications can be found online: http://taraxu.dfki.de/publications
References
Alonso, J. A., & Thurmair, G. (2003). The comprendium translator system. In: Proceedings of the Ninth Machine Translation Summit.
Avramidis, E., Popović, M., Vilar, D., & Burchardt, A. (2011). Evaluate with confidence estimation: Machine ranking of translation outputs using grammatical features. In: Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland. Association for Computational Linguistics, pp. 65–70.
Burchardt, A., Tscherwinka, C., & Avramidis, E. (2013). Machine translation at work, studies in computational intelligence (241–261) (Vol. 458). Berlin: Springer.
Callison-Burch, C., Koehn, P., Monz, C., Peterson, K., Przybocki, M., & Zaidan, O. (2010). Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In: Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR. Association for Computational Linguistics, Uppsala, Sweden, pp. 17–53, revised August 2010.
Eisele, A., & Chen, Y. (2010). MultiUN: A multilingual corpus from United Nations documents. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), La Valletta, Malta, pp. 2868–2872.
Farzindar, A., & Lapalme, G. (2009). Machine translation of legal information and its evaluation. In: Proceedings of the 22nd Canadian Conference on Artificial Intelligence (Canadian AI 09), Kelowna, BC, pp. 64–73.
Federmann, C. (2010). Appraise: An open-source toolkit for manual phrase-based evaluation of translations. In: Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC 2010), La Valletta, Malta.
He, Y., Ma, Y., Roturier, J., Way, A., & van Genabith, J. (2010). Improving the post-editing experience using translation recommendation: A user study. In: Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver, Colorado.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., et al. (2007). Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, Stroudsburg, PA, USA, ACL ’07, pp. 177–180.
Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2001). Bleu: A method for automatic evaluation of machine translation. IBM Research Report RC22176(W0109–022), IBM.
Popović, M. (2011). Hjerson: An open source tool for automatic error classification of machine translation output. The Prague Bulletin of Mathematical Linguistics, 96, 59–68.
Specia, L., & Farzindar, A. (2010). Estimating machine translation post-editing effort with HTER. In: Proceedings of AMTA-2010 Workshop Bringing MT to the User. MT Research and the Translation Industry, Denver, Colorado.
Tiedemann, J. (2009). News from OPUS—A collection of multilingual parallel corpora with tools and interfaces. In: N. Nicolov, K. Bontcheva, G. Angelova & R. Mitkov (Eds.), Advances in natural language processing (Vol. V, pp. 237–248). Borovets, Bulgaria.
Vilar, D., Xu, J., D’Haro, L. F., & Ney, H. (2006). Error analysis of machine translation output. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 697–702.
Vilar, D., Stein, D., Huck, M., & Ney, H. (2010). Jane: Open source hierarchical translation, extended with reordering and lexicon models. In: ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, Uppsala, Sweden, pp. 262–270.
Acknowledgments
This work has been developed within the taraXÜ project, financed by TSB Technologiestiftung Berlin (Zukunftsfonds Berlin) and co-financed by the European Union (European Regional Development Fund). Thanks to our colleague Christian Federmann for helping with the Appraise system.
Cite this article
Popović, M., Avramidis, E., Burchardt, A. et al. Involving language professionals in the evaluation of machine translation. Lang Resources & Evaluation 48, 541–559 (2014). https://doi.org/10.1007/s10579-014-9286-z