Abstract
The Cunei machine translation platform is an open-source system for data-driven machine translation. Our platform is a synthesis of the traditional example-based MT (EBMT) and statistical MT (SMT) paradigms. What makes Cunei unique is that it measures the relevance of each translation instance with a distance function. This distance function, represented as a log-linear model, operates over one translation instance at a time and enables us to score the translation instance relative to the specified input and/or the current target hypothesis. We describe how Cunei scores features individually for each translation instance and how it efficiently performs parameter tuning over the entire feature space. We also compare Cunei with three other open-source MT systems (Moses, CMU-EBMT, and Marclator). In our experiments involving Korean–English and Czech–English translation, Cunei clearly outperforms the traditional EBMT and SMT systems.
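To make the idea of a per-instance log-linear distance function concrete, the following is a minimal sketch and not Cunei's actual implementation. The feature names, weight values, and the log-sum-exp aggregation over instances are all illustrative assumptions; the abstract states only that a log-linear model scores one translation instance at a time.

```python
import math

# Hypothetical feature weights (the lambda parameters of a log-linear model).
# These names and values are assumptions chosen for illustration only.
WEIGHTS = {
    "context_overlap": 1.5,
    "alignment_quality": 2.0,
    "length_mismatch": -0.5,
}


def instance_score(features, weights=WEIGHTS):
    """Log-linear score of ONE translation instance: sum_k lambda_k * phi_k."""
    return sum(weights[name] * value for name, value in features.items())


def combined_score(instances, weights=WEIGHTS):
    """Assumed aggregation: log of the summed exponentiated instance scores.

    The maximum is subtracted before exponentiating for numerical stability.
    """
    scores = [instance_score(f, weights) for f in instances]
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))


# Two hypothetical instances of the same source-target phrase pair, each with
# its own feature values relative to the current input.
instances = [
    {"context_overlap": 1.0, "alignment_quality": 0.5, "length_mismatch": 0.0},
    {"context_overlap": 0.0, "alignment_quality": 0.5, "length_mismatch": 2.0},
]

scores = [instance_score(f) for f in instances]
total = combined_score(instances)
```

Because each instance carries its own feature vector, an instance whose context resembles the input contributes more to the total than a poorly matching one, and the weights can be tuned over the whole feature space.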
Phillips, A.B. Cunei: open-source machine translation with relevance-based models of each translation instance. Machine Translation 25, 161–177 (2011). https://doi.org/10.1007/s10590-011-9109-6