Abstract
This article describes the design and implementation of Jane, an efficient hierarchical phrase-based (HPB) machine translation toolkit developed at RWTH Aachen University. RWTH has used the system in several international evaluation campaigns, including the WMT and NIST evaluations, and it is now freely available for non-commercial use. We cover the main features of Jane, which include support for different search strategies and language model formats, syntax-based enhancements to the HPB machine translation paradigm, string-to-dependency translation, extended lexicon models, different methods for minimum-error-rate training, and distributed operation on a computer cluster. Special attention has been paid to decoder efficiency, clean code, and quality assurance through unit and regression testing. Results reported on current machine translation tasks show that the system obtains state-of-the-art performance.
Cite this article
Vilar, D., Stein, D., Huck, M. et al. Jane: an advanced freely available hierarchical machine translation toolkit. Machine Translation 26, 197–216 (2012). https://doi.org/10.1007/s10590-011-9120-y
DOI: https://doi.org/10.1007/s10590-011-9120-y