Abstract
This article presents several techniques for integrating information from a rule-based machine translation (RBMT) system into a statistical machine translation (SMT) framework. The techniques are grouped into three parts according to the level at which information is integrated: morphological, lexical, and system. The first part presents techniques that use information from a rule-based morphological tagger to split the Arabic source text into morphemes, and compares the results with those obtained using a statistical morphological tagger. The second part presents two ways of using Arabic diacritics to improve SMT results, both based on binary decision trees. The third part presents a system combination method that combines the outputs of the RBMT and SMT systems, leveraging the strengths of each. Together, these techniques show how language-specific information obtained through a deterministic rule-based process can be used to improve SMT, which is mostly language-independent.
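To make the morphological-level idea concrete, the following is a minimal sketch of source-side morpheme splitting as an SMT preprocessing step. It is not the authors' implementation: the lookup table `TOY_ANALYSES`, the function `split_morphemes`, and the `+` clitic-marking convention are all assumptions made for illustration, with a small hand-written dictionary standing in for the output of a rule-based morphological tagger. Tokens are written in Buckwalter transliteration.

```python
# Hypothetical stand-in for a rule-based morphological tagger's output.
# Tokens are in Buckwalter transliteration; "+" marks the side on which
# a clitic was attached to its stem (an assumed convention).
TOY_ANALYSES = {
    "wktAbh": ["w+", "ktAb", "+h"],   # "and his book"
    "bAlqlm": ["b+", "Al+", "qlm"],   # "with the pen"
}


def split_morphemes(sentence, analyses=TOY_ANALYSES):
    """Replace each known token with its morphemes; keep unknown tokens as-is."""
    out = []
    for token in sentence.split():
        out.extend(analyses.get(token, [token]))
    return " ".join(out)


if __name__ == "__main__":
    # The segmented text would then be passed to SMT training and decoding.
    print(split_morphemes("wktAbh bAlqlm jdyd"))
```

Splitting clitics from stems in this way reduces the number of distinct source word forms the statistical system has to model; which morphemes to split is a design decision that a sketch like this leaves open.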
Cite this article
Zbib, R., Kayser, M., Matsoukas, S. et al. Methods for integrating rule-based and statistical systems for Arabic to English machine translation. Machine Translation 26, 67–83 (2012). https://doi.org/10.1007/s10590-011-9106-9
DOI: https://doi.org/10.1007/s10590-011-9106-9