We present a phrase-based statistical machine translation (SMT) approach that applies linguistic analysis in the preprocessing phase. The analysis comprises morphological transformation and syntactic transformation. Because syntactic transformation resolves the word-order problem before decoding, no reordering is performed in the decoding phase. For morphological transformation, we use hand-crafted transformational rules. For syntactic transformation, we propose a transformational model based on a probabilistic context-free grammar, trained using a bilingual corpus and a broad-coverage parser of the source language. The approach is applicable to language pairs in which the target language is poor in resources. We evaluated translation from English to Vietnamese and from English to French. Our experiments showed significant BLEU-score improvements over Pharaoh, a state-of-the-art phrase-based SMT system.
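To make the idea of syntactic transformation as a preprocessing step concrete, the following is a minimal sketch, not the authors' implementation. It assumes a source-side parse tree is already available and that a table of child-reordering rules with probabilities has been learned. The `RULES` table, its probabilities, the tree encoding, and the function names are all hypothetical and chosen only for illustration. The sketch also picks the single most probable permutation at each node, which is a simplification.

```python
from typing import Dict, List, Tuple

# A tree is either a leaf (tag, word) or an internal node (label, [children]).

# Hypothetical reordering table: (parent label, child labels) -> candidate
# permutations of the children with illustrative probabilities. In a trained
# model these would be estimated from parsed, word-aligned bilingual data.
RULES: Dict[Tuple[str, Tuple[str, ...]], List[Tuple[Tuple[int, ...], float]]] = {
    ("NP", ("JJ", "NN")): [((1, 0), 0.9), ((0, 1), 0.1)],
}


def is_leaf(tree) -> bool:
    """A leaf stores its word as a string in position 1."""
    return isinstance(tree[1], str)


def transform(tree):
    """Recursively reorder children using the most probable permutation."""
    if is_leaf(tree):
        return tree
    label, children = tree
    children = [transform(child) for child in children]
    key = (label, tuple(child[0] for child in children))
    candidates = RULES.get(key)
    if candidates:
        best_perm, _ = max(candidates, key=lambda pc: pc[1])
        children = [children[i] for i in best_perm]
    return (label, children)


def words(tree) -> List[str]:
    """Read the words off the tree from left to right."""
    if is_leaf(tree):
        return [tree[1]]
    return [w for child in tree[1] for w in words(child)]


sentence = ("S", [
    ("NP", [("PRP", "I")]),
    ("VP", [("VBP", "like"), ("NP", [("JJ", "red"), ("NN", "apples")])]),
])
reordered = " ".join(words(transform(sentence)))
print(reordered)  # I like apples red
```

The reordered English string places the adjective after the noun, as Vietnamese and French typically do, so a decoder run on such input can translate monotonically without its own reordering step.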
Cite this article
Nguyen, T.P., Shimazu, A. Improving phrase-based statistical machine translation with morphosyntactic transformation. Machine Translation 20, 147–166 (2006). https://doi.org/10.1007/s10590-007-9022-1
DOI: https://doi.org/10.1007/s10590-007-9022-1