Improving phrase-based statistical machine translation with morphosyntactic transformation

  • Original Paper
Machine Translation

Abstract

We present a phrase-based statistical machine translation approach which uses linguistic analysis in the preprocessing phase. The linguistic analysis includes morphological transformation and syntactic transformation. Since the word-order problem is solved using syntactic transformation, there is no reordering in the decoding phase. For morphological transformation, we use hand-crafted transformational rules. For syntactic transformation, we propose a transformational model based on a probabilistic context-free grammar. This model is trained using a bilingual corpus and a broad-coverage parser of the source language. This approach is applicable to language pairs in which the target language is poor in resources. We considered translation from English to Vietnamese and from English to French. Our experiments showed significant BLEU-score improvements in comparison with Pharaoh, a state-of-the-art phrase-based SMT system.
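The idea of resolving word order before decoding can be illustrated with a minimal sketch. It is not the authors' model: the `Node` class, the `RULES` table, and its probabilities are hypothetical, chosen only to show how a source-side parse tree might be reordered by picking the most probable permutation for each context-free production, so that a decoder could then translate monotonically.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A parse-tree node; `word` is set only on leaves (hypothetical structure)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None


# Hypothetical rule table: (parent label, child labels) -> [(permutation, probability)].
# In a trained system these probabilities would be estimated from a parsed,
# word-aligned bilingual corpus; the numbers here are invented for illustration.
RULES = {
    ("NP", ("DT", "JJ", "NN")): [((0, 2, 1), 0.9), ((0, 1, 2), 0.1)],
}


def reorder(node: Node) -> Node:
    """Return a copy of the tree with each node's children permuted by the
    most probable rule for its production (unchanged if no rule applies)."""
    if node.word is not None:
        return node
    kids = [reorder(child) for child in node.children]
    key = (node.label, tuple(child.label for child in kids))
    options = RULES.get(key)
    if options:
        best_perm, _ = max(options, key=lambda option: option[1])
        kids = [kids[i] for i in best_perm]
    return Node(node.label, kids)


def words(node: Node) -> List[str]:
    """Read the leaf words off the tree from left to right."""
    if node.word is not None:
        return [node.word]
    return [w for child in node.children for w in words(child)]


tree = Node("S", [
    Node("NP", [Node("PRP", word="I")]),
    Node("VP", [
        Node("VBD", word="bought"),
        Node("NP", [
            Node("DT", word="a"),
            Node("JJ", word="red"),
            Node("NN", word="book"),
        ]),
    ]),
])

print(" ".join(words(reorder(tree))))
```

In this toy example the adjective is moved after the noun, as a noun-adjective target language such as Vietnamese would require. Choosing the best permutation independently at each node is a simplification made for this sketch.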

References

  • Al-Onaizan Y, Curin J, Jahr M, Knight K, Lafferty J, Melamed D, Och F-J, Purdy D, Smith NA, Yarowsky D (1999) Statistical machine translation. Final Report, JHU Summer Workshop 1999, Johns Hopkins University, Baltimore, MD

  • Bikel DM (2004) Intricacies of Collins’ parsing model. Comput Ling 30: 479–511

  • Brown PF, Della Pietra SA, Della Pietra VJ, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Ling 19: 263–311

  • Charniak E (2000) A maximum-entropy-inspired parser. In: 1st Meeting of the North American Chapter of the Association for Computational Linguistics, Seattle, Washington, pp 132–139

  • Charniak E, Knight K, Yamada K (2003) Syntax-based language models for statistical machine translation. In: Summit MT IX: Proceedings of the Ninth Machine Translation Summit, New Orleans, USA, pp 40–46

  • Chiang D (2005) A hierarchical phrase-based model for statistical machine translation. In: 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Michigan, pp 263–270

  • Collins M (1999) Head-driven statistical models for natural language parsing. PhD thesis, University of Pennsylvania, Philadelphia, PA

  • Collins M, Koehn P, Kučerová I (2005) Clause restructuring for statistical machine translation. In: 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Michigan, pp 531–540

  • Ding Y, Palmer M (2005) Machine translation using probabilistic synchronous dependency insertion grammars. In: 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Michigan, pp 541–548

  • Fox H (2002) Phrasal cohesion and statistical machine translation. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, (EMNLP-02), Philadelphia PA, pp 304–311

  • Goldwater S, McClosky D (2005) Improving statistical MT through morphological analysis. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver BC, pp 676–683

  • Johnson M (2002) A simple pattern-matching algorithm for recovering empty nodes and their antecedents. In: 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, pp 136–143

  • Klein D, Manning CD (2003) Accurate unlexicalized parsing. In: 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, pp 423–430

  • Knight K, Graehl J (2005) An overview of probabilistic tree transducers for natural language processing. In: Gelbukh AF (ed) Computational linguistics and intelligent text processing, 6th International Conference, CICLing 2005, Mexico City, Mexico. Springer, Berlin, Germany, pp 1–24

  • Koehn P (2004) Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In: Frederking RE, Taylor KB (eds) Machine translation: From real users to research, 6th Conference of the Association for Machine Translation of the Americas, AMTA 2004, Washington DC. Springer, Berlin, pp 115–124

  • Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: HLT-NAACL: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, Alberta, Canada, pp 127–133

  • Lee Y-S (2004) Morphological analysis for statistical machine translation. In: HLT-NAACL 2004 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Boston, Massachusetts, Companion Volume: Short Papers, Student Research Workshop, Demonstrations, Tutorials Abstracts pp 57–60

  • Lehmann EL (1986) Testing statistical hypotheses. Springer, Berlin, Germany

  • Marcu D, Wong W (2002) A phrase-based, joint probability model for statistical machine translation. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, (EMNLP-02), Philadelphia PA, pp 133–139

  • Marcus MP, Santorini B, Marcinkiewicz MA (1993) Building a large annotated corpus of English: the Penn TreeBank. Comput Ling 19: 313–330

  • Melamed ID (2004) Statistical machine translation by parsing. In: 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, pp 653–660

  • Nguyen TP, Nguyen VV, Le AC (2003) Vietnamese word segmentation using Hidden Markov Model. In: Proceedings of International Workshop for Computer, Information, and Communication Technologies in Korea and Vietnam, Hanoi, Vietnam, pp 13–17

  • Nguyen TP, Shimazu A (2006) Improving phrase-based SMT with morpho-syntactic analysis and transformation. In: AMTA 2006, Proceedings of the 7th Conference of the Association for Machine Translation of the Americas: Visions for the future of machine translation, Cambridge, Massachusetts, pp 138–147

  • Niessen S, Ney H (2004) Statistical machine translation with scarce resources using morpho-syntactic information. Comput Ling 30: 181–204

  • Och FJ, Gildea D, Khudanpur S, Sarkar A, Yamada K, Fraser A, Kumar S, Shen L, Smith D, Eng K, Jain V, Jin Z, Radev D (2004) A smorgasbord of features for statistical machine translation. In: HLT-NAACL 2004 Human Language Technology Conference of the North American chapter of the Association for Computational Linguistics, Boston, Massachusetts, pp 161–168

  • Och FJ, Ney H (2000) Improved statistical alignment models. In: 38th Annual meeting of the association for computational linguistics, Hong Kong, China, pp 440–447

  • Och FJ, Ney H (2004) The alignment template approach to statistical machine translation. Comput Ling 30: 417–449

  • Papineni KA, Roukos S, Ward T, Zhu WJ (2001) Bleu: a method for automatic evaluation of machine translation. Technical report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY

  • Pham NH, Nguyen LM, Le AC, Nguyen PT, Nguyen VV (2003) LVT: An English–Vietnamese machine translation system. In: First National Conference on Fundamental and Applied Research in Information Technology FAIR-03, Hanoi, Vietnam, pp 173–180

  • Quirk C, Menezes A, Cherry C (2005) Dependency treelet translation: Syntactically informed phrasal SMT. In: 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Michigan, pp 271–279

  • Shen L, Sarkar A, Och FJ (2004) Discriminative reranking for machine translation. In: HLT-NAACL 2004 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Boston, Massachusetts, pp 177–184

  • Stolcke A (2002) SRILM–An extensible language modeling toolkit. In: International Conference on Spoken Language Processing, Denver, Colorado, pp 901–904

  • Xia F, McCord M (2004) Improving a statistical MT system with automatically learned rewrite patterns. In: 20th International Conference on Computational Linguistics, Geneva, Switzerland, pp 508–514

  • Yamada K, Knight K (2001) A syntax-based statistical translation model. In: Association for Computational Linguistics 39th Annual Meeting and 10th Conference of the European Chapter, Toulouse, France, pp 523–529

Author information

Corresponding author

Correspondence to Thai Phuong Nguyen.

About this article

Cite this article

Nguyen, T.P., Shimazu, A. Improving phrase-based statistical machine translation with morphosyntactic transformation. Machine Translation 20, 147–166 (2006). https://doi.org/10.1007/s10590-007-9022-1

  • DOI: https://doi.org/10.1007/s10590-007-9022-1
