Automatically generated parallel treebanks and their exploitability in machine translation

Machine Translation

Abstract

Recent discussion and a shift in the field's focus suggest that incorporating syntax is the way forward for improving on the current state of the art in machine translation (MT). Parallel treebanks are a relatively recent innovation and appear to be ideal candidates for MT training material. Until recently, however, there was no means of building them other than by hand. In this paper, we describe how we use new tools to automatically build a large parallel treebank and extract a set of linguistically motivated phrase pairs from it. We show that adding these phrase pairs to the translation model of a baseline phrase-based statistical MT (PB-SMT) system leads to significant improvements in translation quality. We then describe experiments in which we exploit the information encoded in the parallel treebank in other areas of the PB-SMT framework, and investigate the conditions under which incorporating parallel treebank data performs best. Finally, we discuss the possibility of exploiting automatically generated parallel treebanks further in syntax-aware paradigms of MT.

Author information

Corresponding author

Correspondence to John Tinsley.

About this article

Cite this article

Tinsley, J., Way, A. Automatically generated parallel treebanks and their exploitability in machine translation. Machine Translation 23, 1–22 (2009). https://doi.org/10.1007/s10590-009-9068-3

  • DOI: https://doi.org/10.1007/s10590-009-9068-3
