Abstract
Recent discussion and a shift in the field's focus suggest that incorporating syntax is the way forward for improving the state of the art in machine translation (MT). Parallel treebanks are a relatively recent innovation and appear to be ideal candidates for MT training material. Until recently, however, they could only be built by hand. In this paper, we describe how we use new tools to automatically build a large parallel treebank and extract a set of linguistically motivated phrase pairs from it. We show that adding these phrase pairs to the translation model of a baseline phrase-based statistical MT (PB-SMT) system leads to significant improvements in translation quality. We then describe experiments in which we exploit the information encoded in the parallel treebank in other areas of the PB-SMT framework, and investigate the conditions under which incorporating parallel treebank data performs best. Finally, we discuss the possibility of further exploiting automatically generated parallel treebanks in syntax-aware paradigms of MT.
Tinsley, J., Way, A. Automatically generated parallel treebanks and their exploitability in machine translation. Machine Translation 23, 1–22 (2009). https://doi.org/10.1007/s10590-009-9068-3