
Extracting parallel phrases from comparable data for machine translation

Published online by Cambridge University Press: 15 June 2016

SANJIKA HEWAVITHARANA
Affiliation:
Raytheon BBN Technologies, Cambridge, MA 02138, USA email: shewavit@bbn.com
STEPHAN VOGEL
Affiliation:
Qatar Computing Research Institute, Doha, Qatar email: svogel@qf.org.qa

Abstract

Mining parallel data from comparable corpora is a promising approach for overcoming data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach designed to align only the parallel sections of a sentence pair, bypassing the non-parallel sections. We compare the proposed approach with two other alignment methods: (1) the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment, and (2) a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the accuracy of these approaches using a manually aligned data set, and show that the proposed approach outperforms the other two. Finally, we demonstrate the effectiveness of the extracted phrase pairs by using them in Arabic–English and Urdu–English translation systems, which resulted in improvements of up to 1.2 BLEU points over the baseline. The main contributions of this paper are two-fold: (1) novel phrase alignment algorithms to extract parallel phrase pairs from comparable sentences, and (2) an evaluation of the utility of the extracted phrases by using them directly in the MT decoder.
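
For readers unfamiliar with the first baseline named in the abstract, the following is a minimal sketch of consistency-based phrase extraction from a word alignment. It is not the authors' implementation and does not show the proposed alignment approach. The function name `extract_phrase_pairs`, the `max_len` parameter, and the toy tokens are assumptions made for illustration, and the sketch omits the usual extension of phrase boundaries over unaligned words.

```python
from typing import List, Set, Tuple

# A word alignment as a set of (source index, target index) links.
# In the standard pipeline these links would come from the Viterbi
# path of a word alignment model; here they are supplied by hand.
Alignment = Set[Tuple[int, int]]


def extract_phrase_pairs(
    src: List[str],
    tgt: List[str],
    alignment: Alignment,
    max_len: int = 4,  # hypothetical limit on phrase length, in words
) -> List[Tuple[str, str]]:
    """Return phrase pairs that are consistent with the alignment.

    A pair is consistent when no word inside either span is linked
    to a word outside the other span.
    """
    pairs = []
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue  # spans with no links are skipped in this sketch
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Every link that lands in the target span must originate
            # inside the source span; otherwise the pair is rejected.
            consistent = all(
                s_start <= s <= s_end
                for (s, t) in alignment
                if t_start <= t <= t_end
            )
            if consistent:
                pairs.append(
                    (
                        " ".join(src[s_start : s_end + 1]),
                        " ".join(tgt[t_start : t_end + 1]),
                    )
                )
    return pairs


# Toy example with invented tokens: "b" and "c" swap order in the target.
toy_src = ["a", "b", "c"]
toy_tgt = ["x", "y", "z"]
toy_alignment = {(0, 0), (1, 2), (2, 1)}
toy_pairs = extract_phrase_pairs(toy_src, toy_tgt, toy_alignment)
```

In the toy example the source span "a b" is rejected, because its minimal target span "x y z" contains "y", which is linked to "c" outside the source span. A method of this kind depends on every link in the word alignment, which is why it is sensitive to the non-parallel sections of comparable sentence pairs.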

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 

Footnotes

Part of this work was conducted when the authors were affiliated to the Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
