Abstract
Statistical machine translation (SMT) provides a convenient framework for modeling the translation process. Translations of words and phrases are generally estimated from their occurrences in a bilingual training corpus. However, SMT still suffers from out-of-vocabulary (OOV) and low-frequency words, especially when only limited training data are available or when the training and test data come from different domains. In this paper, we propose a convenient way to handle OOV and rare words using a paraphrasing technique. We first extract paraphrases from the bilingual training corpus with the help of comparable corpora. The extracted paraphrases are then analyzed by conditionally checking the association of their monolingual distributions. The bilingually aligned paraphrases are incorporated as additional training data into the phrase-based SMT (PB-SMT) system. Integrating the paraphrases into the PB-SMT system yields a significant improvement.
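The abstract does not say which extraction algorithm is used, so the following is only an illustrative sketch of one widely used way to obtain paraphrases from a bilingual corpus: pivoting through the other language, where two source phrases that share a translation are treated as paraphrase candidates and scored by p(e2 | e1) = Σ_f p(f | e1) · p(e2 | f). The function name, the toy phrase pairs, and the placeholder pivot phrases `F_A` and `F_B` are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def pivot_paraphrases(phrase_pairs):
    """Score paraphrase candidates by pivoting through a second language.

    phrase_pairs: iterable of (source_phrase, pivot_phrase, count) tuples,
    e.g. aligned phrase pairs counted over a bilingual corpus.
    Returns {e1: {e2: p(e2 | e1)}} with e2 != e1.
    """
    count = defaultdict(float)      # joint counts c(e, f)
    src_total = defaultdict(float)  # marginal counts c(e)
    piv_total = defaultdict(float)  # marginal counts c(f)
    for src, piv, c in phrase_pairs:
        count[(src, piv)] += c
        src_total[src] += c
        piv_total[piv] += c

    # Index the source phrases seen with each pivot phrase.
    by_pivot = defaultdict(list)
    for src, piv in count:
        by_pivot[piv].append(src)

    scores = defaultdict(lambda: defaultdict(float))
    for (e1, piv), c in count.items():
        p_piv_given_e1 = c / src_total[e1]
        for e2 in by_pivot[piv]:
            if e2 == e1:
                continue  # a phrase is not its own paraphrase
            p_e2_given_piv = count[(e2, piv)] / piv_total[piv]
            scores[e1][e2] += p_piv_given_e1 * p_e2_given_piv
    return {e1: dict(cands) for e1, cands in scores.items()}


# Hypothetical toy data: two source phrases share the pivot phrase "F_A".
toy_pairs = [
    ("under control", "F_A", 3),
    ("in check", "F_A", 1),
    ("under control", "F_B", 1),
]
table = pivot_paraphrases(toy_pairs)
```

On this toy input, "in check" is proposed as a paraphrase of "under control" and vice versa. In a pipeline like the one described above, such candidates would still need the monolingual association check before being added as extra training data.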
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pal, S., Lohar, P., Naskar, S.K. (2014). Role of Paraphrases in PB-SMT. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_21
DOI: https://doi.org/10.1007/978-3-642-54903-8_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54902-1
Online ISBN: 978-3-642-54903-8
eBook Packages: Computer Science, Computer Science (R0)