Role of Paraphrases in PB-SMT

Pal, Santanu; Lohar, Pintu; Naskar, Sudip Kumar

doi:10.1007/978-3-642-54903-8_21

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8404)

Abstract

Statistical machine translation (SMT) provides a convenient framework for modeling the translation process. The translations of words or phrases are generally estimated from their occurrences in a bilingual training corpus. However, SMT still suffers from out-of-vocabulary (OOV) words and low-frequency words, especially when only limited training data are available or when the training and test data come from different domains. In this paper, we propose a convenient way to handle OOV and rare words using a paraphrasing technique. We first extract paraphrases from the bilingual training corpus with the help of comparable corpora. The extracted paraphrases are then analyzed by conditionally checking the association of their monolingual distribution. The bilingually aligned paraphrases are incorporated as additional training data into the phrase-based SMT (PB-SMT) system. Integrating paraphrases into the PB-SMT system results in a significant improvement.
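
The abstract does not spell out how paraphrases are extracted, so the following is only a minimal sketch of one widely used option, pivoting through a bilingual phrase table: two source phrases that share a translation are treated as paraphrase candidates and scored as p(e2 | e1) = Σ_f p(f | e1) · p(e2 | f). The function name `pivot_paraphrases`, the toy counts, and the placeholder pivot phrases `f1` and `f2` are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def pivot_paraphrases(phrase_pairs):
    """Score paraphrase candidates by pivoting through shared translations.

    phrase_pairs: iterable of (source_phrase, pivot_phrase, count) tuples,
    e.g. counts read off an aligned bilingual corpus (assumed input format).
    Returns {e1: {e2: p(e2 | e1)}} for every pair e1 != e2 sharing a pivot.
    """
    src_to_pivot = defaultdict(lambda: defaultdict(float))
    pivot_to_src = defaultdict(lambda: defaultdict(float))
    for src, pivot, count in phrase_pairs:
        src_to_pivot[src][pivot] += count
        pivot_to_src[pivot][src] += count

    scores = defaultdict(lambda: defaultdict(float))
    for e1, pivots in src_to_pivot.items():
        total_e1 = sum(pivots.values())
        for f, count_e1_f in pivots.items():
            p_f_given_e1 = count_e1_f / total_e1      # p(f | e1)
            total_f = sum(pivot_to_src[f].values())
            for e2, count_f_e2 in pivot_to_src[f].items():
                if e2 == e1:
                    continue                          # skip the phrase itself
                # accumulate p(f | e1) * p(e2 | f) over all pivots f
                scores[e1][e2] += p_f_given_e1 * (count_f_e2 / total_f)

    return {e1: dict(alts) for e1, alts in scores.items()}


# Toy phrase-pair counts; "f1" and "f2" stand in for target-language phrases.
toy_pairs = [
    ("under control", "f1", 3),
    ("in check", "f1", 1),
    ("in check", "f2", 2),
]
scores = pivot_paraphrases(toy_pairs)
for phrase, alternatives in scores.items():
    print(phrase, alternatives)
```

In a setup like the one the abstract describes, candidates scored this way would still need the monolingual association check before being added to the training data; that filtering step is not shown here.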



Author information

Authors and Affiliations

Universität des Saarlandes, Saarbrücken, Germany
Santanu Pal
Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
Pintu Lohar & Sudip Kumar Naskar

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Av. Juan Dios Bátiz, Col. Nueva Industrial Vallejo, 07738, Mexico D.F, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Pal, S., Lohar, P., Naskar, S.K. (2014). Role of Paraphrases in PB-SMT. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_21

DOI: https://doi.org/10.1007/978-3-642-54903-8_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54902-1
Online ISBN: 978-3-642-54903-8
eBook Packages: Computer Science, Computer Science (R0)