Extracting lexical and phrasal paraphrases: a review of the literature

Artificial Intelligence Review

Abstract

Recent advances in natural language processing have increased the popularity of paraphrase extraction. Most of the attention, however, has focused on extraction methods alone, without taking the resources they draw on into consideration. Yet the two are closely related, and the choice of resource plays an equally important role in paraphrase extraction. In addition, almost all previous studies have focused on corpus-based methods, which extract paraphrases from corpora based solely on syntactic similarity. Despite the popularity of corpus-based methods, a considerable body of research has consistently shown that they are vulnerable to several types of erroneous paraphrases. For these reasons, it is necessary to evaluate whether the trend is moving in a positive direction. This paper reviews the major research on paraphrase extraction methods in detail. It begins by exploring the definition of paraphrase from different perspectives to provide a better understanding of the concept of paraphrase extraction. It then examines the characteristics and potential uses of different types of paraphrase resources. After that, it divides paraphrase extraction methods into four main categories (heuristic-based, knowledge-based, corpus-based and hybrid-based) and summarizes their strengths and weaknesses. The paper concludes with open research issues for future work.

Abbreviations

BPC: Bilingual parallel corpora

CC: Comparable corpora

CCG: Complex syntactic constraints

FC: Free corpora

FSA: Finite state automata

HTML: Hypertext markup language

IE: Information extraction

IR: Information retrieval

LP: Lexical paraphrases

LR: Lexical resources

MPC: Monolingual parallel corpora

NE: Named entity

NLP: Natural language processing

PMI: Point-wise mutual information

POS: Part of speech

PP: Phrasal paraphrases

Q&A: Question and answering

SUBJ-OBJ: Is the subject of and is the object of

T1: Type 1

T2: Type 2

T3: Type 3

T4: Type 4

T5: Type 5

TF-IDF: Term frequency & inverse document frequency

WWW: World wide web

Author information

Correspondence to ChukFong Ho.

Cite this article

Ho, C., Azmi Murad, M.A., Doraisamy, S. et al. Extracting lexical and phrasal paraphrases: a review of the literature. Artif Intell Rev 42, 851–894 (2014). https://doi.org/10.1007/s10462-012-9357-8

  • DOI: https://doi.org/10.1007/s10462-012-9357-8
