Abstract
Recent advances in natural language processing have increased the popularity of paraphrase extraction. Most of the attention, however, has focused on extraction methods alone, without taking the resource factor into consideration. The two are in fact strongly related, and the resource factor plays an equally important role in paraphrase extraction. In addition, almost all previous studies have focused on corpus-based methods, which extract paraphrases from corpora based solely on syntactic similarity. Despite the popularity of corpus-based methods, a considerable amount of research has consistently shown that they are vulnerable to several types of erroneous paraphrases. For these reasons, it is necessary to evaluate whether the trend is moving in a positive direction. This paper reviews the major research on paraphrase extraction methods in detail. It begins by exploring the definition of paraphrase from different perspectives to provide a better understanding of the concept of paraphrase extraction. It then studies the characteristics and potential uses of different types of paraphrase resources. After that, it divides paraphrase extraction methods into four main categories (heuristic-based, knowledge-based, corpus-based, and hybrid-based) and summarizes their strengths and weaknesses. The paper concludes with some open research issues for future work.
Abbreviations
- BPC: Bilingual parallel corpora
- CC: Comparable corpora
- CCG: Complex syntactic constraints
- FC: Free corpora
- FSA: Finite state automata
- HTML: Hypertext markup language
- IE: Information extraction
- IR: Information retrieval
- LP: Lexical paraphrases
- LR: Lexical resources
- MPC: Monolingual parallel corpora
- NE: Named entity
- NLP: Natural language processing
- PMI: Point-wise mutual information
- POS: Part of speech
- PP: Phrasal paraphrases
- Q&A: Question and answering
- SUBJ-OBJ: Is the subject of and is the object of
- T1: Type 1
- T2: Type 2
- T3: Type 3
- T4: Type 4
- T5: Type 5
- TF-IDF: Term frequency & inverse document frequency
- WWW: World wide web
Cite this article
Ho, C., Azmi Murad, M.A., Doraisamy, S. et al. Extracting lexical and phrasal paraphrases: a review of the literature. Artif Intell Rev 42, 851–894 (2014). https://doi.org/10.1007/s10462-012-9357-8
DOI: https://doi.org/10.1007/s10462-012-9357-8