Paraphrase Detection in Monolingual Specialized/Lay Comparable Corpora

Building and Using Comparable Corpora

Abstract

Paraphrases are a key feature in many natural language processing applications, and their extraction and generation are important tasks to tackle. Given two comparable corpora in the same language and the same domain, but displaying two different discourse types (lay and specialized), specific paraphrases can be spotted which provide a dimension along which these discourse types can be contrasted. Detecting such paraphrases in comparable corpora is the goal of the present work. Generally, paraphrases are identified by means of lexical and/or structural patterns. In this chapter, we present two methods to extract paraphrases across lay and specialized French monolingual comparable corpora. The first method uses lexical patterns designed according to intuition and linguistic studies, while the second is empirical, based on n-gram matching. The two methods appear to be complementary: the n-gram method confirms the initial lexical patterns and identifies other patterns. Besides, differences in the direction of application of paraphrase patterns highlight differences between specialized and lay discourse.
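To make the second, n-gram-based idea more concrete, here is a minimal sketch of how n-grams from a specialized corpus and a lay corpus might be matched once words are reduced to stems. It is not the chapter's implementation: the stop list, the suffix-stripping `toy_stem` function, the function names and the example sentences are all hypothetical, chosen only to show how a nominal form and a verbal form can collapse onto the same stem key.

```python
from collections import defaultdict

# Hypothetical toy resources, for illustration only (not from the chapter).
STOPWORDS = {"le", "la", "les", "du", "de", "des", "un", "une"}
SUFFIXES = ("ement", "er")


def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real French stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def content_ngrams(tokens, n):
    """Yield (stem_key, surface) for each n-gram of non-stopword tokens."""
    content = [t.lower() for t in tokens if t.lower() not in STOPWORDS]
    for i in range(len(content) - n + 1):
        gram = content[i:i + n]
        # An unordered set of stems, so reordered variants share one key.
        key = frozenset(toy_stem(w) for w in gram)
        yield key, " ".join(gram)


def candidate_paraphrases(specialized, lay, n=2):
    """Pair n-grams that share all stems but differ in surface form."""
    index = defaultdict(set)
    for sentence in specialized:
        for key, surface in content_ngrams(sentence.split(), n):
            index[key].add(surface)
    pairs = set()
    for sentence in lay:
        for key, surface in content_ngrams(sentence.split(), n):
            for spec_surface in index.get(key, ()):
                if spec_surface != surface:
                    pairs.add((spec_surface, surface))
    return sorted(pairs)


specialized_corpus = ["le traitement du diabète"]
lay_corpus = ["traiter le diabète"]
print(candidate_paraphrases(specialized_corpus, lay_corpus))
```

In this toy setting the surface strings contain content words only, since stopwords are dropped before n-grams are built. A realistic pipeline would rely on a proper tagger, lemmatizer and stemmer rather than the two-suffix rule used here.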

Notes

  1. http://www.sor-cancer.fr/

  2. http://www.cismef.org/

  3. http://www.hon.ch/

  4. http://www.has-sante.fr/

  5. http://www.inpes.sante.fr/

  6. http://www.doctissimo.fr/

  7. http://www.tabac-info-service.fr/

  8. http://www.stop-tabac.ch/

  9. http://www.diabete.qc.ca/

  10. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger

  11. http://www.univ-nancy2.fr/pers/namer/Telecharger_Flemm.htm

  12. http://search.cpan.org/~snowhare/Lingua-Stem-0.83

  13. The first two authors of this chapter.

  14. Examples of occurrences of each pattern are provided in Table 10, Appendix.

  15. Examples of occurrences of each pattern are provided in Table 11, Appendix.

Author information

Corresponding author

Correspondence to Louise Deléger.

Appendix

This appendix contains example patterns: Table 10 provides examples of the patterns presented in Table 4, while Table 11 shows examples of the patterns presented in Table 7.

Table 10 Examples for the top 20 most frequent patterns presented in Table 4
Table 11 Examples for the unidirectional patterns displayed in Table 7

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Deléger, L., Cartoni, B., Zweigenbaum, P. (2013). Paraphrase Detection in Monolingual Specialized/Lay Comparable Corpora. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_12

  • DOI: https://doi.org/10.1007/978-3-642-20128-8_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20127-1

  • Online ISBN: 978-3-642-20128-8

  • eBook Packages: Computer Science (R0)
