Distributional Context Generalisation and Normalisation as a Mean to Reduce Data Sparsity: Evaluation of Medical Corpora

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8686)

Abstract

Distributional analysis relies on the recurrence of information in the contexts of the words to be associated. However, the vector space models that implement this approach suffer from data sparsity and from a high-dimensional context matrix. While reducing data sparsity matters for general corpora, it is also a major issue for specialised corpora, which are much smaller and have much lower context frequencies. We tackle this problem on specialised texts and propose a method that increases matrix density by normalising and generalising distributional contexts with synonymy and hypernymy relations acquired from corpora. Experiments on a French biomedical corpus show that context generalisation and normalisation improve the results when combined with relations acquired through lexico-syntactic patterns.
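The idea of densifying the context matrix can be illustrated with a minimal sketch. It is not the authors' implementation: the toy word pairs, the `synonyms` and `hypernyms` tables, and the helper names `build_matrix` and `density` are all hypothetical, and the relations are written by hand here rather than acquired from a corpus. The sketch only shows how rewriting contexts to a synonym representative (normalisation) or to a hypernym (generalisation) merges columns and raises the share of non-zero cells.

```python
from collections import Counter, defaultdict


def build_matrix(pairs, context_map=None):
    """Count (target, context) co-occurrences, optionally rewriting contexts."""
    context_map = context_map or {}
    matrix = defaultdict(Counter)
    for target, context in pairs:
        # Replace the context by its mapped form when one is known.
        matrix[target][context_map.get(context, context)] += 1
    return matrix


def density(matrix):
    """Fraction of non-zero cells in the target x context matrix."""
    contexts = {c for row in matrix.values() for c in row}
    cells = len(matrix) * len(contexts)
    filled = sum(len(row) for row in matrix.values())
    return filled / cells if cells else 0.0


# Toy (target, context) pairs; every word and relation here is invented.
pairs = [
    ("aspirin", "headache"), ("aspirin", "migraine"),
    ("ibuprofen", "cephalalgia"), ("ibuprofen", "fever"),
    ("paracetamol", "pyrexia"), ("paracetamol", "migraine"),
]

# Normalisation (assumed): map each synonym to one representative form.
synonyms = {"cephalalgia": "headache", "pyrexia": "fever"}
# Generalisation (assumed): map a context to its hypernym.
hypernyms = {"migraine": "headache"}

raw = build_matrix(pairs)
dense = build_matrix(pairs, {**synonyms, **hypernyms})

print(f"raw density:   {density(raw):.3f}")
print(f"dense density: {density(dense):.3f}")
```

On this toy data the five original context columns collapse into two, so the same six observations fill a much smaller matrix. Whether such merging helps on real data depends on the quality of the acquired relations, which is what the paper evaluates.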

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Périnet, A., Hamon, T. (2014). Distributional Context Generalisation and Normalisation as a Mean to Reduce Data Sparsity: Evaluation of Medical Corpora. In: Przepiórkowski, A., Ogrodniczuk, M. (eds) Advances in Natural Language Processing. NLP 2014. Lecture Notes in Computer Science, vol 8686. Springer, Cham. https://doi.org/10.1007/978-3-319-10888-9_13

  • DOI: https://doi.org/10.1007/978-3-319-10888-9_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10887-2

  • Online ISBN: 978-3-319-10888-9

  • eBook Packages: Computer Science, Computer Science (R0)
