Distributional Context Generalisation and Normalisation as a Mean to Reduce Data Sparsity: Evaluation of Medical Corpora

Périnet, Amandine; Hamon, Thierry

doi:10.1007/978-3-319-10888-9_13

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8686)

Abstract

Distributional analysis associates words on the basis of information that recurs in their contexts. However, the vector space models that implement this approach suffer from data sparsity and from a high-dimensional context matrix. Reducing data sparsity matters for general corpora, and it is also a major issue for specialised corpora, which are much smaller and have much lower context frequencies. We tackle this problem on specialised texts and propose a method that increases the matrix density by normalising and generalising distributional contexts with synonymy and hypernymy relations acquired from corpora. Experiments on a French biomedical corpus show that context generalisation and normalisation improve the results when combined with relations acquired through lexico-syntactic patterns.
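The idea of raising matrix density by rewriting contexts can be illustrated with a minimal sketch. It is not the authors' implementation: the toy word pairs, the hand-written `synonyms` and `hypernyms` dictionaries (the paper acquires such relations from corpora), the helper names, and the choice to normalise before generalising are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_matrix(pairs):
    """Count (target, context) co-occurrences into a sparse dict of Counters."""
    matrix = defaultdict(Counter)
    for target, context in pairs:
        matrix[target][context] += 1
    return matrix

def rewrite_contexts(matrix, mapping):
    """Replace each context by its mapped form, merging counts that collide."""
    merged = defaultdict(Counter)
    for target, contexts in matrix.items():
        for context, count in contexts.items():
            merged[target][mapping.get(context, context)] += count
    return merged

def density(matrix):
    """Share of non-zero cells in the target x context matrix."""
    columns = {c for contexts in matrix.values() for c in contexts}
    filled = sum(len(contexts) for contexts in matrix.values())
    return filled / (len(matrix) * len(columns)) if columns else 0.0

# Toy (target, context) pairs; every word and relation below is invented.
pairs = [
    ("aspirin", "pain"), ("aspirin", "headache"),
    ("ibuprofen", "ache"), ("ibuprofen", "migraine"),
]
synonyms = {"ache": "pain"}            # normalisation: synonym -> representative
hypernyms = {"migraine": "headache"}   # generalisation: hyponym -> hypernym

raw = build_matrix(pairs)
normalised = rewrite_contexts(raw, synonyms)
generalised = rewrite_contexts(normalised, hypernyms)

for name, m in [("raw", raw), ("normalised", normalised), ("generalised", generalised)]:
    print(f"{name}: density = {density(m):.2f}")
```

In this toy case each rewriting step merges context columns, so the two targets end up sharing all their contexts and the matrix has fewer empty cells. Whether that also improves the quality of the acquired relations is an empirical question, which the paper evaluates on a French biomedical corpus.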

Author information

Authors and Affiliations

INSERM, U1142, LIMICS, Paris, France
Amandine Périnet
UPMC Univ Paris 06, Univ Paris 13, Sorbonne Paris Cité, Villetaneuse, France
Amandine Périnet
LIMSI-CNRS, Orsay, France
Thierry Hamon
Université Paris 13, Sorbonne Paris Cité, Villetaneuse, France
Thierry Hamon

Editor information

Editors and Affiliations

Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248, Warsaw, Poland
Adam Przepiórkowski & Maciej Ogrodniczuk

About this paper

Cite this paper

Périnet, A., Hamon, T. (2014). Distributional Context Generalisation and Normalisation as a Mean to Reduce Data Sparsity: Evaluation of Medical Corpora. In: Przepiórkowski, A., Ogrodniczuk, M. (eds) Advances in Natural Language Processing. NLP 2014. Lecture Notes in Computer Science, vol 8686. Springer, Cham. https://doi.org/10.1007/978-3-319-10888-9_13

DOI: https://doi.org/10.1007/978-3-319-10888-9_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10887-2
Online ISBN: 978-3-319-10888-9
eBook Packages: Computer Science, Computer Science (R0)
