Abstract
Distributional analysis relies on the recurrence of information in the contexts of words in order to associate them. However, the vector space models that implement this approach suffer from data sparsity and from a high-dimensional context matrix. While reducing data sparsity is important for general corpora, it is also a major issue for specialised corpora, which are much smaller and have much lower context frequencies. We tackle this problem on specialised texts and propose a method that increases matrix density by normalising and generalising distributional contexts with synonymy and hypernymy relations acquired from corpora. Experiments on a French biomedical corpus show that context generalisation and normalisation improve the results when combined with relations acquired with lexico-syntactic patterns.
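The idea of increasing matrix density by rewriting contexts can be illustrated with a minimal sketch. It is not the authors' implementation: the toy words, the `rewrite_contexts` and `density` helpers, and the choice to replace a context word by its synonym representative or its hypernym (rather than, say, adding the hypernym alongside it) are all assumptions made for illustration.

```python
from collections import Counter

def rewrite_contexts(contexts, mapping):
    """Replace each context word by mapping[word] when present, merging counts.

    `mapping` stands in for a set of acquired relations (hypothetical here):
    synonym -> representative for normalisation, hyponym -> hypernym for
    generalisation.
    """
    rewritten = {}
    for target, counts in contexts.items():
        merged = Counter()
        for word, freq in counts.items():
            merged[mapping.get(word, word)] += freq
        rewritten[target] = merged
    return rewritten

def density(contexts):
    """Share of non-zero cells in the target-by-context matrix."""
    columns = set()
    for counts in contexts.values():
        columns.update(counts)
    if not contexts or not columns:
        return 0.0
    filled = sum(len(counts) for counts in contexts.values())
    return filled / (len(contexts) * len(columns))

# Invented toy data: two target words and their context counts.
contexts = {
    "aspirin": Counter({"pain": 2, "headache": 1}),
    "ibuprofen": Counter({"ache": 1, "migraine": 1}),
}
synonyms = {"ache": "pain"}            # assumed synonymy relation
hypernyms = {"migraine": "headache"}   # assumed hypernymy relation

normalised = rewrite_contexts(contexts, synonyms)
generalised = rewrite_contexts(normalised, hypernyms)

for name, matrix in [("original", contexts),
                     ("normalised", normalised),
                     ("generalised", generalised)]:
    print(name, round(density(matrix), 3))
```

In this toy case the two targets start with no shared context; after both rewriting steps they share every context, so the matrix has fewer columns and a higher proportion of filled cells.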
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Périnet, A., Hamon, T. (2014). Distributional Context Generalisation and Normalisation as a Mean to Reduce Data Sparsity: Evaluation of Medical Corpora. In: Przepiórkowski, A., Ogrodniczuk, M. (eds) Advances in Natural Language Processing. NLP 2014. Lecture Notes in Computer Science, vol 8686. Springer, Cham. https://doi.org/10.1007/978-3-319-10888-9_13
DOI: https://doi.org/10.1007/978-3-319-10888-9_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10887-2
Online ISBN: 978-3-319-10888-9
eBook Packages: Computer Science (R0)