Text mining neuroscience journal articles to populate neuroscience databases

  • Original Article
  • Neuroinformatics

Abstract

We have developed a program, NeuroText, to populate the neuroscience databases in SenseLab (http://senselab.med.yale.edu/senselab) by mining the natural language text of neuroscience articles. NeuroText uses a two-step approach to identify relevant articles. The first step (pre-processing), aimed at 100% sensitivity, identifies abstracts containing database keywords. In the second step, the potentially relevant abstracts identified in the first step are processed for specificity, as dictated by the database architecture and by neuroscience, lexical, and semantic contexts. NeuroText results were presented to experts for validation through a dynamically generated interface that also allows expert-validated articles to be deposited automatically into the databases. Of the test set of 912 articles, 735 were rejected at the pre-processing step. For the remaining 177 articles, the accuracy of predicting database-relevant articles was 85%. Twenty-two articles were erroneously identified, and NeuroText deferred decisions on 29 articles to the expert. A comparison of NeuroText results with the experts’ analyses revealed that the program misjudged an article’s relevance either because the relevant concepts did not yet exist in the knowledgebase or because the abstract presented the information vaguely. NeuroText uses two “evolution” techniques (supervised and unsupervised) that play an important role in the continual improvement of the retrieval results. Software that uses the NeuroText approach can facilitate the creation of curated, special-interest bibliography databases.
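
The two-step approach described in the abstract can be illustrated with a minimal sketch. This is not the NeuroText implementation: the keyword set, the context terms, the scoring rule, and the thresholds below are all hypothetical stand-ins, chosen only to show how a high-sensitivity keyword filter can feed a stricter second pass that accepts an abstract, rejects it, or defers it to an expert.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword list; the real database vocabulary is not given in the abstract.
KEYWORDS = {"mitral cell", "sodium channel", "dendrite"}
# Hypothetical cues standing in for the lexical and semantic context checks.
CONTEXT_TERMS = {"current", "receptor", "neuron", "synaptic"}


@dataclass
class Decision:
    label: str     # "rejected", "relevant", "not relevant", or "deferred"
    matched: list  # database keywords found in step 1


def preprocess(abstract: str) -> list:
    """Step 1: high-sensitivity, case-insensitive substring match on keywords."""
    text = abstract.lower()
    return sorted(k for k in KEYWORDS if k in text)


def context_score(abstract: str) -> int:
    """Step 2 stand-in: count distinct context terms present as whole words."""
    words = set(re.findall(r"[a-z0-9+]+", abstract.lower()))
    return len(CONTEXT_TERMS & words)


def classify(abstract: str, accept_at: int = 2, reject_below: int = 1) -> Decision:
    """Run both steps; scores between the two thresholds are deferred."""
    matched = preprocess(abstract)
    if not matched:
        # No database keyword at all: rejected at the pre-processing step.
        return Decision("rejected", matched)
    score = context_score(abstract)
    if score >= accept_at:
        return Decision("relevant", matched)
    if score < reject_below:
        return Decision("not relevant", matched)
    # Ambiguous evidence: leave the decision to the expert.
    return Decision("deferred", matched)
```

In this sketch the first step is deliberately permissive (any keyword hit passes), while the gap between `accept_at` and `reject_below` defines the band of abstracts handed to a human curator, mirroring the deferred decisions mentioned in the abstract.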

Author information

Corresponding author

Correspondence to Chiquito J. Crasto.

About this article

Cite this article

Crasto, C.J., Marenco, L.N., Migliore, M. et al. Text mining neuroscience journal articles to populate neuroscience databases. Neuroinform 1, 215–237 (2003). https://doi.org/10.1385/NI:1:3:215

  • DOI: https://doi.org/10.1385/NI:1:3:215
