Learning to Find Relevant Biological Articles without Negative Training Examples

  • Conference paper
AI 2008: Advances in Artificial Intelligence (AI 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5360)

Abstract

Classifiers are traditionally learned from sets of positive and negative training examples. Often, however, a classifier is needed when only an incomplete set of positive examples and a set of unlabeled examples are available for training. This is the situation, for example, with the Transporter Classification Database (TCDB, www.tcdb.org), a repository of information about proteins involved in transmembrane transport. This paper presents and evaluates a method for learning to rank the likely relevance to TCDB of newly published scientific articles, using the articles currently referenced in TCDB as positive training examples. The new method has identified 964 new articles relevant to TCDB in fewer than six months, which is a major practical success. From a general data mining perspective, the contributions of this paper are (i) evaluating two novel approaches that solve the positive-only problem effectively, (ii) applying support vector machines in a state-of-the-art way for recognizing and ranking relevance, and (iii) deploying a system to update a widely used, real-world biomedical database. Supplementary information, including all data sets, is publicly available at www.cs.ucsd.edu/users/knoto/pub/ajcai08.
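The positive-only setting described in the abstract can be illustrated with a small sketch. It is not the paper's method: the paper uses support vector machines, whereas the code below substitutes a much simpler scorer based on smoothed log-odds word weights, computed by contrasting the known positive articles with the unlabeled pool. The function names (`tokens`, `word_weights`, `rank_unlabeled`) and the toy article texts are assumptions made up for illustration. The idea it shows is only the general one: treat "referenced in the database" versus "unlabeled" as the two classes, then rank unlabeled articles by how strongly they resemble the referenced ones.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercased set of whitespace-separated words in a text."""
    return set(text.lower().split())


def word_weights(positives, unlabeled):
    """Smoothed log-odds of each word appearing in a positive vs. an unlabeled text."""
    pos_df = Counter(w for t in positives for w in tokens(t))
    unl_df = Counter(w for t in unlabeled for w in tokens(t))
    n_pos, n_unl = len(positives), len(unlabeled)
    weights = {}
    for w in set(pos_df) | set(unl_df):
        # Add-one smoothing keeps both probabilities strictly positive.
        p_pos = (pos_df[w] + 1) / (n_pos + 2)
        p_unl = (unl_df[w] + 1) / (n_unl + 2)
        weights[w] = math.log(p_pos) - math.log(p_unl)
    return weights


def rank_unlabeled(positives, unlabeled):
    """Return (score, text) pairs for the unlabeled texts, highest score first."""
    weights = word_weights(positives, unlabeled)
    scored = [(sum(weights[w] for w in tokens(t)), t) for t in unlabeled]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy data (invented): three known-relevant texts and three unlabeled ones,
# one of which is an unrecognized relevant article.
positives = [
    "membrane transporter protein",
    "ion channel transport membrane",
    "sugar transporter membrane",
]
unlabeled = [
    "galaxy star formation",
    "membrane transport channel protein",
    "stock market prices",
]

ranked = rank_unlabeled(positives, unlabeled)
for score, text in ranked:
    print(f"{score:+.3f}  {text}")
```

On this toy input the unlabeled text about membrane transport is ranked first, because its vocabulary overlaps with the positive set while the other two texts share no words with it. A real system would need proper tokenization, stemming, and a stronger classifier.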

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Noto, K., Saier, M.H., Elkan, C. (2008). Learning to Find Relevant Biological Articles without Negative Training Examples. In: Wobcke, W., Zhang, M. (eds) AI 2008: Advances in Artificial Intelligence. AI 2008. Lecture Notes in Computer Science, vol 5360. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89378-3_20

  • DOI: https://doi.org/10.1007/978-3-540-89378-3_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-89377-6

  • Online ISBN: 978-3-540-89378-3

  • eBook Packages: Computer Science, Computer Science (R0)
