Learning to Find Relevant Biological Articles without Negative Training Examples

Noto, Keith; Saier, Milton H.; Elkan, Charles

doi:10.1007/978-3-540-89378-3_20

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5360)

Included in the following conference series:

Australasian Joint Conference on Artificial Intelligence

Abstract

Classifiers are traditionally learned from sets of positive and negative training examples. Often, however, a classifier is needed when only an incomplete set of positive examples and a set of unlabeled examples are available for training. This is the situation, for example, with the Transport Classification Database (TCDB, www.tcdb.org), a repository of information about proteins involved in transmembrane transport. This paper presents and evaluates a method for learning to rank newly published scientific articles by their likely relevance to TCDB, using the articles currently referenced in TCDB as positive training examples. The new method has identified 964 new articles relevant to TCDB in fewer than six months, a major practical success. From a general data mining perspective, the contributions of this paper are (i) evaluating two novel approaches that solve the positive-only problem effectively, (ii) applying support vector machines in a state-of-the-art way to recognize and rank relevance, and (iii) deploying a system to update a widely used, real-world biomedical database. Supplementary information, including all data sets, is publicly available at www.cs.ucsd.edu/users/knoto/pub/ajcai08.
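
The positive-only setting described in the abstract can be illustrated with a minimal sketch. It is not the method evaluated in the paper: it assumes the simplest baseline, in which unlabeled documents are treated as negatives, and it swaps in a small logistic regression for the support vector machines the abstract mentions so that the example needs only the standard library. All document strings, function names, and parameter values are hypothetical toy choices.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real preprocessing."""
    return text.lower().split()


def train(positives, unlabeled, epochs=100, lr=0.1):
    """Logistic regression by SGD, with unlabeled documents treated as negatives."""
    data = [(Counter(tokenize(t)), 1.0) for t in positives]
    data += [(Counter(tokenize(t)), 0.0) for t in unlabeled]
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for feats, label in data:
            z = bias + sum(weights[w] * c for w, c in feats.items())
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            grad = label - 1.0 / (1.0 + math.exp(-z))
            bias += lr * grad
            for w, c in feats.items():
                weights[w] += lr * grad * c
    return weights, bias


def score(text, weights, bias):
    """Linear relevance score; words never seen in training contribute 0."""
    return bias + sum(weights[w] for w in tokenize(text))


def rank(candidates, weights, bias):
    """Order candidate articles from most to least likely relevant."""
    return sorted(candidates, key=lambda t: score(t, weights, bias), reverse=True)


# Hypothetical toy data: titles already referenced in the database (positives)
# and a background pool of unlabeled titles, one of which is a hidden positive.
positives = [
    "membrane transport protein channel",
    "transporter protein transmembrane solute",
    "ion channel membrane transporter",
]
unlabeled = [
    "galaxy cluster redshift survey",
    "stock market price forecast",
    "solute transport across membrane",
    "protein folding simulation survey",
]
new_articles = [
    "galaxy redshift forecast",
    "novel transporter channel",
    "quantum computing hardware",
]

weights, bias = train(positives, unlabeled)
ranking = rank(new_articles, weights, bias)
print(ranking)
```

Because the output is a ranking rather than a hard decision, a curator can read candidates from the top down and stop wherever relevance drops off, which is why the hidden positive in the unlabeled pool does less harm here than it would for a yes/no classifier.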

Author information

Authors and Affiliations

University of California, La Jolla, CA 92093, USA
Keith Noto, Milton H. Saier Jr. & Charles Elkan

Editor information

Editors and Affiliations

School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
Wayne Wobcke
School of Mathematics, Statistics and Computer Science, Victoria University of Wellington, P.O. Box 600, 6140, Wellington, New Zealand
Mengjie Zhang

About this paper

Cite this paper

Noto, K., Saier, M.H., Elkan, C. (2008). Learning to Find Relevant Biological Articles without Negative Training Examples. In: Wobcke, W., Zhang, M. (eds) AI 2008: Advances in Artificial Intelligence. AI 2008. Lecture Notes in Computer Science, vol 5360. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89378-3_20

DOI: https://doi.org/10.1007/978-3-540-89378-3_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89377-6
Online ISBN: 978-3-540-89378-3
eBook Packages: Computer Science, Computer Science (R0)
