Abstract
Classifiers are traditionally learned from sets of positive and negative training examples. Often, however, a classifier is needed when only an incomplete set of positive examples and a set of unlabeled examples are available for training. This is the situation, for example, with the Transport Classification Database (TCDB, www.tcdb.org), a repository of information about proteins involved in transmembrane transport. This paper presents and evaluates a method for learning to rank newly published scientific articles by their likely relevance to TCDB, using the articles currently referenced in TCDB as positive training examples. The new method has identified 964 new articles relevant to TCDB in fewer than six months, a major practical success. From a general data mining perspective, the contributions of this paper are (i) evaluating two novel approaches that solve the positive-only problem effectively, (ii) applying support vector machines in a state-of-the-art way for recognizing and ranking relevance, and (iii) deploying a system to update a widely used, real-world biomedical database. Supplementary information, including all data sets, is publicly available at www.cs.ucsd.edu/users/knoto/pub/ajcai08.
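To make the positive-and-unlabeled ranking setting concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes the simplest baseline: treat every unlabeled article as a provisional negative, fit a linear model, and rank new articles by its decision value. Logistic regression trained by gradient steps stands in for the support vector machines named in the abstract, and all article texts, function names, and parameter values are hypothetical.

```python
import math
from collections import defaultdict

def featurize(text):
    """Bag-of-words term counts from lowercased whitespace tokens."""
    counts = defaultdict(int)
    for token in text.lower().split():
        counts[token] += 1
    return dict(counts)

def sigmoid(z):
    # Clamp to avoid overflow in math.exp for extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train(docs, labels, epochs=50, lr=0.1):
    """Fit logistic regression with per-example gradient steps."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for feats, y in zip(docs, labels):
            z = bias + sum(weights[t] * c for t, c in feats.items())
            g = y - sigmoid(z)  # gradient of the log-likelihood w.r.t. z
            bias += lr * g
            for t, c in feats.items():
                weights[t] += lr * g * c
    return dict(weights), bias

def decision_value(feats, weights, bias):
    """Linear score; higher means more similar to the positive set."""
    return bias + sum(weights.get(t, 0.0) * c for t, c in feats.items())

# Hypothetical toy data: articles already referenced in the database (positives)
# and unlabeled articles, one of which is plausibly a hidden positive.
positives = [
    "membrane transport protein channel",
    "transporter protein membrane permease",
    "solute transport across membrane",
]
unlabeled = [
    "stock market prices fall",
    "football match final result",
    "ion channel membrane study",
]

# Assumption of this baseline: unlabeled examples are labeled 0 during training.
docs = [featurize(t) for t in positives + unlabeled]
labels = [1] * len(positives) + [0] * len(unlabeled)
weights, bias = train(docs, labels)

# Rank newly published (hypothetical) articles by decision value, best first.
new_articles = [
    "football market result",
    "protein transport permease",
    "stock prices fall",
]
ranked = sorted(
    ((decision_value(featurize(t), weights, bias), t) for t in new_articles),
    reverse=True,
)
for value, text in ranked:
    print(f"{value:+.3f}  {text}")
```

Because only the ordering matters for ranking, the sketch sorts by the raw linear score rather than a calibrated probability. Treating unlabeled articles as negatives biases the scores downward when the unlabeled set contains hidden positives, which is the difficulty the positive-only setting raises.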
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Noto, K., Saier, M.H., Elkan, C. (2008). Learning to Find Relevant Biological Articles without Negative Training Examples. In: Wobcke, W., Zhang, M. (eds) AI 2008: Advances in Artificial Intelligence. AI 2008. Lecture Notes in Computer Science, vol 5360. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89378-3_20
DOI: https://doi.org/10.1007/978-3-540-89378-3_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89377-6
Online ISBN: 978-3-540-89378-3
eBook Packages: Computer Science, Computer Science (R0)