Abstract
Leo Breiman’s Random Forest ensemble learning procedure is applied to the problem of Quantitative Structure-Activity Relationship (QSAR) modeling for pharmaceutical molecules. This entails using a quantitative description of a compound’s molecular structure to predict that compound’s biological activity as measured in an in vitro assay. Without any parameter tuning, Random Forest with default settings already performs as well as or better than three other prominent QSAR methods on six publicly available data sets: Decision Tree, Partial Least Squares, and Support Vector Machine. In addition to reliable prediction accuracy, Random Forest provides variable importance measures, which can be used in a variable reduction wrapper algorithm. Comparisons among several such wrappers, and between Random Forest and Bagging, are presented.
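The variable reduction wrapper mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the wrapper, so the halving schedule below is an assumption, and the names `halving_wrapper`, `rank_variables`, and `abs_correlation` are hypothetical. To stay dependency-free, the importance score is a stand-in (absolute Pearson correlation with the activity), not Random Forest's own importance measure; a forest-based ranking would be passed in through the `rank` argument.

```python
import random
from statistics import mean


def abs_correlation(xs, ys):
    """Stand-in importance: |Pearson correlation| of one descriptor with activity.
    This is NOT Random Forest importance; it only makes the sketch runnable."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def rank_variables(X, y, columns):
    """Return the given column indices ordered from most to least important."""
    scores = {j: abs_correlation([row[j] for row in X], y) for j in columns}
    return sorted(columns, key=lambda j: scores[j], reverse=True)


def halving_wrapper(X, y, min_vars=1, rank=rank_variables):
    """Backward elimination (assumed schedule): rank the surviving variables,
    keep the top half, and repeat until min_vars remain.
    Returns the list of variable subsets visited, largest first."""
    columns = list(range(len(X[0])))
    path = [columns]
    while len(columns) > min_vars:
        ranked = rank(X, y, columns)
        keep = max(min_vars, len(ranked) // 2)
        columns = ranked[:keep]
        path.append(columns)
    return path


# Toy data: 8 descriptors, of which only columns 0 and 3 drive the activity.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
y = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0, 0.1) for row in X]

path = halving_wrapper(X, y, min_vars=2)
sizes = [len(cols) for cols in path]
print(sizes)             # number of variables kept at each step
print(sorted(path[-1]))  # surviving descriptor indices
```

In practice each subset in `path` would be scored by cross-validated prediction error, with the whole ranking-and-elimination procedure repeated inside every fold so that the error estimate is not biased by the selection step.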
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Svetnik, V., Liaw, A., Tong, C., Wang, T. (2004). Application of Breiman’s Random Forest to Modeling Structure-Activity Relationships of Pharmaceutical Molecules. In: Roli, F., Kittler, J., Windeatt, T. (eds) Multiple Classifier Systems. MCS 2004. Lecture Notes in Computer Science, vol 3077. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-25966-4_33
DOI: https://doi.org/10.1007/978-3-540-25966-4_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22144-9
Online ISBN: 978-3-540-25966-4
eBook Packages: Springer Book Archive