Abstract
Analytic formulae are used to estimate the error for two virtual screening metrics, enrichment factor and area under the ROC curve. These analytic error estimates are then compared to bootstrapping error estimates, and shown to have excellent agreement with respect to area under the ROC curve and good agreement with respect to enrichment factor. The major advantage of the analytic formulae is that they are trivial to calculate and depend only on the number of actives and inactives and the measured value of the metric, information commonly reported in papers. In contrast to this, the bootstrapping method requires the individual compound scores. Methods for converting the error, which is calculated as a variance, into more familiar error bars are also discussed.
References
McGann M (2011) J Chem Inf Model 51(3):578–596
Hanley JA, McNeil BJ (1983) Radiology 148(3):839–843
Hanley JA, McNeil BJ (1982) Radiology 143(1):29–36
Triballeau N, Acher F, Brabet I, Pin JP, Bertrand HO (2005) J Med Chem 48(7):2534–2547
Henderson AR (2005) Clin Chim Acta 359(1–2):1–26
Nicholls A (2008) J Comput Aided Mol Des 22(3):239–255
Jain A, Nicholls A (2008) J Comput Aided Mol Des 22(3):133–139
Jain AN (2007) J Comput Aided Mol Des 21(5):281–306
Nicholls A (2014) J Comput Aided Mol Des 28(9):887–918
OMEGA OpenEye Scientific Software, 3600 Cerrillos Rd., Suite 1107 Santa Fe, NM 87507
FRED OpenEye Scientific Software, 3600 Cerrillos Rd., Suite 1107 Santa Fe, NM 87507
Author information
Additional information
Mark McGann and Istvan Enyedy have contributed equally to this work.
Appendix: Calculating 95 % confidence assuming a binomial distribution
The details of calculating a CI95 using a binomial distribution bear some explanation. A binomial distribution is a discrete distribution on the range [0, N], with a value at each integer n in the range. Each value, \(f\left( {n;N,p} \right)\), is the probability of getting exactly n successes in N trials. Formally, the definition is
\[f\left( {n;N,p} \right) = \binom{N}{n}p^{n} \left( {1 - p} \right)^{N - n}\]
where p is the probability of success, in our case the AUC, and
\[\binom{N}{n} = \frac{N!}{n!\left( {N - n} \right)!}\]
What we lack in the equations above is the number of trials, N, which we can compute by recognizing that the variance of the success fraction n/N of a binomial distribution is \(\sigma^{2} = p\left( {1 - p} \right)/N\) and that we have the variance from the Hanley formula (\(\sigma_{AUC}^{2}\)) shown above. Thus we can solve for N as follows
\[N = \frac{{p\left( {1 - p} \right)}}{{\sigma_{AUC}^{2} }}\]
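As a minimal sketch of this step, the effective number of trials follows directly from rearranging \(\sigma^{2} = p\left( {1 - p} \right)/N\). The function name `effective_trials` and the input values are hypothetical, chosen only for illustration; the variance is assumed to have been computed already from the analytic formula.

```python
def effective_trials(p: float, variance: float) -> float:
    """Solve sigma^2 = p(1 - p)/N for N.

    p        -- measured value treated as a success probability (e.g. the AUC)
    variance -- analytic variance of that value (assumed to be supplied)
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if variance <= 0.0:
        raise ValueError("variance must be positive")
    return p * (1.0 - p) / variance


# Hypothetical inputs: AUC = 0.80 with variance 0.0016 (standard error 0.04).
n_trials = effective_trials(0.80, 0.0016)
print(n_trials)  # approximately 100
```

In general the result is not an integer, so some convention (such as rounding) is needed before the discrete distribution is built.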
Now, recalling that p is simply the measured AUC, we can construct the binomial distribution. This distribution is discrete rather than continuous, but it becomes approximately continuous when N is large; in practice, we have found that creating a continuous distribution by interpolating the value between points is effective.
Once the appropriate binomial distribution is constructed, we construct a cumulative distribution curve for the binomial and read the values at 2.5 and 97.5 % to obtain the 95 % confidence interval.
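The procedure can be sketched as follows. This is one possible reading of the interpolation step, not necessarily the authors' implementation: N is rounded to the nearest integer, the probabilities at neighbouring integers are joined by straight lines, the resulting curve is accumulated with the trapezoid rule and normalized, and the 2.5 % and 97.5 % levels are read off by linear interpolation. The function names and the rounding convention are assumptions.

```python
import math


def binomial_pmf(n: int, k: int, p: float) -> float:
    """f(k; n, p), evaluated in log space so that large n does not overflow."""
    log_f = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_f)


def binomial_ci95(p: float, variance: float) -> tuple[float, float]:
    """Return (lower, upper) 95 % bounds on p, in probability units."""
    if not 0.0 < p < 1.0 or variance <= 0.0:
        raise ValueError("need 0 < p < 1 and variance > 0")
    # Assumption: round the effective number of trials to an integer.
    n = max(1, round(p * (1.0 - p) / variance))
    pmf = [binomial_pmf(n, k, p) for k in range(n + 1)]

    # Cumulative curve of the piecewise-linear interpolated distribution.
    cum = [0.0]
    for k in range(1, n + 1):
        cum.append(cum[-1] + 0.5 * (pmf[k - 1] + pmf[k]))
    total = cum[-1]

    def read_off(level: float) -> float:
        target = level * total
        for k in range(1, n + 1):
            if cum[k] >= target:
                frac = (target - cum[k - 1]) / (cum[k] - cum[k - 1])
                return (k - 1 + frac) / n  # convert successes to a fraction
        return 1.0

    return read_off(0.025), read_off(0.975)


# Hypothetical inputs: AUC = 0.80 with variance 0.0016.
lower, upper = binomial_ci95(0.80, 0.0016)
print(f"95 % CI: [{lower:.3f}, {upper:.3f}]")
```

Unlike a symmetric interval of ±1.96σ, bounds obtained this way always stay within [0, 1] and become asymmetric as p approaches either end of the range.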
The above calculations are described for AUC, but the same method can be applied to EF by recognizing that \(EF\left( {f_{I} } \right) \cdot f_{I}\) is also a probability and using this value in place of AUC. The bounds of the resulting 95 % confidence interval are then divided by \(f_{I}\) to convert the result from the probability units [0,1] back to the EF units.
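The unit conversion for EF can be sketched as below. Because \(EF \cdot f_{I} = p\), it follows that \(EF = p/f_{I}\), so bounds obtained in probability units are divided by \(f_{I}\) to return to EF units. The function names are hypothetical, and the variance scaling \(\sigma_{p}^{2} = f_{I}^{2} \sigma_{EF}^{2}\) is an assumption that follows from treating \(f_{I}\) as a fixed constant; the source does not state it explicitly.

```python
def ef_to_probability(ef: float, f_i: float) -> float:
    """Convert an enrichment factor measured at fraction f_i to a probability."""
    p = ef * f_i
    if not 0.0 <= p <= 1.0:
        raise ValueError("EF * f_i must lie in [0, 1]")
    return p


def ef_variance_to_probability(var_ef: float, f_i: float) -> float:
    """Assumption: f_i is a constant, so Var(EF * f_i) = f_i**2 * Var(EF)."""
    return f_i * f_i * var_ef


def probability_interval_to_ef(lower: float, upper: float,
                               f_i: float) -> tuple[float, float]:
    """Convert confidence bounds from probability units back to EF units."""
    return lower / f_i, upper / f_i


# Hypothetical example: EF = 10 measured at f_i = 0.01 corresponds to p = 0.1.
p = ef_to_probability(10.0, 0.01)
# Illustrative probability-unit bounds, not computed values.
ef_low, ef_high = probability_interval_to_ef(0.05, 0.20, 0.01)
print(p, ef_low, ef_high)
```

The probability and its variance produced here would be passed through the same binomial construction used for AUC before the final conversion back to EF units.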
Cite this article
McGann, M., Nicholls, A. & Enyedy, I. The statistics of virtual screening and lead optimization. J Comput Aided Mol Des 29, 923–936 (2015). https://doi.org/10.1007/s10822-015-9861-4
DOI: https://doi.org/10.1007/s10822-015-9861-4