The statistics of virtual screening and lead optimization

  • SPECIAL SERIES: STATISTICS IN MOLECULAR MODELING
  • Guest Editor: Anthony Nicholls
  • Journal of Computer-Aided Molecular Design

Abstract

Analytic formulae are used to estimate the error for two virtual screening metrics, enrichment factor and area under the ROC curve. These analytic error estimates are then compared to bootstrapping error estimates, and shown to have excellent agreement with respect to area under the ROC curve and good agreement with respect to enrichment factor. The major advantage of the analytic formulae is that they are trivial to calculate and depend only on the number of actives and inactives and the measured value of the metric, information commonly reported in papers. In contrast to this, the bootstrapping method requires the individual compound scores. Methods for converting the error, which is calculated as a variance, into more familiar error bars are also discussed.

References

  1. McGann M (2011) J Chem Inf Model 51(3):578–596

  2. Hanley JA, McNeil BJ (1983) Radiology 148(3):839–843

  3. Hanley JA, McNeil BJ (1982) Radiology 143(1):29–36

  4. Triballeau N, Acher F, Brabet I, Pin JP, Bertrand HO (2005) J Med Chem 48(7):2534–2547

  5. Henderson AR (2005) Clin Chim Acta 359(1–2):1–26

  6. Nicholls A (2008) J Comput Aided Mol Des 22(3):239–255

  7. Jain A, Nicholls A (2008) J Comput Aided Mol Des 22(3):133–139

  8. Jain AN (2007) J Comput Aided Mol Des 21(5):281–306

  9. Nicholls A (2014) J Comput Aided Mol Des 28(9):887–918

  10. OMEGA. OpenEye Scientific Software, 3600 Cerrillos Rd., Suite 1107, Santa Fe, NM 87507

  11. FRED. OpenEye Scientific Software, 3600 Cerrillos Rd., Suite 1107, Santa Fe, NM 87507

Author information

Corresponding authors

Correspondence to Mark McGann or Istvan Enyedy.

Additional information

Mark McGann and Istvan Enyedy have contributed equally to this work.

Appendix: Calculating 95 % confidence assuming a binomial distribution

The calculation of a 95 % confidence interval (CI95) from a binomial distribution deserves some explanation. A binomial distribution is a discrete distribution defined at each integer n in the range [0, N]. Each value, \(f(n; N, p)\), is the probability of obtaining exactly n successes in N trials. Formally, the definition is

$$f\left( n; N, p \right) = \binom{N}{n} p^{n} \left( 1 - p \right)^{N - n}$$

where p is the probability of success, in our case the AUC, and

$$\binom{N}{n} = \frac{N!}{n!\left( N - n \right)!}$$
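The definition above can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `binomial_pmf` and the example values are illustrative assumptions, not part of the original work.

```python
from math import comb


def binomial_pmf(n: int, N: int, p: float) -> float:
    """Probability of exactly n successes in N trials with success probability p."""
    # comb(N, n) is the binomial coefficient N! / (n! (N - n)!).
    return comb(N, n) * p**n * (1 - p) ** (N - n)


# Illustrative check: the probabilities over n = 0..N sum to one.
total = sum(binomial_pmf(n, 20, 0.8) for n in range(21))
```

For large N the direct product can overflow or underflow, in which case a log-space evaluation is the safer choice.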

What the equations above still lack is the number of trials, N. We can compute it by recognizing that the variance of the success fraction n/N of a binomial distribution is \(\sigma^{2} = p\left( {1 - p} \right)/N\) and that we already have a variance from the Hanley formula (\(\sigma_{AUC}^{2}\)) given in the main text. Thus we can solve for N as follows

$$N = \frac{{p\left( {1 - p} \right)}}{{\sigma_{AUC}^{2} }}$$
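As a rough illustration of this step, the sketch below computes an AUC variance and the corresponding effective number of trials. It assumes the widely quoted Hanley and McNeil (1982) expression for the variance; the exact form used in the main text is not reproduced in this appendix, so treat `hanley_auc_variance` as an assumption. The function names and the example numbers are likewise hypothetical.

```python
def hanley_auc_variance(auc: float, n_actives: int, n_inactives: int) -> float:
    """AUC variance from the Hanley-McNeil (1982) expression (assumed form)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc**2 / (1.0 + auc)
    numerator = (
        auc * (1.0 - auc)
        + (n_actives - 1) * (q1 - auc**2)
        + (n_inactives - 1) * (q2 - auc**2)
    )
    return numerator / (n_actives * n_inactives)


def effective_trials(p: float, variance: float) -> float:
    """Solve sigma^2 = p (1 - p) / N for N; the result is generally not an integer."""
    return p * (1.0 - p) / variance


# Illustrative values only: AUC of 0.8 with 50 actives and 1000 inactives.
variance = hanley_auc_variance(0.8, n_actives=50, n_inactives=1000)
n_trials = effective_trials(0.8, variance)
```

Only the measured AUC and the counts of actives and inactives enter the calculation, which is what makes the analytic route usable with values reported in papers.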

Now, recalling that p is simply the measured AUC, we can construct the binomial distribution. This distribution is discrete rather than continuous, but it becomes approximately continuous when N is large. In practice, we have found that creating a continuous distribution by interpolating between the discrete points is effective.

Once the appropriate binomial distribution is constructed, we build its cumulative distribution curve and read off the values of n at the 2.5 % and 97.5 % levels. Dividing these values by N expresses them as probabilities and gives the 95 % confidence interval.
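The procedure can be sketched as follows. This is one possible reading, not the authors' implementation: it assumes that N is rounded to the nearest integer, that the cumulative curve is interpolated linearly between integer points, and that the probability mass is evaluated in log space for numerical stability. The function name `binomial_ci95` is hypothetical.

```python
from math import exp, lgamma, log


def binomial_ci95(p: float, variance: float) -> tuple[float, float]:
    """Approximate 95 % interval for p from a binomial with N = p (1 - p) / variance."""
    if not 0.0 < p < 1.0 or variance <= 0.0:
        raise ValueError("need 0 < p < 1 and a positive variance")
    # Assumption: round the effective number of trials to an integer.
    n_trials = max(1, round(p * (1.0 - p) / variance))
    log_p, log_q = log(p), log(1.0 - p)

    # Cumulative distribution at n = 0..N, with each term computed in log space.
    cdf, running = [], 0.0
    for n in range(n_trials + 1):
        log_pmf = (
            lgamma(n_trials + 1) - lgamma(n + 1) - lgamma(n_trials - n + 1)
            + n * log_p + (n_trials - n) * log_q
        )
        running += exp(log_pmf)
        cdf.append(running)

    def quantile(level: float) -> float:
        # Linear interpolation between the two integer points that bracket the level.
        for n, value in enumerate(cdf):
            if value >= level:
                if n == 0:
                    return 0.0
                previous = cdf[n - 1]
                return (n - 1) + (level - previous) / (value - previous)
        return float(n_trials)

    # Divide by N to express the bounds as probabilities.
    return quantile(0.025) / n_trials, quantile(0.975) / n_trials


# Illustrative values only: p = 0.8 with a variance of 0.0016 (N = 100).
lower, upper = binomial_ci95(0.8, 0.0016)
```

Unlike a symmetric interval built from the standard deviation, the bounds obtained this way always stay within [0, 1].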

The above calculations are described for AUC, but the same method can be applied to EF by recognizing that \(EF(f_{I}) \cdot f_{I}\) is also a probability and using this value in place of AUC. Because the probability is EF multiplied by \(f_{I}\), the resulting 95 % confidence interval is then divided by \(f_{I}\) to convert the result from probability units [0, 1] to EF units.
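The unit conversion for EF can be sketched as below. Since the probability is the product of EF and fI, the interval bounds return to EF units on division by fI. The function names and the numbers are purely illustrative assumptions; the interval in probability units would come from the binomial construction described above.

```python
def ef_to_probability(ef: float, f_i: float) -> float:
    """Convert an enrichment factor measured at fraction f_i into a probability."""
    return ef * f_i


def probability_interval_to_ef(
    interval: tuple[float, float], f_i: float
) -> tuple[float, float]:
    """Convert interval bounds from probability units [0, 1] back to EF units."""
    lower, upper = interval
    return lower / f_i, upper / f_i


# Illustrative values only: EF of 20 at f_i = 0.01 corresponds to a probability of 0.2.
p = ef_to_probability(20.0, 0.01)
# A hypothetical interval in probability units, mapped back to EF units.
ef_interval = probability_interval_to_ef((0.1, 0.3), 0.01)
```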

About this article

Cite this article

McGann, M., Nicholls, A. & Enyedy, I. The statistics of virtual screening and lead optimization. J Comput Aided Mol Des 29, 923–936 (2015). https://doi.org/10.1007/s10822-015-9861-4

  • DOI: https://doi.org/10.1007/s10822-015-9861-4
