The statistics of virtual screening and lead optimization

McGann, Mark; Nicholls, Anthony; Enyedy, Istvan

doi:10.1007/s10822-015-9861-4

SPECIAL SERIES: STATISTICS IN MOLECULAR MODELING
Guest Editor: Anthony Nicholls
Published: 19 October 2015

Volume 29, pages 923–936, (2015)
Journal of Computer-Aided Molecular Design

Mark McGann¹,
Anthony Nicholls¹ &
Istvan Enyedy²

Abstract

Analytic formulae are used to estimate the error for two virtual screening metrics, enrichment factor and area under the ROC curve. These analytic error estimates are then compared to bootstrapping error estimates, and shown to have excellent agreement with respect to area under the ROC curve and good agreement with respect to enrichment factor. The major advantage of the analytic formulae is that they are trivial to calculate and depend only on the number of actives and inactives and the measured value of the metric, information commonly reported in papers. In contrast to this, the bootstrapping method requires the individual compound scores. Methods for converting the error, which is calculated as a variance, into more familiar error bars are also discussed.

References

McGann M (2011) J Chem Inf Model 51(3):578–596
Hanley JA, McNeil BJ (1983) Radiology 148(3):839–843
Hanley JA, McNeil BJ (1982) Radiology 143(1):29–36
Triballeau N, Acher F, Brabet I, Pin JP, Bertrand HO (2005) J Med Chem 48(7):2534–2547
Henderson AR (2005) Clin Chim Acta 359(1–2):1–26
Nicholls A (2008) J Comput Aided Mol Des 22(3):239–255
Jain A, Nicholls A (2008) J Comput Aided Mol Des 22(3):133–139
Jain AN (2007) J Comput Aided Mol Des 21(5):281–306
Nicholls A (2014) J Comput Aided Mol Des 28(9):887–918
OMEGA OpenEye Scientific Software, 3600 Cerrillos Rd., Suite 1107 Santa Fe, NM 87507
FRED OpenEye Scientific Software, 3600 Cerrillos Rd., Suite 1107 Santa Fe, NM 87507

Author information

Authors and Affiliations

OpenEye Scientific, Santa Fe, NM, USA
Mark McGann & Anthony Nicholls
Biogen, 115 Broadway, Cambridge, MA, USA
Istvan Enyedy

Corresponding authors

Correspondence to Mark McGann or Istvan Enyedy.

Additional information

Mark McGann and Istvan Enyedy have contributed equally to this work.

Appendix: Calculating 95 % confidence assuming a binomial distribution

The details of calculating a CI95 using a binomial distribution bear some explanation. A binomial distribution is a discrete distribution over the range [0, N], with a value at each integer n in the range. Each value, f(n; N, p), is the probability of getting exactly n successes in N trials. Formally, the definition is

$$f(n; N, p) = \binom{N}{n} p^{n} (1 - p)^{N - n}$$

where p is the probability of success, in our case the AUC, and

$$\binom{N}{n} = \frac{N!}{n!\,(N - n)!}$$

What the equations above lack is the number of trials, N. We can compute it by recognizing that the variance of the observed success fraction n/N under a binomial distribution is $\sigma^{2} = p\left( {1 - p} \right)/N$, and that we already have a variance from the Hanley formula ($\sigma_{AUC}^{2}$) shown above. Thus we can solve for N as follows

$$N = \frac{{p\left( {1 - p} \right)}}{{\sigma_{AUC}^{2} }}$$
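As a minimal sketch of this step, the snippet below computes an AUC variance and then solves for the effective number of trials N. It assumes the commonly cited Hanley and McNeil (1982) approximation for the variance, which may differ in form from the expression used in the body of the article; the function names `hanley_mcneil_variance` and `effective_trials` are hypothetical and chosen only for illustration.

```python
def hanley_mcneil_variance(auc, n_actives, n_inactives):
    """AUC variance from the Hanley-McNeil (1982) approximation.

    Assumption: this is the standard textbook form; the article's own
    expression may be written differently.
    """
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    numerator = (
        auc * (1.0 - auc)
        + (n_actives - 1) * (q1 - auc * auc)
        + (n_inactives - 1) * (q2 - auc * auc)
    )
    return numerator / (n_actives * n_inactives)


def effective_trials(p, variance):
    """Solve sigma^2 = p(1 - p)/N for N; the result is generally not an integer."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if variance <= 0.0:
        raise ValueError("variance must be positive")
    return p * (1.0 - p) / variance


if __name__ == "__main__":
    # Illustrative inputs only: AUC of 0.8 with 50 actives and 1000 inactives.
    var_auc = hanley_mcneil_variance(0.8, 50, 1000)
    print(var_auc, effective_trials(0.8, var_auc))
```

With these illustrative inputs the effective N comes out at about 110, far smaller than the number of compounds screened, which reflects that the uncertainty is dominated by the smaller of the two classes.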

Now, recalling that p is simply the measured AUC, we can construct the binomial distribution. This distribution is discrete rather than continuous, but it becomes approximately continuous when N is large. In practice we have found that creating a continuous distribution by interpolating between the points is effective.

Once the appropriate binomial distribution is constructed, we construct a cumulative distribution curve for the binomial and read the values at 2.5 and 97.5 % to obtain the 95 % confidence interval.
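The procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: N is rounded to the nearest integer, the cumulative curve is interpolated linearly between integer points, and the names `binomial_pmf` and `binomial_ci95` are hypothetical.

```python
import math


def binomial_pmf(n, N, p):
    """f(n; N, p), evaluated in log space so that large N does not overflow."""
    log_coeff = math.lgamma(N + 1) - math.lgamma(n + 1) - math.lgamma(N - n + 1)
    return math.exp(log_coeff + n * math.log(p) + (N - n) * math.log(1.0 - p))


def binomial_ci95(p, variance):
    """Approximate 95 % interval for p, given its variance.

    Assumptions of this sketch: N = p(1 - p)/variance is rounded to an
    integer, and the cumulative curve is linearly interpolated.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    N = max(1, round(p * (1.0 - p) / variance))

    # Cumulative distribution at each integer number of successes.
    cdf = []
    total = 0.0
    for n in range(N + 1):
        total += binomial_pmf(n, N, p)
        cdf.append(total)

    def quantile(q):
        previous = 0.0
        for n, current in enumerate(cdf):
            if current >= q:
                if n == 0:
                    return 0.0
                # Interpolate between the points n - 1 and n.
                fraction = (q - previous) / (current - previous)
                return (n - 1 + fraction) / N
            previous = current
        return 1.0

    return quantile(0.025), quantile(0.975)


if __name__ == "__main__":
    # Illustrative inputs only: AUC of 0.8 with a variance of 0.0016 (N = 100).
    low, high = binomial_ci95(0.8, 0.0016)
    print(low, high)
```

Unlike a symmetric interval of the form p ± 1.96σ, an interval read from the binomial cumulative curve stays inside [0, 1] and is asymmetric when p is close to either bound.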

The above calculations are described for AUC, but the same method can be applied to EF by recognizing that EF(f_I) × f_I is also a probability and using this value in place of AUC. The resulting 95 % confidence interval is then divided by f_I to convert the result from probability units [0, 1] back to EF units.

About this article

Cite this article

McGann, M., Nicholls, A. & Enyedy, I. The statistics of virtual screening and lead optimization. J Comput Aided Mol Des 29, 923–936 (2015). https://doi.org/10.1007/s10822-015-9861-4

Received: 30 June 2015
Accepted: 13 July 2015
Published: 19 October 2015
Issue Date: October 2015
DOI: https://doi.org/10.1007/s10822-015-9861-4
