Comparing classification models—a practical tutorial

W. Patrick Walters

doi:10.1007/s10822-021-00417-2

Published: 22 September 2021

Volume 36, pages 381–389 (2022)

Journal of Computer-Aided Molecular Design

W. Patrick Walters ORCID: orcid.org/0000-0003-2860-7958¹

Abstract

While machine learning models have become a mainstay in Cheminformatics, the field has yet to agree on standards for model evaluation and comparison. In many cases, authors compare methods by performing multiple folds of cross-validation and reporting the mean value for an evaluation metric such as the area under the receiver operating characteristic. These comparisons of mean values often lack statistical rigor and can lead to inaccurate conclusions. In the interest of encouraging best practices, this tutorial provides an example of how multiple methods can be compared in a statistically rigorous fashion.
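To make the concern about comparing mean values concrete, here is a minimal sketch of one way to test whether a difference in per-fold scores is distinguishable from noise. It is an illustration only, not the procedure used in the article: the function name `paired_sign_flip_test` and the AUC values are hypothetical, and the exact sign-flip permutation test is just one of several paired tests that could be applied to cross-validation results.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired per-fold scores.

    Returns the mean per-fold difference (A minus B) and the p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    # Under the null hypothesis the two models are interchangeable, so the
    # sign of each per-fold difference is arbitrary. Enumerate all 2**n
    # sign assignments and count those at least as extreme as observed.
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return mean(diffs), extreme / total


# Hypothetical per-fold AUC values for two models on the same five folds.
auc_model_a = [0.84, 0.86, 0.83, 0.85, 0.87]
auc_model_b = [0.82, 0.85, 0.84, 0.83, 0.85]

mean_diff, p_value = paired_sign_flip_test(auc_model_a, auc_model_b)
print(f"mean AUC difference: {mean_diff:.3f}, p-value: {p_value:.4f}")
```

With these made-up numbers, model A has the higher mean AUC, yet the exact two-sided p-value is 6/32, roughly 0.19, so the difference would not be called significant at the usual 0.05 level. Full enumeration is only practical for a small number of folds. Scores from overlapping cross-validation training sets are also not fully independent, which is one reason dedicated procedures such as the 5x2cv test (Dietterich 1998, listed in the References) exist.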

Data availability

All data are available in a GitHub repository: https://github.com/PatWalters/jcamd_model_comparison

Code availability

All code is available in a GitHub repository: https://github.com/PatWalters/jcamd_model_comparison

References

Walters WP, Barzilay R (2021) Critical assessment of AI in drug discovery. Expert Opin Drug Discov. https://doi.org/10.1080/17460441.2021.1915982
Elton DC, Boukouvalas Z, Fuge MD, Chung PW (2019) Deep learning for molecular design—a review of the state of the art. Mol Syst Des Eng 4:828–849
Bender A, Cortés-Ciriano I (2021) Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 1: ways to make an impact, and why we are not there yet. Drug Discov Today 26:511–524
Bender A, Cortes-Ciriano I (2021) Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 2: a discussion of chemical and biological data. Drug Discov Today. https://doi.org/10.1016/j.drudis.2020.11.037
Vamathevan J, Clark D, Czodrowski P et al (2019) Applications of machine learning in drug discovery and development. Nat Rev Drug Discov. https://doi.org/10.1038/s41573-019-0024-5
Nicholls A (2011) What do we know?: simple statistical techniques that help. Methods Mol Biol 672:531–581
Jain AN, Nicholls A (2008) Recommendations for evaluation of computational methods. J Comput Aided Mol Des 22:133–139
Nicholls A (2014) Confidence limits, error bars and method comparison in molecular modeling. Part 1: the calculation of confidence intervals. J Comput Aided Mol Des 28:887–918
Nicholls A (2016) Confidence limits, error bars and method comparison in molecular modeling. Part 2: comparing methods. J Comput Aided Mol Des 30:103–126
Jamieson C, Moir EM, Rankovic Z, Wishart G (2008) Strategy and tactics for hERG optimizations. Antitargets. Wiley, Hoboken, pp 423–455
Gaulton A, Bellis LJ, Bento AP et al (2011) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40:D1100–D1107
Bento AP, Gaulton A, Hersey A et al (2013) The ChEMBL bioactivity database: an update. Nucleic Acids Res 42:D1083–D1090
jcamd_model_comparison. Available at https://github.com/PatWalters/jcamd_model_comparison
Czodrowski P (2013) hERG me out. J Chem Inf Model 53:2240–2251
McKinney W (2017) Python for data analysis: data wrangling with pandas, NumPy, and IPython. O’Reilly Media, Incorporated, Sebastopol
Esposito C, Landrum GA, Schneider N et al (2021) GHOST: adjusting the decision threshold to handle imbalanced data in machine learning. J Chem Inf Model. https://doi.org/10.1021/acs.jcim.1c00160
Cáceres EL, Mew NC, Keiser MJ (2020) Adding stochastic negative examples into machine learning improves molecular bioactivity prediction. J Chem Inf Model. https://doi.org/10.1021/acs.jcim.0c00565
Lopez-Del Rio A, Picart-Armada S, Perera-Lluna A (2021) Balancing data on deep learning-based proteochemometric activity classification. J Chem Inf Model 61:1657–1669
Svetnik V, Liaw A, Tong C et al (2003) Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci 43:1947–1958
Sheridan RP, Liaw A, Tudor M (2021) Light gradient boosting machine as a regression method for quantitative structure-activity relationships. arXiv [q-bio.BM]
RDKit: open-source cheminformatics software. Available at https://github.com/rdkit/rdkit. Accessed 28 Feb 2021
Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742–754
Truchon J-F, Bayly CI (2007) Evaluating virtual screening methods: good and bad metrics for the “early recognition” problem. J Chem Inf Model 47:488–508
Nicholls A (2008) What do we know and when do we know it? J Comput Aided Mol Des 22:239–255
Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10:1895–1923
Mlxtend. Available at http://rasbt.github.io/mlxtend/

Author information

Authors and Affiliations

Relay Therapeutics, 399 Binney St, Cambridge, MA, 02141, USA
W. Patrick Walters

Corresponding author

Correspondence to W. Patrick Walters.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

For the special issue in honor of Gerry Maggiora.

About this article

Cite this article

Walters, W.P. Comparing classification models—a practical tutorial. J Comput Aided Mol Des 36, 381–389 (2022). https://doi.org/10.1007/s10822-021-00417-2

Received: 27 June 2021
Accepted: 18 August 2021
Published: 22 September 2021
Issue Date: May 2022
DOI: https://doi.org/10.1007/s10822-021-00417-2
