Abstract
While machine learning models have become a mainstay in cheminformatics, the field has yet to agree on standards for model evaluation and comparison. In many cases, authors compare methods by performing multiple folds of cross-validation and reporting the mean value of an evaluation metric such as the area under the receiver operating characteristic curve (ROC AUC). Such comparisons of mean values often lack statistical rigor and can lead to inaccurate conclusions. To encourage best practices, this tutorial provides an example of how multiple methods can be compared in a statistically rigorous fashion.
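To make the concern about comparing mean values concrete, the following is a minimal sketch of one possible paired comparison of per-fold scores. It is not necessarily the procedure used in the tutorial. The function name `paired_sign_flip_test` and the AUC values are hypothetical, chosen only for illustration. The test asks how often a mean difference at least as large as the observed one would arise if the sign of each per-fold difference were arbitrary. Because cross-validation folds share training data, the resulting p-value should be read as approximate.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_test(scores_a, scores_b):
    """Two-sided exact sign-flip (paired permutation) test.

    Takes per-fold scores for two methods evaluated on the same folds and
    returns (mean difference, p-value).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both methods must be scored on the same folds")

    # Pair the scores fold by fold; only the differences matter.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))

    n_extreme = 0
    n_total = 0
    # Under the null hypothesis the sign of each difference is arbitrary,
    # so enumerate all 2**n sign assignments.
    for signs in product((1, -1), repeat=len(diffs)):
        permuted = abs(mean(s * d for s, d in zip(signs, diffs)))
        n_total += 1
        # Small tolerance guards against floating-point ties.
        if permuted >= observed - 1e-12:
            n_extreme += 1

    return mean(diffs), n_extreme / n_total


# Hypothetical ROC AUC values from 10 cross-validation folds (made-up numbers).
auc_method_a = [0.81, 0.84, 0.79, 0.86, 0.82, 0.80, 0.85, 0.83, 0.78, 0.84]
auc_method_b = [0.80, 0.85, 0.78, 0.84, 0.83, 0.79, 0.83, 0.84, 0.79, 0.82]

mean_diff, p_value = paired_sign_flip_test(auc_method_a, auc_method_b)
print(f"mean AUC, method A: {mean(auc_method_a):.3f}")
print(f"mean AUC, method B: {mean(auc_method_b):.3f}")
print(f"mean paired difference: {mean_diff:.3f}, p-value: {p_value:.3f}")
```

Reporting the paired difference together with a p-value (or a confidence interval) shows whether a gap between two means is distinguishable from fold-to-fold noise, which the means alone cannot show.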
Data availability
All data is available in a GitHub repository (https://github.com/PatWalters/jcamd_model_comparison).
Code availability
All code is available in a GitHub repository (https://github.com/PatWalters/jcamd_model_comparison).
References
Walters WP, Barzilay R (2021) Critical assessment of AI in drug discovery. Expert Opin Drug Discov. https://doi.org/10.1080/17460441.2021.1915982
Elton DC, Boukouvalas Z, Fuge MD, Chung PW (2019) Deep learning for molecular design—a review of the state of the art. Mol Syst Des Eng 4:828–849
Bender A, Cortés-Ciriano I (2021) Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 1: ways to make an impact, and why we are not there yet. Drug Discov Today 26:511–524
Bender A, Cortés-Ciriano I (2021) Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 2: a discussion of chemical and biological data. Drug Discov Today. https://doi.org/10.1016/j.drudis.2020.11.037
Vamathevan J, Clark D, Czodrowski P et al (2019) Applications of machine learning in drug discovery and development. Nat Rev Drug Discov. https://doi.org/10.1038/s41573-019-0024-5
Nicholls A (2011) What do we know?: simple statistical techniques that help. Methods Mol Biol 672:531–581
Jain AN, Nicholls A (2008) Recommendations for evaluation of computational methods. J Comput Aided Mol Des 22:133–139
Nicholls A (2014) Confidence limits, error bars and method comparison in molecular modeling. Part 1: the calculation of confidence intervals. J Comput Aided Mol Des 28:887–918
Nicholls A (2016) Confidence limits, error bars and method comparison in molecular modeling. Part 2: comparing methods. J Comput Aided Mol Des 30:103–126
Jamieson C, Moir EM, Rankovic Z, Wishart G (2008) Strategy and tactics for hERG optimizations. Antitargets. Wiley, Hoboken, pp 423–455
Gaulton A, Bellis LJ, Bento AP et al (2011) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40:D1100–D1107
Bento AP, Gaulton A, Hersey A et al (2013) The ChEMBL bioactivity database: an update. Nucleic Acids Res 42:D1083–D1090
jcamd_model_comparison. Available at https://github.com/PatWalters/jcamd_model_comparison
Czodrowski P (2013) hERG me out. J Chem Inf Model 53:2240–2251
McKinney W (2017) Python for data analysis: data wrangling with pandas, NumPy, and IPython. O’Reilly Media, Incorporated, Sebastopol
Esposito C, Landrum GA, Schneider N et al (2021) GHOST: adjusting the decision threshold to handle imbalanced data in machine learning. J Chem Inf Model. https://doi.org/10.1021/acs.jcim.1c00160
Cáceres EL, Mew NC, Keiser MJ (2020) Adding stochastic negative examples into machine learning improves molecular bioactivity prediction. J Chem Inf Model. https://doi.org/10.1021/acs.jcim.0c00565
Lopez-Del Rio A, Picart-Armada S, Perera-Lluna A (2021) Balancing data on deep learning-based proteochemometric activity classification. J Chem Inf Model 61:1657–1669
Svetnik V, Liaw A, Tong C et al (2003) Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci 43:1947–1958
Sheridan RP, Liaw A, Tudor M (2021) Light gradient boosting machine as a regression method for quantitative structure-activity relationships. arXiv [q-bio.BM]
RDKit: open-source cheminformatics software. Available at https://github.com/rdkit/rdkit. Accessed 28 Feb 2021
Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742–754
Truchon J-F, Bayly CI (2007) Evaluating virtual screening methods: good and bad metrics for the “early recognition” problem. J Chem Inf Model 47:488–508
Nicholls A (2008) What do we know and when do we know it? J Comput Aided Mol Des 22:239–255
Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10:1895–1923
Mlxtend. Available at http://rasbt.github.io/mlxtend/
Additional information
For the special issue in honor of Gerry Maggiora.
About this article
Cite this article
Walters, W.P. Comparing classification models—a practical tutorial. J Comput Aided Mol Des 36, 381–389 (2022). https://doi.org/10.1007/s10822-021-00417-2