Skip to main content
Log in

A high quality, industrial data set for binding affinity prediction: performance comparison in different early drug discovery scenarios

  • Published:
Journal of Computer-Aided Molecular Design Aims and scope Submit manuscript

Abstract

We release a new, high quality data set of 1162 PDE10A inhibitors with experimentally determined binding affinities together with 77 PDE10A X-ray co-crystal structures from a Roche legacy project. This data set is used to compare the performance of different 2D- and 3D-machine learning (ML) as well as empirical scoring functions for predicting binding affinities with high throughput. We simulate use cases that are relevant in the lead optimization phase of early drug discovery. ML methods perform well at interpolation, but poorly in extrapolation scenarios—which are most relevant to a real-world application. Moreover, we find that investing into the docking workflow for binding pose generation using multi-template docking is rewarded with an improved scoring performance. A combination of 2D-ML and 3D scoring using a modified piecewise linear potential shows best overall performance, combining information on the protein environment with learning from existing SAR data.

Graphical abstract

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6

Similar content being viewed by others

Data availability

Code for the docking workflow is available at https://github.com/ccdc-opensource/science-paper-rf-machine-learned-scoring-2022.git.

Code availability

Code to calculate RF values is available at https://github.com/jasonccole/relative-interaction-frequencies. Docking poses are available at https://doi.org/10.6084/m9.figshare.20055014.

Abbreviations

ACNN:

Atomic convolutional neural network

LO:

Lead optimization

MCS:

Maximum common substructure

PDE10A:

Phosphodiesterase 10A

RF:

Ratio of frequencies

SAR:

Structure-activity relationship

tSNE:

T-distributed stochastic neighbor embedding

References

  1. Kuhn B, Fuchs JE, Reutlinger M et al (2011) Rationalizing tight ligand binding through cooperative interaction networks. J Chem Inf Model 51:3180–3198. https://doi.org/10.1021/ci200319e

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Chan L, Morris GM, Hutchison GR (2021) Understanding conformational entropy in small molecules. J Chem Theory Comput 17:2099–2106. https://doi.org/10.1021/acs.jctc.0c01213

    Article  CAS  PubMed  Google Scholar 

  3. Tosstorff A, Cole JC, Taylor R et al (2020) Identification of noncompetitive protein–ligand interactions for structural optimization. J Chem Inf Model 60:6595–6611. https://doi.org/10.1021/acs.jcim.0c00858

    Article  CAS  PubMed  Google Scholar 

  4. Bash PA, Singh UC, Langridge R, Kollman PA (1987) Free energy calculations by computer simulation. Science 236:564–568. https://doi.org/10.1126/science.3576184

    Article  CAS  PubMed  Google Scholar 

  5. Hochuli J, Helbling A, Skaist T et al (2018) Visualizing convolutional neural network protein-ligand scoring. J Mol Graph Model 84:96–108. https://doi.org/10.1016/j.jmgm.2018.06.005

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Brown BP, Mendenhall J, Geanes AR, Meiler J (2021) General purpose structure-based drug discovery neural network score functions with human-interpretable pharmacophore maps. J Chem Inf Model 61:603–620. https://doi.org/10.1021/acs.jcim.0c01001

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Gomes J, Ramsundar B, Feinberg EN, Pande VS (2017) Atomic convolutional networks for predicting protein-ligand binding affinity. Arxiv. https://doi.org/10.48550/arXiv.1703.10603

  8. Wang R, Fang X, Lu Y, Wang S (2004) The PDBbind database: collection of binding affinities for protein−ligand complexes with known three-dimensional structures. J Med Chem 47:2977–2980. https://doi.org/10.1021/jm030580l

    Article  CAS  PubMed  Google Scholar 

  9. Liu T, Lin Y, Wen X et al (2007) BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic Acids Res 35:D198–D201. https://doi.org/10.1093/nar/gkl999

    Article  CAS  PubMed  Google Scholar 

  10. Affinity Data PDB code 2W9H. http://www.bindingdb.org/jsp/dbsearch/PrimarySearch_pdbids.jsp?pdbids_submit=Search&pdbids=2W9H. Accessed 26 Jan 2022

  11. Mysinger MM, Carchia M, John JI, Shoichet BK (2012) Directory of Useful Decoys, Enhanced (DUD-E): better ligands and decoys for better benchmarking. J Med Chem 55:6582–6594. https://doi.org/10.1021/jm300687e

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Rohrer SG, Baumann K (2009) Maximum Unbiased Validation (MUV) data sets for virtual screening based on pubchem bioactivity data. J Chem Inf Model 49:169–184. https://doi.org/10.1021/ci8002649

    Article  CAS  PubMed  Google Scholar 

  13. Tran-Nguyen V-K, Jacquemard C, Rognan D (2020) LIT-PCBA: an unbiased data set for machine learning and virtual screening. J Chem Inf Model 60:4263–4273. https://doi.org/10.1021/acs.jcim.0c00155

    Article  CAS  PubMed  Google Scholar 

  14. Chappie TA, Helal CJ, Hou X (2012) Current landscape of phosphodiesterase 10A (PDE10A) inhibition. J Med Chem 55:7299–7331. https://doi.org/10.1021/jm3004976

    Article  CAS  PubMed  Google Scholar 

  15. Jones G, Willett P, Glen RC et al (1997) Development and validation of a genetic algorithm for flexible docking. J Mol Biol 267:727–748. https://doi.org/10.1006/jmbi.1996.0897

    Article  CAS  PubMed  Google Scholar 

  16. Korb O, Stützle T, Exner TE (2009) Empirical scoring functions for advanced protein−ligand docking with PLANTS. J Chem Inf Model 49:84–96. https://doi.org/10.1021/ci800298z

    Article  CAS  PubMed  Google Scholar 

  17. Tosstorff A, Cole JC, Bartelt R, Kuhn B (2021) Augmenting structure-based design with experimental protein-ligand interaction data: molecular recognition, interactive visualization, and rescoring. Chem Med Chem 16:3428–3438. https://doi.org/10.1002/cmdc.202100387

    Article  CAS  PubMed  Google Scholar 

  18. Feinberg EN, Sur D, Wu Z et al (2018) PotentialNet for molecular property prediction. ACS Central Sci 4:1520–1530. https://doi.org/10.1021/acscentsci.8b00507

    Article  CAS  Google Scholar 

  19. Xiong Z, Wang D, Liu X et al (2020) Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J Med Chem 63:8749–8760. https://doi.org/10.1021/acs.jmedchem.9b00959

    Article  CAS  PubMed  Google Scholar 

  20. Stumpfe D, Hu H, Bajorath J (2019) Evolving concept of activity cliffs. ACS Omega 4:14360–14368. https://doi.org/10.1021/acsomega.9b02221

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Thomas M, Smith RT, O’Boyle NM et al (2021) Comparison of structure- and ligand-based scoring functions for deep generative models: a GPCR case study. J Cheminformatics 13:39. https://doi.org/10.1186/s13321-021-00516-0

    Article  CAS  Google Scholar 

  22. Wang L, Chambers J, Abel R (2019) Biomolecular simulations, methods and protocols. Methods Mol Biol 2022:201–232. https://doi.org/10.1007/978-1-4939-9608-7_9

    Article  CAS  PubMed  Google Scholar 

  23. Yung-Chi C, Prusoff WH (1973) Relationship between the inhibition constant (KI) and the concentration of inhibitor which causes 50 per cent inhibition (IC ) of an enzymatic reaction. Biochem Pharmacol 22:3099–3108. https://doi.org/10.1016/0006-2952(73)90196-2

    Article  Google Scholar 

  24. Proasis. Desert Scientific Software, Sydney, Australia

  25. Landrum G RDKit. https://doi.org/10.5281/zenodo.5589557

  26. Morgan HL (1965) The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service. J Chem Doc 5:107–113. https://doi.org/10.1021/c160017a018

    Article  CAS  Google Scholar 

  27. Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830

    Google Scholar 

  28. Groom CR, Bruno IJ, Lightfoot MP, Ward SC (2016) The Cambridge Structural database. Acta Crystallogr Sect B Struct Sci Cryst Eng Mater 72:171–179. https://doi.org/10.1107/s2052520616003954

    Article  CAS  Google Scholar 

  29. Hawkins PCD, Skillman AG, Warren GL et al (2010) Conformer generation with OMEGA: algorithm and validation using high quality structures from the protein databank and cambridge structural database. J Chem Inf Model 50:572–584. https://doi.org/10.1021/ci100031x

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Cruciani G, Milletti F, Storchi L et al (2009) In silico pKa prediction and ADME profiling. Chem Biodivers 6:1812–1821. https://doi.org/10.1002/cbdv.200900153

    Article  CAS  PubMed  Google Scholar 

  31. Virtanen P, Gommers R, Oliphant TE et al (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17:261–272. https://doi.org/10.1038/s41592-019-0686-2

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. O’Boyle NM, Brewerton SC, Taylor R (2008) Using buriedness to improve discrimination between actives and inactives in docking. J Chem Inf Model 48:1269–1278. https://doi.org/10.1021/ci8000452

    Article  CAS  PubMed  Google Scholar 

  33. Eldridge MD, Murray CW, Auton TR et al (1997) Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes. J Comput Aid Mol Des 11:425–445. https://doi.org/10.1023/a:1007996124545

    Article  CAS  Google Scholar 

  34. Li M, Zhou J, Hu J et al (2021) DGL-lifesci: an open-source toolkit for deep learning on graphs in life science. ACS Omega 6:27233–27238. https://doi.org/10.1021/acsomega.1c04017

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Cleves AE, Johnson SR, Jain AN (2021) Synergy and complementarity between focused machine learning and physics-based simulation in affinity prediction. J Chem Inf Model 61:5948–5966. https://doi.org/10.1021/acs.jcim.1c01382

    Article  CAS  PubMed  Google Scholar 

  36. Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9:90–95. https://doi.org/10.1109/mcse.2007.55

    Article  Google Scholar 

  37. Waskom ML (2021) Seaborn: statistical data visualization. J Open Source Softw 6:3021. https://doi.org/10.21105/joss.03021

    Article  Google Scholar 

  38. Schrödinger L The PyMOL Molecular Graphics System, Version~1.8

  39. Kabsch W (2010) XDS. Acta Crystallogr Sect D Biol Crystallogr 66:125–132. https://doi.org/10.1107/s0907444909047337

    Article  CAS  Google Scholar 

  40. McCoy AJ, Grosse-Kunstleve RW, Adams PD et al (2007) Phaser crystallographic software. J Appl Crystallogr 40:658–674. https://doi.org/10.1107/s0021889807021206

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  41. Emsley P, Lohkamp B, Scott WG, Cowtan K (2010) Features and development of coot. Acta Crystallogr Sect D Biol Crystallogr 66:486–501. https://doi.org/10.1107/s0907444910007493

    Article  CAS  Google Scholar 

  42. Murshudov GN, Skubák P, Lebedev AA et al (2011) REFMAC5 for the refinement of macromolecular crystal structures. Acta Crystallogr Sect D Biol Crystallogr 67:355–367. https://doi.org/10.1107/s0907444911001314

    Article  CAS  Google Scholar 

Download references

Acknowledgements

We thank Robin Taylor for valuable feedback and review of the manuscript. Fabian Dey is acknowledged for providing code to align molecules. We also thank Jörg Benz and Catherine Joseph for PDE10A crystallization. Finally, the data set would not exist without the great synthetic efforts of Konrad Bleicher, Luca Gobbi, Katrin Groebke-Zbinden, Matthias Koerner, Christian Lerner, Jens-Uwe Peters, Rosa Maria Rodriguez-Sarmiento, and associated lab members.

Funding

No funding was received for conducting this study.

Author information

Authors and Affiliations

Authors

Contributions

AT and BK: designed the study and wrote the manuscript. JC, CK, MiR: provided input to the design of the study and reviewed the manuscript. MaR: determined all crystal structures of the study and reviewed the manuscript. AF: synthesized compounds and supervised the chemistry program. HS and AN: generated and analysed binding data. AT, BK, JC: wrote code used for docking, scoring and data analysis.

Corresponding author

Correspondence to Andreas Tosstorff.

Ethics declarations

Conflict of interest

JC works for the Cambridge Crystallographic Data Centre which develops the GOLD docking software.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 1540 KB)

Supplementary file2 (CSV 188 KB)

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Tosstorff, A., Rudolph, M.G., Cole, J.C. et al. A high quality, industrial data set for binding affinity prediction: performance comparison in different early drug discovery scenarios. J Comput Aided Mol Des 36, 753–765 (2022). https://doi.org/10.1007/s10822-022-00478-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10822-022-00478-x

Keywords

Navigation