Ligand-based approaches to activity prediction for the early stage of structure–activity–relationship progression

Journal of Computer-Aided Molecular Design

Abstract

The retrospective evaluation of virtual screening approaches and activity prediction models is important for methodological development. However, for fair comparison, evaluation data sets must be carefully prepared. In this research, we compiled structure–activity–relationship matrix-based data sets for 15 biological targets along with many diverse inactive compounds, assuming the early stage of structure–activity–relationship progression. To use a large number of diverse inactive compounds and a limited number of active compounds, similarity profiles (SPs) are proposed as a set of molecular descriptors. Using these highly imbalanced data sets, we evaluated various approaches including SPs, under-sampling, support vector machine (SVM), and message passing neural networks. We found that for the under-sampling approaches, cluster-based sampling was better than random sampling. For virtual screening, SPs with inactive reference compounds and the under-sampling SVM also performed well. For classification, SPs with many inactive references performed as well as the under-sampling SVM trained on a balanced data set. Although the performance of SPs and that of the under-sampling SVM were comparable, SPs with many inactive references were preferable for selecting compounds structurally distinct from the active training compounds.
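
As a rough illustration of the descriptor idea named in the abstract, the sketch below builds a similarity profile as the vector of Tanimoto similarities between a query fingerprint and a set of active and inactive reference fingerprints, and balances the classes by random under-sampling of the inactives. The abstract does not define SPs in detail, so this reading, the toy 8-bit fingerprints, and the function names `tanimoto`, `similarity_profile`, and `random_under_sample` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto coefficient between two binary fingerprint vectors."""
    a = a.astype(bool)
    b = b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention for two empty fingerprints (assumption)
    return float(np.logical_and(a, b).sum() / union)


def similarity_profile(query: np.ndarray, references: np.ndarray) -> np.ndarray:
    """One similarity value per reference compound (assumed SP definition)."""
    return np.array([tanimoto(query, ref) for ref in references])


def random_under_sample(x: np.ndarray, n_keep: int, seed: int = 0) -> np.ndarray:
    """Keep n_keep rows of x, chosen at random without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x), size=n_keep, replace=False)
    return x[idx]


# Toy data: 2 active and 3 inactive reference fingerprints (hypothetical).
actives = np.array([[1, 1, 0, 0, 1, 0, 0, 0],
                    [1, 1, 1, 0, 0, 0, 0, 0]])
inactives = np.array([[0, 0, 0, 1, 0, 1, 1, 0],
                      [0, 0, 1, 1, 0, 0, 1, 1],
                      [0, 0, 0, 0, 0, 1, 1, 1]])

references = np.vstack([actives, inactives])
query = np.array([1, 1, 0, 0, 0, 0, 0, 0])

# Descriptor vector with one entry per reference compound.
sp = similarity_profile(query, references)

# Balanced inactive subset for an under-sampling classifier.
balanced_inactives = random_under_sample(inactives, n_keep=len(actives))
```

A profile built this way has one entry per reference compound, so adding more inactive references lengthens the descriptor without requiring more active compounds.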

Data availability

All data sets used in this study are available in an open-access deposition on the ZENODO platform [33].

References

  1. Stumpfe D, Bajorath J (2020) Current trends, overlooked issues, and unmet challenges in virtual screening. J Chem Inf Model 60:4112–4115. https://doi.org/10.1021/acs.jcim.9b01101

  2. Škuta C, Cortés-Ciriano I, Dehaen W et al (2020) QSAR-derived affinity fingerprints (part 1): fingerprint construction and modeling performance for similarity searching, bioactivity classification and scaffold hopping. J Cheminform 12:1–16. https://doi.org/10.1186/s13321-020-00443-6

  3. Wassermann AM, Heikamp K, Bajorath J (2011) Potency-directed similarity searching using support vector machines. Chem Biol Drug Des 77:30–38. https://doi.org/10.1111/j.1747-0285.2010.01059.x

  4. Jing Y, Bian Y, Hu Z et al (2018) Correction to: deep learning for drug design: an artificial intelligence paradigm for drug discovery in the big data era. AAPS J 20:1–1. https://doi.org/10.1208/s12248-018-0243-4

  5. Sakai M, Nagayasu K, Shibui N et al (2021) Prediction of pharmacological activities from chemical structures with graph convolutional neural networks. Sci Rep 11:525. https://doi.org/10.1038/s41598-020-80113-7

  6. Li X, Fourches D (2020) Inductive transfer learning for molecular activity prediction: next-Gen QSAR Models with MolPMoFiT. J Cheminform 12:1–15. https://doi.org/10.1186/s13321-020-00430-x

  7. Tsou LK, Yeh SH, Ueng SH et al (2020) Comparative study between deep learning and QSAR classifications for TNBC inhibitors and novel GPCR agonist discovery. Sci Rep 10:1–11. https://doi.org/10.1038/s41598-020-73681-1

  8. Yonchev D, Vogt M, Bajorath J (2020) From SAR diagnostics to compound design: development chronology of the compound optimization monitor (COMO) method. Mol Inform 39:2000046. https://doi.org/10.1002/minf.202000046

  9. Kunimoto R, Miyao T, Bajorath J (2018) Computational method for estimating progression saturation of analog series. RSC Adv 8:5484–5492. https://doi.org/10.1039/c7ra13748f

  10. Lipinski CA (2010) Overview of hit to lead: the medicinal chemist’s role from HTS retest to lead optimization hand off. In: Hayward MM (ed) Lead-seeking approaches. Springer, New York, pp 1–24

  11. Hawkins PCD, Skillman AG, Nicholls A (2007) Comparison of shape-matching and docking as virtual screening tools. J Med Chem 50:74–82. https://doi.org/10.1021/jm0603365

  12. Sato T, Yuki H, Takaya D et al (2012) Application of support vector machine to three-dimensional shape-based virtual screening using comprehensive three-dimensional molecular shape overlay with known inhibitors. J Chem Inf Model 52:1015–1026. https://doi.org/10.1021/ci200562p

  13. Sato A, Miyao T, Jasial S, Funatsu K (2021) Comparing predictive ability of QSAR/QSPR models using 2D and 3D molecular representations. J Comput Aided Mol Des 35:179–193. https://doi.org/10.1007/s10822-020-00361-7

  14. Wassermann AM, Haebel P, Weskamp N, Bajorath J (2012) SAR matrices: automated extraction of information-rich SAR tables from large compound data sets. J Chem Inf Model 52:1769–1776. https://doi.org/10.1021/ci300206e

  15. Mendez D, Gaulton A, Bento AP et al (2019) ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res 47:D930–D940. https://doi.org/10.1093/nar/gky1075

  16. Kim S, Chen J, Cheng T et al (2021) PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res 49:D1388–D1395. https://doi.org/10.1093/nar/gkaa971

  17. Kenny PW, Sadowski J (2005) Structure modification in chemical databases. In: Oprea TI (ed) Chemoinformatics in drug discovery. Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, pp 271–285

  18. MolProp TK, version 2.5.4; OpenEye Scientific Software Inc, Santa Fe

  19. Wawer M, Bajorath J (2011) Local structural changes, global data views: graphical substructure-activity relationship trailing. J Med Chem 54:2944–2951. https://doi.org/10.1021/jm200026b

  20. Matsumoto K, Miyao T, Funatsu K (2021) Ranking-oriented quantitative structure-activity relationship modeling combined with assay-wise data integration. ACS Omega 6:11964–11973. https://doi.org/10.1021/acsomega.1c00463

  21. Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742–754. https://doi.org/10.1021/ci100050t

  22. Jones E, Oliphant T, Peterson P (2021) SciPy: Open source scientific tools for python. https://www.scipy.org. Accessed 31 Oct 2021

  23. Vapnik VN (2000) The nature of statistical learning theory. Springer-Verlag, New York

  24. Gilmer J, Schoenholz SS, Riley PF et al (2017) Neural message passing for quantum chemistry. In: 34th International Conference on Machine Learning. PMLR, pp 2053–2070

  25. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory. ACM, pp 144–152

  26. Ralaivola L, Swamidass SJ, Saigo H, Baldi P (2005) Graph kernels for chemical informatics. Neural Netw 18:1093–1110. https://doi.org/10.1016/j.neunet.2005.07.009

  27. Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830

  28. OEChem TK, version 3.0.0; OpenEye Scientific Software Inc, Santa Fe

  29. Tang B, Kramer ST, Fang M et al (2020) A self-attention based message passing neural network for predicting molecular lipophilicity and aqueous solubility. J Cheminform 12:1–9. https://doi.org/10.1186/s13321-020-0414-z

  30. Paszke A, Gross S, Chintala S et al (2017) Automatic differentiation in pytorch. In: 31st Conference on Neural Information Processing Systems

  31. Akiba T, Sano S, Yanase T et al (2019) Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, pp 2623–2631

  32. Park HS, Jun CH (2009) A simple and fast algorithm for K-medoids clustering. Expert Syst Appl 36:3336–3341. https://doi.org/10.1016/j.eswa.2008.01.039

  33. Maeda I, Sato A, Tamura S, Miyao T, Compound activity data sets for 15 biological targets compiled from the ChEMBL and PubChem databases. https://doi.org/10.5281/zenodo.5748597

Acknowledgements

We thank Dr. Ryo Kunimoto at Daiichi Sankyo Company, Limited, for helping us with data set preparation. This work was financially supported by the Grant-in-Aid for Transformative Research Areas (A) 21A204 Digitalization-driven Transformative Organic Synthesis (Digi-TOS) from the Ministry of Education, Culture, Sports, Science & Technology, Japan.

Author information

Corresponding author

Correspondence to Tomoyuki Miyao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 5623 KB)

About this article

Cite this article

Maeda, I., Sato, A., Tamura, S. et al. Ligand-based approaches to activity prediction for the early stage of structure–activity–relationship progression. J Comput Aided Mol Des 36, 237–252 (2022). https://doi.org/10.1007/s10822-022-00449-2

  • DOI: https://doi.org/10.1007/s10822-022-00449-2
