Abstract
The retrospective evaluation of virtual screening approaches and activity prediction models is important for methodological development. However, for fair comparison, evaluation data sets must be carefully prepared. In this research, we compiled structure–activity–relationship matrix-based data sets for 15 biological targets along with many diverse inactive compounds, assuming the early stage of structure–activity–relationship progression. To make use of a large number of diverse inactive compounds and a limited number of active compounds, similarity profiles (SPs) are proposed as a set of molecular descriptors. Using these highly imbalanced data sets, we evaluated various approaches including SPs, under-sampling, support vector machines (SVMs), and message passing neural networks. We found that for the under-sampling approaches, cluster-based sampling is better than random sampling. For virtual screening, SPs with inactive reference compounds and the under-sampling SVM also performed well. For classification, SPs with many inactive references performed as well as the under-sampling SVM trained on a balanced data set. Although the performance of SPs and that of the under-sampling SVM were comparable, SPs with many inactive references were preferable for selecting compounds structurally distinct from the active training compounds.
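The abstract does not spell out how a similarity profile is built, so the following is only a minimal sketch under stated assumptions: a profile is taken to be the vector of Tanimoto similarities between a query compound and a fixed list of reference compounds (active and inactive), and each compound is represented as a set of "on" fingerprint bit indices. The function names, the toy fingerprints, and the ordering of references are hypothetical and are not taken from the article.

```python
from typing import FrozenSet, List, Sequence

# Assumption: a fingerprint is the set of indices of its "on" bits.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_profile(
    query: Fingerprint, references: Sequence[Fingerprint]
) -> List[float]:
    """One descriptor per reference compound: similarity of query to it."""
    return [tanimoto(query, ref) for ref in references]


# Toy data (hypothetical): two active and two inactive reference compounds.
active_refs = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5})]
inactive_refs = [frozenset({7, 8, 9}), frozenset({1, 8, 10, 11})]
references = active_refs + inactive_refs

query = frozenset({1, 2, 3})
profile = similarity_profile(query, references)
print(profile)
```

Under these assumptions, adding more inactive references simply lengthens the descriptor vector, which is one way a large pool of inactive compounds could be used without training a classifier directly on a highly imbalanced set.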
Data availability
All data sets used in this study are available in an open-access deposition on the ZENODO platform [33].
References
Stumpfe D, Bajorath J (2020) Current trends, overlooked issues, and unmet challenges in virtual screening. J Chem Inf Model 60:4112–4115. https://doi.org/10.1021/acs.jcim.9b01101
Škuta C, Cortés-Ciriano I, Dehaen W et al (2020) QSAR-derived affinity fingerprints (part 1): fingerprint construction and modeling performance for similarity searching, bioactivity classification and scaffold hopping. J Cheminform 12:1–16. https://doi.org/10.1186/s13321-020-00443-6
Wassermann AM, Heikamp K, Bajorath J (2011) Potency-directed similarity searching using support vector machines. Chem Biol Drug Des 77:30–38. https://doi.org/10.1111/j.1747-0285.2010.01059.x
Jing Y, Bian Y, Hu Z et al (2018) Correction to: deep learning for drug design: an artificial intelligence paradigm for drug discovery in the big data era. AAPS J 20:1–1. https://doi.org/10.1208/s12248-018-0243-4
Sakai M, Nagayasu K, Shibui N et al (2021) Prediction of pharmacological activities from chemical structures with graph convolutional neural networks. Sci Rep 11:525. https://doi.org/10.1038/s41598-020-80113-7
Li X, Fourches D (2020) Inductive transfer learning for molecular activity prediction: next-gen QSAR models with MolPMoFiT. J Cheminform 12:1–15. https://doi.org/10.1186/s13321-020-00430-x
Tsou LK, Yeh SH, Ueng SH et al (2020) Comparative study between deep learning and QSAR classifications for TNBC inhibitors and novel GPCR agonist discovery. Sci Rep 10:1–11. https://doi.org/10.1038/s41598-020-73681-1
Yonchev D, Vogt M, Bajorath J (2020) From SAR diagnostics to compound design: development chronology of the compound optimization monitor (COMO) method. Mol Inform 39:2000046. https://doi.org/10.1002/minf.202000046
Kunimoto R, Miyao T, Bajorath J (2018) Computational method for estimating progression saturation of analog series. RSC Adv 8:5484–5492. https://doi.org/10.1039/c7ra13748f
Lipinski CA (2010) Overview of hit to lead: the medicinal chemist’s role from HTS retest to lead optimization hand off. In: Hayward MM (ed) Lead-seeking approaches. Springer, New York, pp 1–24
Hawkins PCD, Skillman AG, Nicholls A (2007) Comparison of shape-matching and docking as virtual screening tools. J Med Chem 50:74–82. https://doi.org/10.1021/jm0603365
Sato T, Yuki H, Takaya D et al (2012) Application of support vector machine to three-dimensional shape-based virtual screening using comprehensive three-dimensional molecular shape overlay with known inhibitors. J Chem Inf Model 52:1015–1026. https://doi.org/10.1021/ci200562p
Sato A, Miyao T, Jasial S, Funatsu K (2021) Comparing predictive ability of QSAR/QSPR models using 2D and 3D molecular representations. J Comput Aided Mol Des 35:179–193. https://doi.org/10.1007/s10822-020-00361-7
Wassermann AM, Haebel P, Weskamp N, Bajorath J (2012) SAR matrices: automated extraction of information-rich SAR tables from large compound data sets. J Chem Inf Model 52:1769–1776. https://doi.org/10.1021/ci300206e
Mendez D, Gaulton A, Bento AP et al (2019) ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res 47:D930–D940. https://doi.org/10.1093/nar/gky1075
Kim S, Chen J, Cheng T et al (2021) PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res 49:D1388–D1395. https://doi.org/10.1093/nar/gkaa971
Kenny PW, Sadowski J (2005) Structure modification in chemical databases. In: Oprea TI (ed) Chemoinformatics in drug discovery. Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, pp 271–285
MolProp TK, version 2.5.4; OpenEye Scientific Software Inc, Santa Fe
Wawer M, Bajorath J (2011) Local structural changes, global data views: graphical substructure-activity relationship trailing. J Med Chem 54:2944–2951. https://doi.org/10.1021/jm200026b
Matsumoto K, Miyao T, Funatsu K (2021) Ranking-oriented quantitative structure-activity relationship modeling combined with assay-wise data integration. ACS Omega 6:11964–11973. https://doi.org/10.1021/acsomega.1c00463
Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742–754. https://doi.org/10.1021/ci100050t
Jones E, Oliphant T, Peterson P (2021) SciPy: open source scientific tools for Python. https://www.scipy.org. Accessed 31 Oct 2021
Vapnik VN (2000) The nature of statistical learning theory. Springer-Verlag, New York
Gilmer J, Schoenholz SS, Riley PF et al (2017) Neural message passing for quantum chemistry. In: 34th International Conference on Machine Learning. PMLR, pp 2053–2070
Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory. ACM, pp 144–152
Ralaivola L, Swamidass SJ, Saigo H, Baldi P (2005) Graph kernels for chemical informatics. Neural Netw 18:1093–1110. https://doi.org/10.1016/j.neunet.2005.07.009
Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
OEChem TK, version 3.0.0; OpenEye Scientific Software Inc, Santa Fe
Tang B, Kramer ST, Fang M et al (2020) A self-attention based message passing neural network for predicting molecular lipophilicity and aqueous solubility. J Cheminform 12:1–9. https://doi.org/10.1186/s13321-020-0414-z
Paszke A, Gross S, Chintala S et al (2017) Automatic differentiation in PyTorch. In: 31st Conference on Neural Information Processing Systems
Akiba T, Sano S, Yanase T et al (2019) Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, pp 2623–2631
Park HS, Jun CH (2009) A simple and fast algorithm for K-medoids clustering. Expert Syst Appl 36:3336–3341. https://doi.org/10.1016/j.eswa.2008.01.039
Maeda I, Sato A, Tamura S, Miyao T, Compound activity data sets for 15 biological targets compiled from the ChEMBL and PubChem databases. https://doi.org/10.5281/zenodo.5748597
Acknowledgements
We thank Dr. Ryo Kunimoto at Daiichi Sankyo Company, Limited, for helping us with data set preparation. This work was financially supported by the Grant-in-Aid for Transformative Research Areas (A) 21A204 Digitalization-driven Transformative Organic Synthesis (Digi-TOS) from the Ministry of Education, Culture, Sports, Science & Technology, Japan.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Maeda, I., Sato, A., Tamura, S. et al. Ligand-based approaches to activity prediction for the early stage of structure–activity–relationship progression. J Comput Aided Mol Des 36, 237–252 (2022). https://doi.org/10.1007/s10822-022-00449-2
DOI: https://doi.org/10.1007/s10822-022-00449-2