Abstract
Effective representation of a molecule is required to develop useful quantitative structure–property relationships (QSPRs) for accurate prediction of chemical properties. The octanol–water partition coefficient (logP), a measure of lipophilicity, is an important property for pharmacological and toxicological endpoints used in the pharmaceutical and regulatory spheres. We compare physicochemical descriptors, structural keys, and circular fingerprints in their ability to represent a chemical space and to characterise molecular features that correlate with lipophilicity. Exploratory landscape continuity analyses revealed that whole-molecule physicochemical descriptors could map together compounds that were similar in both molecular features and logP, indicating higher potential for use in logP QSPRs than the substructural approaches of structural keys and circular fingerprints. Indeed, logP QSPR models parameterised by physicochemical descriptors consistently gave the lowest error. Our best-performing model was a multilinear regression optimised by stochastic gradient descent with 1438 descriptors, which returned an internal benchmark RMSE of 1.03 log units. This corroborates the well-established notion that lipophilicity is an additive, whole-molecule property. We externally tested the model by participating in the 2019 SAMPL6 logP Prediction Challenge, blindly predicting logP for 11 protein kinase inhibitor fragment-like molecules. Our model returned an RMSE of 0.49 log units, placing eighth overall and third in the empirical methods category (submission ID ‘hdpuj’). Permutation feature importance analyses revealed that physicochemical descriptors could characterise predictive molecular features highly relevant to the kinase inhibitor fragment-like molecules.
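The modelling ideas named in the abstract (a multilinear regression fitted by stochastic gradient descent, scored by RMSE, and inspected with permutation feature importance) can be illustrated with a minimal pure-Python sketch. This is not the authors' pipeline: the data are synthetic, the three "descriptors" and their weights are invented stand-ins for the 1438 real descriptors, and the function names `fit_sgd_linear`, `predict`, `rmse`, and `permutation_importance` are hypothetical.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as y (log units for logP)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def predict(X, w, b):
    """Multilinear prediction: weighted sum of descriptors plus an intercept."""
    return [sum(wj * xj for wj, xj in zip(w, row)) + b for row in X]


def fit_sgd_linear(X, y, lr=0.01, epochs=200, seed=0):
    """Fit y ~ w.x + b by per-sample stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            err = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b


def permutation_importance(X, y, w, b, col, seed=0):
    """Increase in RMSE after shuffling one descriptor column."""
    rng = random.Random(seed)
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    return rmse(y, predict(X_perm, w, b)) - rmse(y, predict(X, w, b))


# Synthetic stand-in data (assumption): three standardised descriptors that
# contribute additively to a made-up "logP", plus a little noise.
rng = random.Random(42)
true_w = [0.8, -0.5, 0.3]
X = [[rng.gauss(0, 1) for _ in true_w] for _ in range(200)]
y = [sum(wj * xj for wj, xj in zip(true_w, row)) + 2.0 + rng.gauss(0, 0.1) for row in X]

# Hold out the last 50 samples as a stand-in for an external test set.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

w, b = fit_sgd_linear(X_train, y_train)
test_rmse = rmse(y_test, predict(X_test, w, b))
importances = [permutation_importance(X_test, y_test, w, b, col) for col in range(len(w))]

print(f"held-out RMSE: {test_rmse:.2f} log units")
print("permutation importances:", [round(v, 2) for v in importances])
```

In this toy setting the descriptor with the largest true weight should show the largest RMSE increase when shuffled, which is the sense in which permutation importance ranks features by their contribution to predictive accuracy.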





Acknowledgements
We thank the National Institutes of Health (Grant No. R01-GM124270) for their support in funding the SAMPL6 Challenges and associated experimental work.
Cite this article
Lui, R., Guan, D. & Matthews, S. A comparison of molecular representations for lipophilicity quantitative structure–property relationships with results from the SAMPL6 logP Prediction Challenge. J Comput Aided Mol Des 34, 523–534 (2020). https://doi.org/10.1007/s10822-020-00279-0
DOI: https://doi.org/10.1007/s10822-020-00279-0