Summary
We examined “descriptor collision” for several chemical fingerprint systems (MDL 320, Daylight, SMDL) and for a 2D-based descriptor set. For large databases (ChemNavigator and WOMBAT), the smallest collision rate remains around 5%. We then systematically increased the collision rate (here termed “descriptor confusion”) in order to design a set of “descriptors to mask chemical structures” (DMCS). An effective DMCS system would not allow third parties to determine (i.e., reverse engineer) the original chemical structures used to derive the DMCS set. Using SMDL keys, the confusion rate increased to 45.6% when keys with a low frequency of occurrence in WOMBAT structures were eliminated. To validate the biological relevance of the SMDL descriptors as a potential DMCS set, we applied an automated PLS engine, WB-PLS [Olah et al., J. Comput. Aided Mol. Des., 18 (2004) 437], to 1277 series of structures from 948 targets in WOMBAT. The “reduced set” of SMDL descriptors shows a small loss of modeling power (around 20%) compared to the initial descriptor set, while the collision rate is significantly increased. These results indicate that the development of an effective DMCS is possible. If well documented, DMCS systems would encourage private sector data release (e.g., related to water solubility) and directly benefit public sector science.
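The idea of raising the collision rate by eliminating rare keys can be illustrated with a minimal sketch. It rests on assumptions that the summary does not state: the collision rate is taken here to be the fraction of structures whose key vector is identical to that of at least one other structure, and the function names, the frequency threshold, and the toy key vectors are all hypothetical. It is not the SMDL key set or the procedure actually used in the study.

```python
from collections import Counter


def collision_rate(fingerprints):
    """Fraction of structures whose fingerprint is shared with at least
    one other structure (assumed definition, for illustration only)."""
    counts = Counter(tuple(fp) for fp in fingerprints)
    collided = sum(n for n in counts.values() if n > 1)
    return collided / len(fingerprints)


def drop_rare_keys(fingerprints, min_fraction):
    """Remove key positions that are set in fewer than `min_fraction`
    of the structures; return the reduced fingerprints and kept indices."""
    n_structures = len(fingerprints)
    n_keys = len(fingerprints[0])
    frequency = [
        sum(fp[k] for fp in fingerprints) / n_structures for k in range(n_keys)
    ]
    kept = [k for k in range(n_keys) if frequency[k] >= min_fraction]
    reduced = [[fp[k] for k in kept] for fp in fingerprints]
    return reduced, kept


# Hypothetical toy data: 5 structures described by 4 binary keys.
toy = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
]

before = collision_rate(toy)               # two of five structures collide
reduced, kept = drop_rare_keys(toy, 0.5)   # the rare last key is removed
after = collision_rate(reduced)            # four of five structures collide
print(before, kept, after)
```

In this toy case, removing the single low-frequency key merges two previously distinct key vectors, so the collision rate rises. That is the direction of the effect described in the summary, though the magnitudes here carry no meaning.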
Abbreviations
- CMR: calculated molecular refractivity
- ClogP: program produced by BioByte Corp., Claremont, CA
- Daylight/DY: Daylight Chemical Information Systems
- DMCS: descriptors to mask chemical structures
- DMSO: dimethylsulfoxide
- DPISMR: the NIH Small Molecule Repository as organized by DPI
- LogP: the logarithm of the octanol-water partition coefficient
- LogSw: the logarithm of the (molar) aqueous solubility
- MACCS: Molecular ACCess System, an MDL product
- MDL: Molecular Design Limited
- MLI: Molecular Libraries and Imaging initiative
- NIH: National Institutes of Health
- PLS: Partial Least Squares/Projection to Latent Structures
- QSAR: quantitative structure–activity relationships
- SMDL: Sunset Molecular Discovery, LLC
- SMILES: Simplified Molecular Input Line Entry Specification
- WOMBAT/WB: WOrld of Molecular BioAcTivity database
References
Austin, C.P., Brady, L.S., Insel, T.R. and Collins, F.S., Science, 306 (2004) 1138. Last access on 21.10.05
The PubChem database is available online at the National Center for Biotechnology Information, http://pubchem.ncbi.nlm.nih.gov/ Last access on 21.10.05
Hahn, M.M. and Green, R., Curr. Opin. Chem. Biol., 3 (1999) 379.
Filimonov, D. and Poroikov, V., J. Comput. Aided Mol. Des., 19 (2005) in press
Weber, L., Curr. Opin. Chem. Biol., 2 (1998) 381.
The iResearch Library™ is available from ChemNavigator, Inc., http://chemnavigator.com/cnc/products/IRL.asp Last access on 21.10.05
The Crossfire Beilstein database is available from Elsevier MDL, http://www.mdl.com/products/knowledge/crossfire_beilstein/index.jsp Last access on 21.10.05
Tetko, I.V., Abagyan, R. and Oprea, T.I., J. Comput. Aided Mol. Des., 19 (2005) in press
Faulon, J.L., Brown, W.M. and Martin, S., J. Comput. Aided Mol. Des., 19 (2005) in press
Olah, M., Mracec, M., Ostopovici, L., Rad, R., Bora, A., Hadaruga, N., Olah, I., Banda, M., Simon, S., Mracec, M. and Oprea, T.I., In Oprea, T.I. (Ed), Chemoinformatics in Drug Discovery, Wiley-VCH, New York, 2005, pp. 223–239
WOMBAT is available from Sunset Molecular Discovery LLC, http://www.sunsetmolecular.com/ Last access on 21.10.05
Weininger, D., J. Chem. Inf. Comput. Sci., 28 (1988) 31.
Leo, A. and Weininger, D., CMR3. Daylight Chemical Information Systems, Santa Fe, New Mexico, http://www.daylight.com/, 1995
Leo, A., Chem. Rev., 93 (1993) 1281.
Leo, A. and Weininger, D., CLOGP 4.0. Daylight Chemical Information Systems, Santa Fe, New Mexico, http://www.daylight.com/, 2001
Ran, Y., Jain, N. and Yalkowsky, S.H., J. Chem. Inf. Comput. Sci., 41 (2001) 1208.
Livingstone, D.J., Ford, M.G., Huuskonen, J.J. and Salt, D.W., J. Comput. Aided Mol. Des., 15 (2001) 741.
Tetko, I.V., Tanchuk, V.Y. and Villa, A.E., J. Chem. Inf. Comput. Sci., 41 (2001) 1407.
Glen, R.C., J. Comput. Aided Mol. Des., 8 (1994) 457.
Gasteiger, J. and Marsili, M., Tetrahedron, 36 (1980) 3219.
Oprea, T.I., J. Comput. Aided Mol. Des., 14 (2000) 251.
Balaban, A.T., SAR QSAR Environ. Res., 8 (1998) 1.
Kier, L.B. and Hall, L.H. Molecular Connectivity in Structure-Activity Analysis. John Wiley, New York, 1986.
Basak, S.C., Balaban, A.T., Grunwald, G.D. and Gute, B.D., J. Chem. Inf. Comput. Sci., 40 (2000) 891.
Durant, J.L., Leland, B.A., Henry, D.R. and Nourse, J.G., J. Chem. Inf. Comput. Sci., 42 (2002) 1273.
MacCuish, J. and MacCuish, N., Measures software, Mesa Analytics and Computing LLC, Santa Fe, New Mexico, http://www.mesaac.com/ Last access on 21.10.05
Daylight fingerprints are available from Daylight Chemical Information Systems, http://www.daylight.com/ Last access on 21.10.05
Olah, M., Bologa, C. and Oprea, T.I., J. Comput. Aided Mol. Des., 18 (2004) 437.
Schneider, G., Neidhart, W., Giller, T. and Schmidt, G., Angew. Chem. Int. Ed., 38 (1999) 2894.
The SMARTS toolkit and SMARTS are available from Daylight Chemical Information Systems, Santa Fe, New Mexico, http://www.daylight.com/dayhtml/doc/theory.smarts.html; online SMARTS tutorial: http://www.daylight.com/dayhtml/doc/theory/smarts.html, 2005
SMACK and OEChem are available from OpenEye Scientific Software, Santa Fe, New Mexico, http://www.eyesopen.com/products/applications/smack.html, 2005
Wold, S., Johansson, E. and Cocchi, M., In Kubinyi, H., (Ed), 3D QSAR in Drug Design: Theory, Methods and Applications, ESCOM, Leiden, 1993, pp. 523–550
Kappler, M.A., Allu, T.K., Bologa, C. and Oprea, T.I., J. Chem. Inf. Model., 45 (2005) in preparation
Acknowledgments
We thank Jeremy (JJ) Yang from OpenEye Scientific Software (Santa Fe, NM) for advice on descriptor collision. This work was supported by New Mexico Tobacco Settlement Funds for Biocomputing (TKA, MO) and by the New Mexico Molecular Library Screening Center, NIH 1U54 MH074425-01 (CB, TIO).
Cite this article
Bologa, C., Allu, T.K., Olah, M. et al. Descriptor collision and confusion: Toward the design of descriptors to mask chemical structures. J Comput Aided Mol Des 19, 625–635 (2005). https://doi.org/10.1007/s10822-005-9020-4
DOI: https://doi.org/10.1007/s10822-005-9020-4