Abstract
Binary fingerprints are binary vectors that represent chemical molecules by recording the presence or absence of particular substructures, such as labeled paths in the 2D graph of bonds. Complete fingerprints are often reduced to a compressed format, typically of dimension n = 512 or n = 1024, by a simple congruence operation. The statistical properties of complete and compressed fingerprint representations are important because fingerprints are used to rapidly search large databases and to develop statistical machine learning methods in chemoinformatics. Here we present an empirical and mathematical analysis of the distribution of complete and compressed fingerprints. In particular, we derive formulas that provide good approximations for the expected number of bits set to one in a compressed fingerprint, given its uncompressed version, and vice versa.
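To make the compression step concrete, the following is a minimal sketch in Python. It assumes that the congruence operation is index folding modulo n, and that the 1-bits of the complete fingerprint land on compressed positions roughly uniformly at random. Under those assumptions a standard occupancy argument gives n(1 - (1 - 1/n)^A) for the expected number of compressed 1-bits when the complete fingerprint has A bits set. This is not necessarily the formula derived in the paper, and the function names are hypothetical.

```python
import math


def fold(on_bits, n):
    """Compress a fingerprint to length n.

    on_bits: iterable of indices of the 1-bits in the complete fingerprint.
    Assumption: compression maps index i to i mod n, and colliding bits are OR-ed.
    """
    return {i % n for i in on_bits}


def expected_compressed_bits(a, n):
    """Expected number of 1-bits after folding a fingerprint with `a` bits set.

    Assumption: each 1-bit falls on one of the n positions uniformly and
    independently, so a position stays 0 with probability (1 - 1/n) ** a.
    """
    return n * (1.0 - (1.0 - 1.0 / n) ** a)


def estimated_uncompressed_bits(c, n):
    """Invert the estimate above: given c compressed 1-bits, estimate a.

    Only defined for 0 <= c < n; a fully saturated fingerprint carries no
    information about the original bit count under this model.
    """
    if not 0 <= c < n:
        raise ValueError("c must satisfy 0 <= c < n")
    return math.log(1.0 - c / n) / math.log(1.0 - 1.0 / n)
```

For example, `fold({3, 515, 1024}, 512)` maps indices 3 and 515 to the same position, which illustrates why the compressed bit count falls below the complete one as fingerprints become denser.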
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Swamidass, S.J., Baldi, P. (2006). Statistical Distribution of Chemical Fingerprints. In: Bloch, I., Petrosino, A., Tettamanzi, A.G.B. (eds) Fuzzy Logic and Applications. WILF 2005. Lecture Notes in Computer Science, vol 3849. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11676935_2
DOI: https://doi.org/10.1007/11676935_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32529-1
Online ISBN: 978-3-540-32530-7
eBook Packages: Computer Science (R0)