Abstract
As the use of High-Throughput Screening (HTS) systems becomes more routine in the drug discovery process, there is an increasing need for fast and reliable analysis of the massive amounts of resulting biological data. At the forefront of the methods used for analyzing HTS data is cluster analysis. It is used in this context to find natural groups in the data, thereby revealing families of compounds that exhibit increased activity towards a specific biological target. Scientists in this area have traditionally used a number of clustering algorithms, distance (similarity) measures, and compound representation methods. We first discuss the nature of chemical and biological data and how it adversely impacts the current analysis methodology. We emphasize the inability of widely used methods to discover the chemical families in a pharmaceutical dataset and point out specific problems that occur when one attempts to apply these common clustering and other statistical methods to chemical data. We then introduce a new data-mining algorithm that employs a newly proposed clustering method and expert knowledge to accommodate user requests and produce chemically sensible results. This new, chemically aware algorithm employs molecular structure to find true chemical structural families of compounds in pharmaceutical data, while at the same time accommodating the multi-domain nature of chemical compounds.
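As a rough illustration of the kind of similarity measure and compound representation traditionally used in this setting, the sketch below computes the Tanimoto coefficient between binary structural fingerprints. It is not the algorithm introduced in this paper. The compound names and bit positions are hypothetical, and representing a fingerprint as a set of "on" bit positions is an assumption made for brevity.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two sets of 'on' bit positions."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


# Hypothetical fingerprints: each set lists the bit positions that are set.
fingerprints = {
    "cmpd_A": frozenset({1, 2, 3, 4}),
    "cmpd_B": frozenset({1, 2, 3, 5}),
    "cmpd_C": frozenset({1, 2, 6, 7}),
}

# Pairwise similarities, the usual input to a clustering algorithm.
similarities = {
    (x, y): tanimoto(fingerprints[x], fingerprints[y])
    for x, y in combinations(sorted(fingerprints), 2)
}

for pair, value in similarities.items():
    print(pair, round(value, 3))
```

In this toy example two of the three pairs receive exactly the same similarity value. Tied values of this kind can make the output of a conventional clustering method depend on how ties are broken.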
© 2003 Springer-Verlag Berlin Heidelberg
Nicolaou, C.A. (2003). Identification of Lead Compounds in Pharmaceutical Data Using Data Mining Techniques. In: Manolopoulos, Y., Evripidou, S., Kakas, A.C. (eds) Advances in Informatics. PCI 2001. Lecture Notes in Computer Science, vol 2563. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-38076-0_9
DOI: https://doi.org/10.1007/3-540-38076-0_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-07544-8
Online ISBN: 978-3-540-38076-4
eBook Packages: Springer Book Archive