Abstract
Distance functions are a fundamental ingredient of classification and clustering procedures, and this also holds for microarray data. In the general data mining and classification literature, functions such as Euclidean distance or Pearson correlation have become de facto standards thanks to a considerable amount of experimental validation. For microarray data, the question of which distance function “works best” has been investigated, but no final conclusion has been reached. The aim of this paper is to shed further light on that question. We present an experimental study involving several distances that assesses (a) their intrinsic separation ability and (b) their predictive power when used in conjunction with clustering algorithms. The experiments have been carried out on six benchmark microarray datasets, for each of which the “gold solution” is known. We have used both Hierarchical and K-means clustering algorithms, with external validation criteria as evaluation tools. From the methodological point of view, the main result of this study is a ranking of those measures in terms of their intrinsic and clustering abilities, which also highlights the correlations between the two. Pragmatically, the outcomes of the experiments indicate that Minkowski, cosine and Pearson correlation distances seem to be the best choices for microarray data analysis.
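As a rough illustration of the three distances the abstract singles out, the following sketch computes them for a pair of expression profiles. It is not the authors' implementation: the function names, the default order p = 2 for the Minkowski distance, and the toy profiles are all assumptions made for the example, and the definitions used are the common textbook ones (one minus the cosine similarity, one minus the Pearson correlation coefficient).

```python
import math


def minkowski_distance(x, y, p=2.0):
    """Minkowski distance of order p (p=2 is Euclidean, p=1 is Manhattan)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y.

    Raises ZeroDivisionError if either profile is the all-zero vector.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def pearson_distance(x, y):
    """One minus the Pearson correlation coefficient of x and y.

    Computed as the cosine distance of the mean-centered profiles,
    so it fails on constant profiles (zero variance).
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine_distance(centered_x, centered_y)


# Hypothetical expression profiles (illustrative values, not from the paper).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_a, twice the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]   # reversed trend

print(minkowski_distance(gene_a, gene_b))  # sensitive to magnitude
print(cosine_distance(gene_a, gene_b))     # close to 0: same direction
print(pearson_distance(gene_a, gene_c))    # close to 2: anti-correlated
```

The toy profiles show why the choice matters: gene_a and gene_b are far apart under the Minkowski distance but essentially identical under the cosine and Pearson distances, which ignore scale.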
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Giancarlo, R., Lo Bosco, G., Pinello, L. (2010). Distance Functions, Clustering Algorithms and Microarray Data Analysis. In: Blum, C., Battiti, R. (eds) Learning and Intelligent Optimization. LION 2010. Lecture Notes in Computer Science, vol 6073. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13800-3_10
DOI: https://doi.org/10.1007/978-3-642-13800-3_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13799-0
Online ISBN: 978-3-642-13800-3
eBook Packages: Computer Science, Computer Science (R0)