Abstract
Clustering is one of the best-known activities in scientific investigation and an object of research in many disciplines, ranging from Statistics to Computer Science. Following Handl et al., it can be summarized as a three-step process: (a) choice of a distance function; (b) choice of a clustering algorithm; (c) choice of a validation method. Although such a purist approach to clustering is rarely followed in many areas of science, genomic data require that level of attention if inferences drawn from cluster analysis are to be of any relevance to biomedical research. Unfortunately, the high dimensionality of the data and their noisy nature make cluster analysis of genomic data particularly difficult. This paper highlights new findings that seem to address a few relevant problems in each of the three steps, with regard both to the intrinsic predictive power of methods and algorithms and to their time performance. Including this latter aspect in the evaluation process is quite novel, since it is rarely considered in genomic data analysis.
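The three-step process can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the method evaluated in the paper: it picks a Pearson-correlation distance for step (a), a naive average-linkage hierarchical algorithm for step (b), and the Adjusted Rand index against a known reference partition for step (c). The toy expression profiles, the reference labels, and all function names are hypothetical.

```python
import math
from itertools import combinations


def pearson_distance(x, y):
    """Step (a): distance 1 - r, with r the Pearson correlation.

    Assumes neither profile is constant (otherwise r is undefined).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def average_linkage(data, k, dist):
    """Step (b): naive agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(data))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist(data[i], data[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the closest pair of clusters (a < b) and merge b into a.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(data)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def adjusted_rand_index(labels_a, labels_b):
    """Step (c): external validation via pair counting (Adjusted Rand index)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    same_a = [labels_a[i] == labels_a[j] for i, j in pairs]
    same_b = [labels_b[i] == labels_b[j] for i, j in pairs]
    both = sum(1 for sa, sb in zip(same_a, same_b) if sa and sb)
    in_a, in_b = sum(same_a), sum(same_b)
    expected = in_a * in_b / len(pairs)
    max_index = (in_a + in_b) / 2
    if max_index == expected:
        # Only happens when both partitions are identical and trivial.
        return 1.0
    return (both - expected) / (max_index - expected)


# Hypothetical toy data: three "rising" and three "falling" expression profiles.
data = [
    [1.0, 2.0, 3.0, 4.1],
    [2.0, 4.1, 6.0, 8.0],
    [0.5, 1.0, 1.4, 2.0],
    [4.0, 3.1, 2.0, 1.0],
    [8.0, 6.0, 4.2, 2.0],
    [2.0, 1.5, 1.1, 0.5],
]
reference = [0, 0, 0, 1, 1, 1]  # assumed gold-standard partition

labels = average_linkage(data, 2, pearson_distance)
ari = adjusted_rand_index(labels, reference)
print(labels, ari)
```

Each step is a separate function, so a different distance, algorithm, or validation index can be swapped in independently. The naive linkage recomputation shown here is far too slow for real genomic data, which is where the time-performance concerns raised in the abstract come in.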
References
Broad institute, http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=89
NCI 60 Cancer Microarray Project, http://genome-www.stanford.edu/NCI60
Stanford microarray database, http://genome-www5.stanford.edu/
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J.J., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., Levy, R., Wilson, W., Grever, M.R., Byrd, J.C., Botstein, D., Brown, P.O., Staudt, L.M.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511 (2000)
Ben-Hur, A., Elisseeff, A., Guyon, I.: A stability based method for discovering structure in clustering data. In: Seventh Pacific Symposium on Biocomputing, pp. 6–17. ISCB (2002)
Borodin, A., Ostrovsky, R., Rabani, Y.: Subquadratic approximation algorithms for clustering problems in high dimensional space. Machine Learning 56, 153–167 (2004)
Brunet, J.-P., Tamayo, P., Golub, T.R., Mesirov, J.P.: Metagenes and molecular pattern discovery using matrix factorization. Proc. of the National Academy of Sciences of the United States of America 101, 4164–4169 (2004)
Devarajan, K.: Nonnegative Matrix Factorization: An Analytical and Interpretive Tool in Computational Biology. PLoS Comput. Biol. 4, e1000029 (2008)
Deza, E., Deza, M.: Dictionary of distances. Elsevier, Amsterdam (2006)
D’haeseleer, P.: How does gene expression clustering work? Nature Biotechnology 23, 1499–1501 (2005)
Di Gesú, V., Giancarlo, R., Lo Bosco, G., Raimondi, A., Scaturro, D.: Genclust: A genetic algorithm for clustering gene expression data. BMC Bioinformatics 6, 289 (2005)
Dudoit, S., Fridlyand, J.: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3 (2002)
Fisher, D., Hoffman, P.: The Adjusted Rand Statistic: A SAS macro. Psychometrika 53, 417–423 (1988)
Frahling, G., Sohler, C.: A fast K-means implementation using coresets. In: Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, pp. 135–143. ACM, New York (2006)
Freyhult, E., Landfors, M., Önskog, J., Hvidsten, T.R., Rydén, P.: Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering. BMC Bioinformatics 11, 503 (2010)
Giancarlo, R., Lo Bosco, G., Pinello, L.: Distance Functions, Clustering Algorithms and Microarray Data Analysis. In: Blum, C., Battiti, R. (eds.) LION 4. LNCS, vol. 6073, pp. 125–138. Springer, Heidelberg (2010)
Giancarlo, R., Scaturro, D., Utro, F.: Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer. BMC Bioinformatics 9, 462 (2008)
Giancarlo, R., Scaturro, D., Utro, F.: Statistical Indices for Computational and Data Driven Class Discovery in Microarray Data. In: Biological Data Mining, pp. 295–335. CRC Press, Boca Raton (2009)
Giancarlo, R., Utro, F.: Speeding up the Consensus Clustering methodology for microarray data analysis. Algorithms for Molecular Biology 6, 1 (2011)
Handl, J., Knowles, J., Kell, D.B.: Computational cluster validation in post-genomic data analysis. Bioinformatics 21, 3201–3212 (2005)
Hartuv, E., Schmitt, A., Lange, J., Meier-Ewert, S., Lehrach, H., Shamir, R.: An algorithm for clustering of cDNAs for gene expression analysis using short oligonucleotide fingerprints. Genomics 66, 249–256 (2000)
Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982)
Jain, A.K., Murty, M.N., Flynn, P.J.: Data Clustering: a Review. ACM Computing Surveys 31, 264–323 (1999)
Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs (1988)
Klie, S., Nikoloski, Z., Selbig, J.: Biological cluster evaluation for gene function prediction. Journal of Computational Biology 17, 1–18 (2010)
Kraus, J., Kestler, H.: A highly efficient multi-core algorithm for clustering extremely large datasets. BMC Bioinformatics 11, 169 (2010)
Lee, D.D., Seung, H.S.: Learning the parts of objects by Non-negative Matrix Factorization. Nature 401, 788–791 (1999)
Mehta, T., Tanik, M., Allison, D.B.: Towards sound epistemological foundations of statistical methods for high-dimensional biology. Nature Genetics 36, 943–947 (2004)
Monti, S., Tamayo, P., Mesirov, J., Golub, T.: Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning 52, 91–118 (2003)
Priness, I., Maimon, O., Ben-Gal, I.: Evaluation of gene-expression clustering via mutual information distance measure. BMC Bioinformatics 8, 1–12 (2007)
Seal, S., Comarina, S., Aluru, S.: An optimal hierarchical clustering algorithm for gene expression data. Information Processing Letters 93, 143–147 (2004)
Speed, T.P.: Statistical analysis of gene expression microarray data. Chapman & Hall/CRC (2003)
Su, A.I., Cooke, M.P., Ching, K.A., Hakak, Y., Walker, J.R., Wiltshire, T., Orth, A.P., Vega, R.G., Sapinoso, L.M., Moqrich, A., Patapoutian, A., Hampton, G.M., Schultz, P.G., Hogenesch, J.B.: Large-scale analysis of the human and mouse transcriptomes. Proceedings of the National Academy of Sciences of the United States of America 99, 4465–4470 (2002)
Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B 63, 411–423 (2001)
Utro, F.: Algorithms for internal validation clustering measures in the Post Genomic Era, Doctoral Dissertation, University of Palermo (2011), http://arxiv.org/abs/1102.2915v1
Xu, Y., Olman, V., Xu, D.: Clustering gene expression data using a graph-theoretic approach: An application of minimum spanning tree. Bioinformatics 18, 526–535 (2002)
Yeung, K.Y.: Cluster Analysis of Gene Expression Data. Ph.D. thesis, University of Washington (2001)
Yeung, K.Y., Haynor, D.R., Ruzzo, W.L.: Validating clustering for gene expression data. Bioinformatics 17, 309–318 (2001)
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Giancarlo, R., Bosco, G.L., Pinello, L., Utro, F. (2011). The Three Steps of Clustering in the Post-Genomic Era: A Synopsis. In: Rizzo, R., Lisboa, P.J.G. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2010. Lecture Notes in Computer Science, vol 6685. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21946-7_2
DOI: https://doi.org/10.1007/978-3-642-21946-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21945-0
Online ISBN: 978-3-642-21946-7
eBook Packages: Computer Science (R0)