The Three Steps of Clustering in the Post-Genomic Era: A Synopsis

Giancarlo, R.; Lo Bosco, G.; Pinello, L.; Utro, F.

doi:10.1007/978-3-642-21946-7_2

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6685)

Included in the following conference series:

International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics

Abstract

Clustering is one of the best-known activities in scientific investigation and an object of research in many disciplines, ranging from Statistics to Computer Science. Following Handl et al., it can be summarized as a three-step process: (a) choice of a distance function; (b) choice of a clustering algorithm; (c) choice of a validation method. Although such a purist approach to clustering is rarely seen in many areas of science, genomic data require that level of attention if inferences drawn from cluster analysis are to be of relevance to biomedical research. Unfortunately, the high dimensionality of the data and their noisy nature make cluster analysis of genomic data particularly difficult. This paper highlights new findings that seem to address a few relevant problems in each of the three steps, with regard both to the intrinsic predictive power of methods and algorithms and to their time performance. Including time performance in the evaluation is quite novel, since it is rarely considered in genomic data analysis.
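The three-step process can be illustrated with a minimal sketch in standard-library Python. Everything in it is an assumption made for illustration: the toy expression profiles, the function names, and the particular choices of Euclidean versus Pearson-based distance (step a), average-linkage hierarchical clustering (step b), and the Adjusted Rand Index against a known partition (step c). It is not the authors' method or implementation.

```python
from itertools import combinations
from math import comb, sqrt


# Step (a): two candidate distance functions (illustrative choices).
def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    # 1 - Pearson correlation: small when two profiles have the same shape.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


# Step (b): naive average-linkage agglomerative clustering into k clusters.
def average_linkage(data, k, dist):
    clusters = [[i] for i in range(len(data))]

    def linkage(c1, c2):
        total = sum(dist(data[i], data[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the pair of clusters with the smallest average distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a is unaffected

    labels = [0] * len(data)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Step (c): external validation with the Adjusted Rand Index.
def adjusted_rand_index(truth, pred):
    n = len(truth)
    cells, rows, cols = {}, {}, {}
    for t, p in zip(truth, pred):
        cells[(t, p)] = cells.get((t, p), 0) + 1
        rows[t] = rows.get(t, 0) + 1
        cols[p] = cols.get(p, 0) + 1
    index = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0
    return (index - expected) / (max_index - expected)


# Hypothetical toy profiles: three "rising" and three "falling" genes,
# measured over four conditions at different magnitudes.
profiles = [
    [1, 2, 3, 4], [10, 20, 30, 40], [2, 3, 4, 5],
    [4, 3, 2, 1], [40, 30, 20, 10], [5, 4, 3, 2],
]
truth = [0, 0, 0, 1, 1, 1]

for name, dist in [("euclidean", euclidean), ("pearson", pearson_distance)]:
    labels = average_linkage(profiles, 2, dist)
    print(name, labels, adjusted_rand_index(truth, labels))
```

On this toy input the correlation-based distance recovers the two shape groups, while Euclidean distance groups the profiles by magnitude instead, so the outcome of steps (b) and (c) depends on the choice made in step (a). The naive clustering loop recomputes all linkages at every merge and is meant only for very small inputs.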

References

Broad Institute, http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=89
NCI 60 Cancer Microarray Project, http://genome-www.stanford.edu/NCI60
Stanford Microarray Database, http://genome-www5.stanford.edu/
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J.J., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., Levy, R., Wilson, W., Grever, M.R., Byrd, J.C., Botstein, D., Brown, P.O., Staudt, L.M.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511 (2000)
Ben-Hur, A., Elisseeff, A., Guyon, I.: A stability based method for discovering structure in clustering data. In: Seventh Pacific Symposium on Biocomputing, pp. 6–17. ISCB (2002)
Borodin, A., Ostrovsky, R., Rabani, Y.: Subquadratic approximation algorithms for clustering problems in high dimensional space. Machine Learning 56, 153–167 (2004)
Brunet, J.-P., Tamayo, P., Golub, T.R., Mesirov, J.P.: Metagenes and molecular pattern discovery using matrix factorization. Proc. of the National Academy of Sciences of the United States of America 101, 4164–4169 (2004)
Devarajan, K.: Nonnegative Matrix Factorization: An Analytical and Interpretive Tool in Computational Biology. PLoS Comput. Biol. 4, e1000029 (2008)
Deza, E., Deza, M.: Dictionary of distances. Elsevier, Amsterdam (2006)
D’haeseleer, P.: How does gene expression clustering work? Nature Biotechnology 23, 1499–1501 (2005)
Di Gesù, V., Giancarlo, R., Lo Bosco, G., Raimondi, A., Scaturro, D.: Genclust: A genetic algorithm for clustering gene expression data. BMC Bioinformatics 6, 289 (2005)
Dudoit, S., Fridlyand, J.: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3 (2002)
Fisher, D., Hoffman, P.: The Adjusted Rand Statistic: A SAS macro. Psychometrika 53, 417–423 (1988)
Frahling, G., Sohler, C.: A fast K-means implementation using coresets. In: Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, pp. 135–143. ACM, New York (2006)
Freyhult, E., Landfors, M., Önskog, J., Hvidsten, T.R., Rydén, P.: Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering. BMC Bioinformatics 11, 503 (2010)
Giancarlo, R., Lo Bosco, G., Pinello, L.: Distance Functions, Clustering Algorithms and Microarray Data Analysis. In: Blum, C., Battiti, R. (eds.) LION 4. LNCS, vol. 6073, pp. 125–138. Springer, Heidelberg (2010)
Giancarlo, R., Scaturro, D., Utro, F.: Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer. BMC Bioinformatics 9, 462 (2008)
Giancarlo, R., Scaturro, D., Utro, F.: Statistical Indices for Computational and Data Driven Class Discovery in Microarray Data. In: Biological Data Mining, pp. 295–335. CRC Press, Boca Raton (2009)
Giancarlo, R., Utro, F.: Speeding up the Consensus Clustering methodology for microarray data analysis. Algorithms for Molecular Biology 6, 1 (2011)
Handl, J., Knowles, J., Kell, D.B.: Computational cluster validation in post-genomic data analysis. Bioinformatics 21, 3201–3212 (2005)
Hartuv, E., Schmitt, A., Lange, J., Meier-Ewert, S., Lehrach, H., Shamir, R.: An algorithm for clustering of cDNAs for gene expression analysis using short oligonucleotide fingerprints. Genomics 66, 249–256 (2000)
Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982)
Jain, A.K., Murty, M.N., Flynn, P.J.: Data Clustering: a Review. ACM Computing Surveys 31, 264–323 (1999)
Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs (1988)
Klie, S., Nikoloski, Z., Selbig, J.: Biological cluster evaluation for gene function prediction. Journal of Computational Biology 17, 1–18 (2010)
Kraus, J., Kestler, H.: A highly efficient multi-core algorithm for clustering extremely large datasets. BMC Bioinformatics 11, 169 (2010)
Lee, D.D., Seung, H.S.: Learning the parts of objects by Non-negative Matrix Factorization. Nature 401, 788–791 (1999)
Mehta, T., Tanik, M., Allison, D.B.: Towards sound epistemological foundations of statistical methods for high-dimensional biology. Nature Genetics 36, 943–947 (2004)
Monti, S., Tamayo, P., Mesirov, J., Golub, T.: Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning 52, 91–118 (2003)
Priness, I., Maimon, O., Ben-Gal, I.: Evaluation of gene-expression clustering via mutual information distance measure. BMC Bioinformatics 8, 1–12 (2007)
Seal, S., Comarina, S., Aluru, S.: An optimal hierarchical clustering algorithm for gene expression data. Information Processing Letters 93, 143–147 (2004)
Speed, T.P.: Statistical analysis of gene expression microarray data. Chapman & Hall/CRC (2003)
Su, A.I., Cooke, M.P., Ching, K.A., Hakak, Y., Walker, J.R., Wiltshire, T., Orth, A.P., Vega, R.G., Sapinoso, L.M., Moqrich, A., Patapoutian, A., Hampton, G.M., Schultz, P.G., Hogenesch, J.B.: Large-scale analysis of the human and mouse transcriptomes. Proceedings of the National Academy of Sciences of the United States of America 99, 4465–4470 (2002)
Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B 63, 411–423 (2001)
Utro, F.: Algorithms for internal validation clustering measures in the Post Genomic Era, Doctoral Dissertation, University of Palermo (2011), http://arxiv.org/abs/1102.2915v1
Xu, Y., Olman, V., Xu, D.: Clustering gene expression data using a graph-theoretic approach: An application of minimum spanning tree. Bioinformatics 18, 526–535 (2002)
Yeung, K.Y.: Cluster Analysis of Gene Expression Data. Ph.D. thesis, University of Washington (2001)
Yeung, K.Y., Haynor, D.R., Ruzzo, W.L.: Validating clustering for gene expression data. Bioinformatics 17, 309–318 (2001)

Author information

Authors and Affiliations

Dipartimento di Matematica ed Informatica, Università di Palermo, Via Archirafi 34, 90123, Palermo, Italy
R. Giancarlo, G. Lo Bosco & L. Pinello
Computational Genomics Group, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
F. Utro

Editor information

Editors and Affiliations

ICAR-CNR, Consiglio Nazionale delle Ricerche, 90128, Palermo, Italy
Riccardo Rizzo
School of Computing and Mathematical Sciences, Liverpool John Moores University, L3 3AF, Liverpool, UK
Paulo J. G. Lisboa

Cite this paper

Giancarlo, R., Lo Bosco, G., Pinello, L., Utro, F. (2011). The Three Steps of Clustering in the Post-Genomic Era: A Synopsis. In: Rizzo, R., Lisboa, P.J.G. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2010. Lecture Notes in Computer Science, vol 6685. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21946-7_2

DOI: https://doi.org/10.1007/978-3-642-21946-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21945-0
Online ISBN: 978-3-642-21946-7
eBook Packages: Computer Science, Computer Science (R0)
