The Three Steps of Clustering in the Post-Genomic Era: A Synopsis

  • Conference paper
Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2010)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6685)

Abstract

Clustering is one of the best-known activities in scientific investigation and an object of research in many disciplines, ranging from Statistics to Computer Science. Following Handl et al., it can be summarized as a three-step process: (a) choice of a distance function; (b) choice of a clustering algorithm; (c) choice of a validation method. Although such a purist approach to clustering is rarely seen in many areas of science, genomic data require that level of attention if inferences drawn from cluster analysis are to be of any relevance to biomedical research. Unfortunately, the high dimensionality of the data and their noisy nature make cluster analysis of genomic data particularly difficult. This paper highlights new findings that seem to address a few relevant problems in each of the three steps, with regard both to the intrinsic predictive power of methods and algorithms and to their time performance. Including the latter aspect in the evaluation process is quite novel, since it is rarely considered in genomic data analysis.
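
The three-step process summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' method: the two distance functions (Euclidean and one minus Pearson correlation), the naive average-linkage algorithm, the Adjusted Rand index used as an external validation measure against a hypothetical gold partition, and the toy expression profiles. Wall-clock time is recorded alongside the index only to reflect the abstract's point that time performance is part of the evaluation.

```python
import math
import time
from itertools import combinations


def euclidean(x, y):
    """Step (a), option 1: Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """Step (a), option 2: one minus the Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # assumption: a constant profile counts as uncorrelated
    return 1.0 - cov / (sx * sy)


def distance_matrix(data, dist):
    """Symmetric matrix of pairwise distances under the chosen function."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = dist(data[i], data[j])
    return d


def average_linkage(d, k):
    """Step (b): naive agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(d))]

    def linkage(a, b):
        return sum(d[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # combinations() yields p < q, so deleting q leaves index p valid
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] += clusters[q]
        del clusters[q]
    labels = [0] * len(d)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def adjusted_rand(labels_a, labels_b):
    """Step (c): Adjusted Rand index between two partitions of the same items."""
    contingency, rows, cols = {}, {}, {}
    for a, b in zip(labels_a, labels_b):
        contingency[(a, b)] = contingency.get((a, b), 0) + 1
        rows[a] = rows.get(a, 0) + 1
        cols[b] = cols.get(b, 0) + 1
    sum_ij = sum(math.comb(v, 2) for v in contingency.values())
    sum_a = sum(math.comb(v, 2) for v in rows.values())
    sum_b = sum(math.comb(v, 2) for v in cols.values())
    expected = sum_a * sum_b / math.comb(len(labels_a), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)


# Hypothetical toy data: six expression profiles and an assumed gold partition.
data = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.2], [0.9, 2.0, 2.9],
        [3.0, 2.0, 1.0], [3.1, 1.9, 0.8], [2.9, 2.1, 1.1]]
gold = [0, 0, 0, 1, 1, 1]

for name, dist in [("euclidean", euclidean), ("pearson", pearson_distance)]:
    start = time.perf_counter()
    labels = average_linkage(distance_matrix(data, dist), k=2)
    elapsed = time.perf_counter() - start
    print(name, labels, adjusted_rand(labels, gold), f"{elapsed:.6f}s")
```

Keeping the three steps as separate functions means any one of them can be swapped (a different distance, algorithm or validation index) while the others stay fixed, which is the kind of comparison the three-step view invites.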

References

  1. Broad Institute, http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=89

  2. NCI 60 Cancer Microarray Project, http://genome-www.stanford.edu/NCI60

  3. Stanford Microarray Database, http://genome-www5.stanford.edu/

  4. Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J.J., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., Levy, R., Wilson, W., Grever, M.R., Byrd, J.C., Botstein, D., Brown, P.O., Staudt, L.M.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511 (2000)

  5. Ben-Hur, A., Elisseeff, A., Guyon, I.: A stability based method for discovering structure in clustering data. In: Seventh Pacific Symposium on Biocomputing, pp. 6–17. ISCB (2002)

  6. Borodin, A., Ostrovsky, R., Rabani, Y.: Subquadratic approximation algorithms for clustering problems in high dimensional space. Machine Learning 56, 153–167 (2004)

  7. Brunet, J.-P., Tamayo, P., Golub, T.R., Mesirov, J.P.: Metagenes and molecular pattern discovery using matrix factorization. Proc. of the National Academy of Sciences of the United States of America 101, 4164–4169 (2004)

  8. Devarajan, K.: Nonnegative Matrix Factorization: An Analytical and Interpretive Tool in Computational Biology. PLoS Comput. Biol. 4, e1000029 (2008)

  9. Deza, E., Deza, M.: Dictionary of distances. Elsevier, Amsterdam (2006)

  10. D’haeseleer, P.: How does gene expression clustering work? Nature Biotechnology 23, 1499–1501 (2005)

  11. Di Gesú, V., Giancarlo, R., Lo Bosco, G., Raimondi, A., Scaturro, D.: Genclust: A genetic algorithm for clustering gene expression data. BMC Bioinformatics 6, 289 (2005)

  12. Dudoit, S., Fridlyand, J.: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3 (2002)

  13. Fisher, D., Hoffman, P.: The Adjusted Rand Statistic: A SAS macro. Psychometrika 53, 417–423 (1988)

  14. Frahling, G., Sohler, C.: A fast K-means implementation using coresets. In: Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, pp. 135–143. ACM, New York (2006)

  15. Freyhult, E., Landfors, M., Önskog, J., Hvidsten, T.R., Rydén, P.: Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering. BMC Bioinformatics 11, 503 (2010)

  16. Giancarlo, R., Lo Bosco, G., Pinello, L.: Distance Functions, Clustering Algorithms and Microarray Data Analysis. In: Blum, C., Battiti, R. (eds.) LION 4. LNCS, vol. 6073, pp. 125–138. Springer, Heidelberg (2010)

  17. Giancarlo, R., Scaturro, D., Utro, F.: Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer. BMC Bioinformatics 9, 462 (2008)

  18. Giancarlo, R., Scaturro, D., Utro, F.: Statistical Indices for Computational and Data Driven Class Discovery in Microarray Data. In: Biological Data Mining, pp. 295–335. CRC Press, Boca Raton (2009)

  19. Giancarlo, R., Utro, F.: Speeding up the Consensus Clustering methodology for microarray data analysis. Algorithms for Molecular Biology 6, 1 (2011)

  20. Handl, J., Knowles, J., Kell, D.B.: Computational cluster validation in post-genomic data analysis. Bioinformatics 21, 3201–3212 (2005)

  21. Hartuv, E., Schmitt, A., Lange, J., Meier-Ewert, S., Lehrach, H., Shamir, R.: An algorithm for clustering of cDNAs for gene expression analysis using short oligonucleotide fingerprints. Genomics 66, 249–256 (2000)

  22. Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982)

  23. Jain, A.K., Murty, M.N., Flynn, P.J.: Data Clustering: a Review. ACM Computing Surveys 31, 264–323 (1999)

  24. Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs (1988)

  25. Klie, S., Nikoloski, Z., Selbig, J.: Biological cluster evaluation for gene function prediction. Journal of Computational Biology 17, 1–18 (2010)

  26. Kraus, J., Kestler, H.: A highly efficient multi-core algorithm for clustering extremely large datasets. BMC Bioinformatics 11, 169 (2010)

  27. Lee, D.D., Seung, H.S.: Learning the parts of objects by Non-negative Matrix Factorization. Nature 401, 788–791 (1999)

  28. Mehta, T., Tanik, M., Allison, D.B.: Towards sound epistemological foundations of statistical methods for high-dimensional biology. Nature Genetics 36, 943–947 (2004)

  29. Monti, S., Tamayo, P., Mesirov, J., Golub, T.: Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning 52, 91–118 (2003)

  30. Priness, I., Maimon, O., Ben-Gal, I.: Evaluation of gene-expression clustering via mutual information distance measure. BMC Bioinformatics 8, 1–12 (2007)

  31. Seal, S., Comarina, S., Aluru, S.: An optimal hierarchical clustering algorithm for gene expression data. Information Processing Letters 93, 143–147 (2004)

  32. Speed, T.P.: Statistical analysis of gene expression microarray data. Chapman & Hall/CRC (2003)

  33. Su, A.I., Cooke, M.P., Ching, K.A., Hakak, Y., Walker, J.R., Wiltshire, T., Orth, A.P., Vega, R.G., Sapinoso, L.M., Moqrich, A., Patapoutian, A., Hampton, G.M., Schultz, P.G., Hogenesch, J.B.: Large-scale analysis of the human and mouse transcriptomes. Proceedings of the National Academy of Sciences of the United States of America 99, 4465–4470 (2002)

  34. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B 63, 411–423 (2001)

  35. Utro, F.: Algorithms for internal validation clustering measures in the Post Genomic Era, Doctoral Dissertation, University of Palermo (2011), http://arxiv.org/abs/1102.2915v1

  36. Xu, Y., Olman, V., Xu, D.: Clustering gene expression data using a graph-theoretic approach: An application of minimum spanning tree. Bioinformatics 18, 526–535 (2002)

  37. Yeung, K.Y.: Cluster Analysis of Gene Expression Data. Ph.D. thesis, University of Washington (2001)

  38. Yeung, K.Y., Haynor, D.R., Ruzzo, W.L.: Validating clustering for gene expression data. Bioinformatics 17, 309–318 (2001)

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Giancarlo, R., Bosco, G.L., Pinello, L., Utro, F. (2011). The Three Steps of Clustering in the Post-Genomic Era: A Synopsis. In: Rizzo, R., Lisboa, P.J.G. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2010. Lecture Notes in Computer Science, vol 6685. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21946-7_2

  • DOI: https://doi.org/10.1007/978-3-642-21946-7_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-21945-0

  • Online ISBN: 978-3-642-21946-7

  • eBook Packages: Computer Science (R0)
