Abstract
In microarray data analysis, visualizations based on agglomerative clustering results are widely used to help biomedical researchers form a mental model of their data. To support the choice of algorithm and parameterization, we propose a novel cluster index, the tree index (TI), which evaluates hierarchical cluster results with regard to their visual appearance and their agreement with available background information. Visually appealing cluster trees are characterized by splits that separate homogeneous clusters, i.e. clusters that have low inner-cluster variance and share a medical class label, from the rest of the data. To evaluate cluster trees with regard to this property, the TI computes the likeliness of every single split in the cluster tree. Computing TIs for different algorithms and parameterizations makes it possible to identify the most appealing cluster tree among the many possible tree visualizations. The application is demonstrated on simulated data as well as on two publicly available cancer data sets.
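The idea of scoring every split of a cluster tree against class labels can be illustrated with a small sketch. The abstract does not state the TI formula, so everything below is an assumption made for illustration only: the tree is a nested tuple of class labels, a split's probability is modeled with a multivariate hypergeometric distribution over the label counts, and the tree is summarized by the spread of the per-split scores. The function names (`leaf_labels`, `split_probability`, `split_scores`, `score_spread`) are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import comb, log
from statistics import pstdev

# Assumed representation: a tree is either a class label (a leaf, given as a
# string) or a pair (left, right) of subtrees.

def leaf_labels(tree):
    """Return the class labels of all leaves below `tree`."""
    if isinstance(tree, tuple):
        left, right = tree
        return leaf_labels(left) + leaf_labels(right)
    return [tree]

def split_probability(left_labels, right_labels):
    """Multivariate hypergeometric probability of the left child's label
    counts, given the parent's label counts and the left child's size.
    This model is an assumption, not the paper's stated definition."""
    left = Counter(left_labels)
    parent = left + Counter(right_labels)
    ways = 1
    for label, total in parent.items():
        ways *= comb(total, left[label])
    return ways / comb(sum(parent.values()), len(left_labels))

def split_scores(tree):
    """Collect -log(probability) for every split in the tree. A larger value
    means the split is less likely to arise from a random partition."""
    if not isinstance(tree, tuple):
        return []
    left, right = tree
    p = split_probability(leaf_labels(left), leaf_labels(right))
    return [-log(p)] + split_scores(left) + split_scores(right)

def score_spread(tree):
    """Summarize a tree by the standard deviation of its split scores
    (an illustrative summary, chosen for this sketch)."""
    return pstdev(split_scores(tree))

# Two toy trees over the same eight labeled leaves.
pure = ((("A", "A"), ("A", "A")), (("B", "B"), ("B", "B")))
mixed = ((("A", "B"), ("A", "B")), (("A", "B"), ("A", "B")))

print(score_spread(pure), score_spread(mixed))
```

In this toy example the tree whose root split separates the two classes yields the larger spread. The spread is used rather than the sum because, under this particular model, the split scores of any binary tree over the same labeled leaves add up to the same value, so the sum alone cannot distinguish trees.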
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Martin, C., Nattkemper, T.W. (2008). A Tree Index to Support Clustering Based Exploratory Data Analysis. In: Elloumi, M., Küng, J., Linial, M., Murphy, R.F., Schneider, K., Toma, C. (eds) Bioinformatics Research and Development. BIRD 2008. Communications in Computer and Information Science, vol 13. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70600-7_1
DOI: https://doi.org/10.1007/978-3-540-70600-7_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70598-7
Online ISBN: 978-3-540-70600-7
eBook Packages: Computer Science, Computer Science (R0)