Abstract
Grouping objects into categories is a basic means of cognition. In machine learning and statistics, this task is addressed by cluster analysis. Yet how to assess the reliability and quality of clusterings remains controversial. In particular, it is hard to determine the optimal number of clusters inherent in the underlying data. Running different cluster algorithms and cluster validation methods usually yields different optimal clusterings. In fact, several clusterings with different numbers of clusters are plausible in many situations, because different methods specialize in different structural properties. To account for the possibility of multiple plausible clusterings, we employ a multi-objective approach for collecting cluster alternatives (MOCCA) from a combination of cluster algorithms and validation measures. In an application to artificial data as well as microarray data sets, we demonstrate that exploring a Pareto set of optimal partitions rather than a single solution can identify alternative solutions that are overlooked by conventional clustering strategies. Competing solutions are ranked according to an impartial criterion, while the final judgement is left to the investigator.
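The idea of keeping a Pareto set of partitions instead of a single winner can be illustrated with a minimal sketch. It is not the MOCCA implementation: the candidate names, the two validation scores per candidate, and the assumption that every measure is to be maximized are all hypothetical, and the ranking criterion mentioned in the abstract is not reproduced here.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every objective and strictly
    better on at least one. All objectives are assumed to be maximized."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


# Hypothetical candidates: one entry per (algorithm, number of clusters),
# scored by two made-up validation measures where higher is better.
candidates = {
    "kmeans_k2": (0.90, 0.40),
    "kmeans_k3": (0.70, 0.70),
    "kmeans_k4": (0.60, 0.65),
    "hclust_k3": (0.50, 0.80),
}

front = pareto_set(candidates)
print(front)
```

In this toy input, `kmeans_k4` is dropped because `kmeans_k3` scores higher on both measures, while the three remaining candidates each win on some trade-off and would all be presented to the investigator.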
Cite this article
Kraus, J.M., Müssel, C., Palm, G. et al. Multi-objective selection for collecting cluster alternatives. Comput Stat 26, 341–353 (2011). https://doi.org/10.1007/s00180-011-0244-6
DOI: https://doi.org/10.1007/s00180-011-0244-6