Abstract
Evolutionary K-Means (EKM), which combines K-Means with a genetic algorithm, addresses K-Means’ initialization problem by selecting parameters automatically through the evolution of partitions. Current EKM algorithms usually adopt the silhouette index as their cluster validity index, and they are effective when clusters are well separated. However, their performance on noisy data is often disappointing. Clustering stability-based approaches, on the other hand, are more robust to noise, yet they need an intelligent starting point to find some challenging clusters. It is therefore necessary to combine EKM with clustering stability-based analysis. In this paper, we present a novel EKM algorithm that uses clustering stability to evaluate partitions. We first introduce two weighted aggregated consensus matrices, the positive aggregated consensus matrix (PA) and the negative aggregated consensus matrix (NA), which store the clustering tendency of each pair of instances. Specifically, PA stores the tendency of a pair to share the same label, and NA stores its tendency to have different labels. Based on these matrices, clusters and partitions can be evaluated from the viewpoint of clustering stability. We then propose a clustering stability-based EKM algorithm, CSEKM, that evolves partitions and the aggregated matrices simultaneously. To evaluate its performance, we compare it with an EKM algorithm, two consensus clustering algorithms, a clustering stability-based algorithm and a multi-index-based clustering approach. Experimental results on a series of artificial datasets, two simulated datasets and eight UCI datasets suggest that CSEKM is more robust to noise.
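The idea of a pair of weighted aggregated consensus matrices can be illustrated with a minimal Python sketch. The abstract does not give the weighting scheme or the exact update rule, so everything below is an assumption made for illustration: each partition carries one hypothetical scalar weight, PA accumulates that weight for pairs with the same label, NA accumulates it for pairs with different labels, and the helper names `aggregate_consensus` and `pair_stability` are invented here and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def aggregate_consensus(
    partitions: Sequence[Sequence[int]],
    weights: Sequence[float],
) -> Tuple[Matrix, Matrix]:
    """Return (PA, NA) for weighted partitions of the same n instances.

    Assumed rule (not from the paper): PA[i][j] sums the weights of the
    partitions that give i and j the same label; NA[i][j] sums the weights
    of the partitions that give them different labels.
    """
    if len(partitions) != len(weights):
        raise ValueError("one weight per partition is required")
    n = len(partitions[0])
    pa = [[0.0] * n for _ in range(n)]
    na = [[0.0] * n for _ in range(n)]
    for labels, w in zip(partitions, weights):
        if len(labels) != n:
            raise ValueError("all partitions must label the same instances")
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue  # the diagonal carries no pairwise information
                if labels[i] == labels[j]:
                    pa[i][j] += w
                else:
                    na[i][j] += w
    return pa, na


def pair_stability(pa: Matrix, na: Matrix, i: int, j: int) -> float:
    """Share of the accumulated weight that places i and j together.

    Returns 0.5 when the pair has no accumulated evidence (an assumption).
    """
    total = pa[i][j] + na[i][j]
    return 0.5 if total == 0 else pa[i][j] / total


# Three toy partitions of four instances, with hypothetical quality weights.
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
weights = [1.0, 0.5, 1.0]
PA, NA = aggregate_consensus(partitions, weights)
```

In this toy example, instances 0 and 1 share a label in every partition, so all of their accumulated weight lands in PA, whereas instances 0 and 2 are separated by the two heavier partitions and their weight lands mostly in NA. Label permutations between partitions do not matter, because only pairwise agreement is recorded.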
Acknowledgements
This study was funded by the National Natural Science Foundation of China (Grant No. 60805042) and the Fujian Natural Science Foundation (Grant No. 2018J01794).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by V. Loia.
Cite this article
He, Z., Yu, C. Clustering stability-based Evolutionary K-Means. Soft Comput 23, 305–321 (2019). https://doi.org/10.1007/s00500-018-3280-0
DOI: https://doi.org/10.1007/s00500-018-3280-0