Clustering stability-based Evolutionary K-Means

  • Methodologies and Application
  • Published in: Soft Computing

Abstract

Evolutionary K-Means (EKM), which combines K-Means with a genetic algorithm, solves the initialization problem of K-Means by selecting parameters automatically through the evolution of partitions. Current EKM algorithms usually adopt the silhouette index as the cluster validity index; they are effective at clustering well-separated clusters, but their performance on noisy data is often disappointing. Clustering stability-based approaches, on the other hand, are more robust to noise, yet they need an intelligent start to find some challenging clusters. It is therefore natural to combine EKM with clustering stability-based analysis. In this paper, we present a novel EKM algorithm that uses clustering stability to evaluate partitions. We first introduce two weighted aggregated consensus matrices, the positive aggregated consensus matrix (PA) and the negative aggregated consensus matrix (NA), to store the clustering tendency of each pair of instances: PA stores the tendency of sharing the same label, and NA stores that of receiving different labels. Based upon these matrices, clusters and partitions can be evaluated from the viewpoint of clustering stability. We then propose a clustering stability-based EKM algorithm, CSEKM, that evolves partitions and the aggregated matrices simultaneously. To evaluate the algorithm's performance, we compare it with an EKM algorithm, two consensus clustering algorithms, a clustering stability-based algorithm and a multi-index-based clustering approach. Experimental results on a series of artificial datasets, two simulated datasets and eight UCI datasets suggest that CSEKM is more robust to noise.
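The abstract does not specify the weighting scheme behind PA and NA, so the following Python sketch is only a rough illustration of the co-association idea, not the authors' CSEKM: it accumulates unweighted "same label" and "different label" counts over a set of base K-Means partitions and scores a candidate partition by averaging the PA fraction over its within-cluster pairs and the NA fraction over its between-cluster pairs. The function names (aggregated_consensus, partition_stability) and the scoring formula are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

def aggregated_consensus(labels_list, n):
    """Accumulate positive/negative co-association counts over base partitions.

    PA[i, j]: how often instances i and j share a cluster label.
    NA[i, j]: how often they receive different labels.
    (Unweighted counts for illustration; the paper's PA/NA are weighted.)
    """
    PA = np.zeros((n, n))
    NA = np.zeros((n, n))
    for labels in labels_list:
        same = (labels[:, None] == labels[None, :]).astype(float)
        PA += same
        NA += 1.0 - same
    return PA, NA

def partition_stability(labels, PA, NA):
    """Score a partition: within-cluster pairs should co-occur often (high PA),
    between-cluster pairs should be separated often (high NA)."""
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)             # ignore self-pairs
    diff = labels[:, None] != labels[None, :]
    total = np.where(PA + NA == 0, 1.0, PA + NA)  # avoid division by zero
    within = (PA[same] / total[same]).mean() if same.any() else 0.0
    between = (NA[diff] / total[diff]).mean() if diff.any() else 0.0
    return 0.5 * (within + between)

# Toy usage: build base partitions with K-Means over several k values and seeds,
# then rank the candidates by the stability score.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
base = [KMeans(n_clusters=k, n_init=5, random_state=s).fit_predict(X)
        for k in (2, 3, 4) for s in (0, 1)]
PA, NA = aggregated_consensus(base, X.shape[0])
best = max(base, key=lambda lab: partition_stability(lab, PA, NA))
```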



Acknowledgements

This study was funded by the National Natural Science Foundation of China (Grant No. 60805042) and the Fujian Natural Science Foundation (Grant No. 2018J01794).

Author information


Corresponding author

Correspondence to Zhenfeng He.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

He, Z., Yu, C. Clustering stability-based Evolutionary K-Means. Soft Comput 23, 305–321 (2019). https://doi.org/10.1007/s00500-018-3280-0

