Abstract
Cluster validity evaluation is an active topic in clustering research. To address the problem of determining the optimal number of clusters, this paper proposes a new cluster validity index, the Ratio of Deviation of Sum-of-squares and Euclid distance (RDSED), and designs an RDSED-based cluster validity evaluation method that dynamically determines the near-optimal number of clusters. First, based on an analysis of intra-class and inter-class relationships, five quantities are defined: the within-cluster sum-of-squares, the between-cluster sum-of-squares, the total sum-of-squares, the sum of intra-cluster distances, and the average distance between clusters. The RDSED index is then constructed from these quantities. Second, an RDSED-based evaluation method for dynamically determining the near-optimal number of clusters is designed. In this method, the RDSED value is computed for each candidate number of clusters, proceeding from the largest to the smallest value in the search range, and the index value is used to terminate the validity verification process dynamically, yielding the near-optimal number of clusters and the corresponding clustering partition. Experimental results on artificial and real datasets show that, compared with several classical cluster validity evaluation methods, the proposed method in most cases obtains the near-optimal number of clusters closest to the true number of clusters and can effectively evaluate clustering partition results.
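The five quantities named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact definitions or the RDSED formula, so the code below assumes the standard textbook forms (squared Euclidean distances to cluster centers and to the overall mean), and the function names `centroid` and `partition_statistics` are hypothetical. It does not compute RDSED itself.

```python
from math import dist  # available in Python 3.8+


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def partition_statistics(points, labels):
    """Sum-of-squares and distance summaries for one clustering partition.

    Assumed textbook definitions, for illustration only; the paper's own
    definitions and the RDSED formula are not reproduced here.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    overall = centroid(points)
    centers = {lab: centroid(members) for lab, members in clusters.items()}

    # Within-cluster sum-of-squares: each point to its own cluster center.
    ssw = sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    # Between-cluster sum-of-squares: size-weighted centers to the overall mean.
    ssb = sum(len(m) * dist(centers[lab], overall) ** 2
              for lab, m in clusters.items())
    # Total sum-of-squares: each point to the overall mean.
    sst = sum(dist(p, overall) ** 2 for p in points)

    # Sum of intra-cluster Euclidean distances (point to own center).
    intra = sum(dist(p, centers[lab]) for p, lab in zip(points, labels))
    # Average Euclidean distance over all pairs of cluster centers.
    labs = sorted(centers)
    pairs = [(a, b) for i, a in enumerate(labs) for b in labs[i + 1:]]
    avg_between = (sum(dist(centers[a], centers[b]) for a, b in pairs) / len(pairs)
                   if pairs else 0.0)

    return {"ssw": ssw, "ssb": ssb, "sst": sst,
            "intra": intra, "avg_between": avg_between}
```

Under these assumed definitions the total sum-of-squares decomposes as SST = SSW + SSB, which gives a quick sanity check on any partition. A validity index built from such quantities would be evaluated once per candidate number of clusters.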
Acknowledgements
The authors would like to thank all the referees for their constructive and insightful comments on this paper.
Funding
This study was funded by the National Natural Science Foundation of China (No. 61862042, 61762062, 61601215, 61862044); the Science and Technology Innovation Platform Project of Jiangxi Province (No. 20181BCD40005); the Major Discipline Academic and Technical Leader Training Plan Project of Jiangxi Province (No. 20172BCB22030); the Primary Research & Development Plan Project of Jiangxi Province (No. 20192BBE50075, 20181ACE50033, 20171BBE50064, 2013ZBBE50018); the Natural Science Foundation of Jiangxi Province (No. 20192BAB207019, 20192BAB207020, 20171BAB202027); and the Graduate Innovation Fund Project of Jiangxi Province (No. YC2019-S100, YC2019-S048).
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Communicated by V. Loia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Li, X., Liang, W., Zhang, X. et al. A cluster validity evaluation method for dynamically determining the near-optimal number of clusters. Soft Comput 24, 9227–9241 (2020). https://doi.org/10.1007/s00500-019-04449-7
DOI: https://doi.org/10.1007/s00500-019-04449-7