Abstract
Estimating the optimal number of clusters (NC) is pivotal in cluster analysis. From the viewpoint of sample geometry, this paper designs a novel internal clustering validity index, termed the between-within cluster (BWC) index, and proposes a method to estimate the optimal NC. The BWC index improves on the well-known Silhouette index. BWC validates the results produced by a given clustering algorithm (e.g., affinity propagation or hierarchical clustering) and estimates the optimal NC for many kinds of data sets, including synthetic data sets, benchmark data sets, UCI data sets, gene expression data sets, and images. Theoretical analysis and experimental studies demonstrate the effectiveness and efficiency of the new index and method.
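The abstract does not give the BWC formula, so the following is only a minimal sketch of the general procedure it describes: cluster the data for each candidate NC, score every partition with an internal validity index, and keep the NC with the best score. The classic Silhouette index (the baseline that BWC is said to improve) stands in for BWC, and a small deterministic k-means stands in for affinity propagation or hierarchical clustering. The toy data and the names `kmeans`, `silhouette`, and `estimate_nc` are illustrative assumptions, not taken from the paper.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def kmeans(points, k, iters=50):
    """Plain k-means with deterministic farthest-first initialisation.
    Stand-in clusterer (assumption): any algorithm yielding labels would do."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(centers[j])  # keep the centre of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette(points, labels):
    """Mean Silhouette width: s(i) = (b - a) / max(a, b), with s(i) = 0 for singletons."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)       # within-cluster distance
        b = min(sum(dist(p, q) for q in other) / len(other)     # nearest other cluster
                for m, other in clusters.items() if m != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def estimate_nc(points, candidates):
    """Return the candidate NC with the highest index value, plus all scores."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores

# Toy data (assumption): three well-separated groups of five points each.
offsets = [(0, 0), (0.5, 0), (-0.5, 0), (0, 0.5), (0, -0.5)]
points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (5, 9)] for dx, dy in offsets]

best_nc, scores = estimate_nc(points, range(2, 7))
print(best_nc, {k: round(v, 3) for k, v in scores.items()})
```

Swapping `silhouette` for another internal index changes only the scoring function; the search over candidate NC values stays the same.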
Acknowledgements
The authors would like to thank the anonymous reviewers for their insightful comments and valuable suggestions. This work was supported in part by the Fundamental Research Funds for the Central Universities under Grant JUSRP11235 and in part by the National Natural Science Foundation of China under Grant Nos. 61673193 and 61833007.
Cite this article
Zhou, S., Liu, F. & Song, W. Estimating the Optimal Number of Clusters Via Internal Validity Index. Neural Process Lett 53, 1013–1034 (2021). https://doi.org/10.1007/s11063-021-10427-8