Abstract
This paper proposes new measures, called clustering performance measures (CPMs), for assessing the reliability of a clustering algorithm. The CPMs are defined using two components: a validation measure, which determines how well the algorithm works with a given set of parameter values, and a repeatability measure, which is used to study the stability of the clustering solutions and can estimate the correct number of clusters in a dataset. The proposed CPMs can be used to evaluate clustering algorithms that have a structure bias toward certain types of data distribution as well as those that have no such bias. Additionally, we propose a novel cluster validity index, the V_I index, which is able to handle non-spherical clusters. Five clustering algorithms are evaluated on different types of real-world and synthetic data. The first dataset type is a communications signal dataset representing one modulation scheme under a variety of noise conditions, the second comprises two breast cancer datasets, and the third comprises several synthetic datasets with arbitrarily shaped clusters. Comparisons with other methods for estimating the number of clusters indicate the applicability and reliability of the proposed V_I cluster validity index and repeatability measure for correctly estimating the number of clusters.
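The abstract does not give the definition of the repeatability measure, so the following is only a generic illustration of the underlying idea, not the authors' CPM. It is a minimal sketch that assumes stability can be probed by clustering random subsamples and comparing the resulting partitions on their shared points with the Rand index. The function names (`kmeans_labels`, `rand_index`, `repeatability`), the choice of k-means with farthest-first initialisation, and the parameters `runs` and `frac` are all assumptions made for this example.

```python
import math
import random
from itertools import combinations


def kmeans_labels(points, k, rng, iters=50):
    """Plain Lloyd's k-means with farthest-first initialisation (illustrative only)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Next center: the point whose nearest chosen center is farthest away.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two partitions agree (1.0 = identical)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)


def repeatability(points, k, runs=6, frac=0.8, seed=0):
    """Mean Rand index between clusterings of random subsamples (hypothetical score)."""
    rng = random.Random(seed)
    n = len(points)
    results = []
    for _ in range(runs):
        idx = rng.sample(range(n), int(frac * n))
        labels = kmeans_labels([points[i] for i in idx], k, rng)
        results.append(dict(zip(idx, labels)))  # original index -> cluster label
    scores = []
    for a, b in combinations(results, 2):
        shared = sorted(set(a) & set(b))  # compare only points present in both runs
        scores.append(rand_index([a[i] for i in shared], [b[i] for i in shared]))
    return sum(scores) / len(scores)
```

Calling `repeatability(points, k)` for several candidate values of `k` and looking for the value with the most stable partitions is one common way to use such a score. The Rand index is used here because it is invariant to relabelling of clusters; the paper's own measure may differ in both the resampling scheme and the agreement statistic.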
Notes
This paper is a significant extension of [24] and contains further investigations and experimental results.
Software code for the CPMs may be available on request from sameh.salem@liverpool.ac.uk.
References
Webb AR (2003) Statistical pattern recognition. Wiley, New York
Theodoridis S, Koutroumbas K (2003) Pattern recognition. Academic Press, New York
Halkidi M, Batistakis Y, Vazirgiannis M (2002) Cluster validity methods: part I. SIGMOD Record 31(2):40–45
Halkidi M, Batistakis Y, Vazirgiannis M (2002) Cluster validity methods: part II. SIGMOD Record 31(3):19–27
Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1:224–227
Dunn JC (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 3:32–57
Calinski T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat 3:1–27
Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 24(12):1650–1654
Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179
Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41(8):578–588
Bezdek JC, Pal NR (1998) Some new indexes of cluster validity. IEEE Trans Syst Man Cybern Part B 28(3):301–315
Xie XL, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 13:841–847
Chou C, Su M, Lai E (2004) A new cluster validity measure and its application to image compression. Pattern Anal Appl 7:205–220
Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters via the gap statistic. J R Stat Soc B 63(2):411–423
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs
Law MH, Jain AK (2003) Cluster validity by bootstrapping partitions. Technical report MSU-CSE-03-5, Department of Computer Science and Engineering, Michigan State University
Lange T, Braun M, Buhmann JM (2004) Stability-based validation of clustering solutions. Neural Comput 16:1299–1323
Ben-Hur A, Elisseeff A, Guyon I (2002) A stability based method for discovering structure in clustered data. In: Pacific symposium on biocomputing. World Scientific, Singapore, pp 6–17
Levine E, Domany E (2001) Resampling method for unsupervised estimation of cluster validity. Neural Comput 13:2573–2593
Jain AK, Moreau JV (1987) Bootstrap technique in cluster analysis. Pattern Recognit 20:547–568
Tibshirani R, Walther G, Botstein D, Brown P (2001) Cluster validation by prediction strength. Technical report, Statistics Department, Stanford University, Stanford, CA
Dudoit S, Fridlyand J (2002) A prediction-based resampling method for estimating the number of clusters in a data set. Genome Biol 3(7). Available online: http://genomebiology.com/2002/317/research/0036
Lange T, Braun M, Roth V, Buhmann JM (2002) Stability-based model selection. Adv Neural Inf Process Syst 15:617–624
Salem SA, Nandi AK (2005) New assessment criteria for clustering algorithms. In: Proceedings of the IEEE international workshop on machine learning for signal processing, Mystic, CT, USA, pp 285–290
Proakis JG (2001) Digital communications. McGraw-Hill, Boston
UCI Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
Ng AY, Jordan MI, Weiss Y (2002) On spectral clustering: analysis and an algorithm. In: Advances in neural information processing systems, vol 14. MIT Press, Cambridge
Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to algorithms. ISBN 10:0-262-03293-7. The MIT Press, London
Fischer B, Buhmann JM (2003) Path based clustering for grouping smooth curves and texture segmentation. IEEE Trans Pattern Anal Mach Intell 25:1–6
Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York
Fonseca JRS, Cardoso MGMS (2007) Mixture-model cluster analysis using information theoretical criteria. Intell Data Anal 11:155–173
Kverh B, Leonardis A (2004) A generalisation of model selection criteria. Pattern Anal Appl 7:51–65
Hu T, Sung Y (2005) Clustering spatial data with a hybrid EM approach. Pattern Anal Appl 8:139–148
Hu X, Xu L (2004) Investigation on several model selection criteria for determining the number of clusters. Neural Inf Process Lett Rev 4:1–10
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
Han J, Kamber M (2001) Data mining: concepts and techniques. Morgan Kaufmann, San Francisco
Kohonen T (1997) Self-organizing maps. Springer, Heidelberg
Chen T, Chen L-K, Ma K-K (1999) Colour image indexing using SOM for region-of-interest retrieval. Pattern Anal Appl 2(2):164–171
Zhang S, Ganesan R, Xistris GD (1996) Self-organizing neural networks for automated machinery monitoring systems. Mech Syst Signal Process 10(5):517–532
Chen CW, Luo JB, Parker KJ (1998) Image segmentation via adaptive K-mean clustering and knowledge-based morphological operations with biomedical applications. IEEE Trans Image Process 7(12):1673–1683
Frigui H (2005) Unsupervised learning of arbitrarily shaped clusters using ensembles of Gaussian models. Pattern Anal Appl 8:32–49
Pelleg D, Moore AW (2000) X-means: extending K-means with efficient estimation of the number of clusters. In: Seventeenth international conference on machine learning. Morgan Kaufmann, San Francisco, pp 727–734
Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York
Yang MS (1993) A survey of fuzzy clustering. Math Comput Modell 18:1–16
Baraldi A, Blonda P (1999) A survey of fuzzy clustering algorithms for pattern recognition. IEEE Trans Syst Man Cybern Part B 29(6):778–801
Xu W, Nandi AK, Zhang J (2003) Novel fuzzy reinforcement learning vector quantization algorithm and its application in image compression. IEE Proc Vis Image Signal Process 150(5):292–298
Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining (KDD), Portland, OR, USA, pp 226–231
Ankerst M, Breunig MM, Kriegel H, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings of the international conference on management of data (SIGMOD). ACM Press, Philadelphia 28(2):49–60
Jack LB, Nandi AK (2004) Microarray data using the self organising oscillator network. In: Proceedings of EUSIPCO 2004, Vienna, Austria, pp 2183–2186
Von Luxburg U (2006) A tutorial on spectral clustering. Max Planck Institute for Biological Cybernetics. Technical report no. TR-149
Acknowledgments
The authors would like to acknowledge the financial support of the Egyptian Ministry of Higher Education, Egypt, for S. A. Salem, and many fruitful discussions with Dr. L. B. Jack, formerly of the University of Liverpool.
Cite this article
Salem, S.A., Nandi, A.K. Development of assessment criteria for clustering algorithms. Pattern Anal Applic 12, 79–98 (2009). https://doi.org/10.1007/s10044-007-0099-1
DOI: https://doi.org/10.1007/s10044-007-0099-1