Development of assessment criteria for clustering algorithms

  • Theoretical Advances
  • Pattern Analysis and Applications

Abstract

In this paper, new measures, called clustering performance measures (CPMs), are proposed for assessing the reliability of a clustering algorithm. The CPMs are defined using a validation measure, which determines how well the algorithm works with a given set of parameter values, and a repeatability measure, which is used to study the stability of the clustering solutions and can estimate the correct number of clusters in a dataset. The proposed CPMs can be used to evaluate clustering algorithms that have a structural bias toward certain types of data distribution as well as those that have no such bias. Additionally, we propose a novel cluster validity index, the V_I index, which is able to handle non-spherical clusters. Five clustering algorithms are evaluated on different types of real-world and synthetic data. The first dataset type is a communications signal dataset representing one modulation scheme under a variety of noise conditions, the second comprises two breast cancer datasets, and the third comprises synthetic datasets with arbitrarily shaped clusters. Comparisons with other methods for estimating the number of clusters indicate the applicability and reliability of the proposed V_I cluster validity index and repeatability measure for correctly estimating the number of clusters.
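
The abstract does not give the definition of the repeatability measure, so the following is only a minimal sketch of the general idea of stability under resampling, in the spirit of the resampling-based methods cited in the references. It is not the authors' CPM. Everything in it is an assumption made for illustration: the function names, the 80% subsampling fraction, the plain Rand index used as the agreement score, and the small k-means with farthest-first initialisation used as the clustering algorithm.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_centers(points, k, rng):
    """Pick one random point, then repeatedly add the point farthest from the chosen centers."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans_labels(points, k, rng, iters=50):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_first_centers(points, k, rng)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous center.
            updated.append(tuple(sum(x) / len(members) for x in zip(*members)) if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same cluster vs. different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def repeatability(points, k, runs=10, frac=0.8, seed=0):
    """Mean agreement between clusterings of two random subsamples, measured on their shared points."""
    rng = random.Random(seed)
    n = len(points)
    m = int(frac * n)
    scores = []
    for _ in range(runs):
        idx_a = rng.sample(range(n), m)
        idx_b = rng.sample(range(n), m)
        lab_a = dict(zip(idx_a, kmeans_labels([points[i] for i in idx_a], k, rng)))
        lab_b = dict(zip(idx_b, kmeans_labels([points[i] for i in idx_b], k, rng)))
        shared = [i for i in idx_a if i in lab_b]
        scores.append(rand_index([lab_a[i] for i in shared], [lab_b[i] for i in shared]))
    return sum(scores) / runs


def two_blobs(per_blob=20, seed=1):
    """Toy data: two tight, well-separated square blobs in the plane."""
    rng = random.Random(seed)
    return [(cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5))
            for cx, cy in [(0.0, 0.0), (10.0, 10.0)] for _ in range(per_blob)]


if __name__ == "__main__":
    data = two_blobs()
    for k in (2, 3, 4):
        print(k, repeatability(data, k))
```

On this toy data the score for k = 2 is 1.0, because every subsample is split into the same two blobs. A number of clusters that does not match the structure of the data will typically score lower, which is the intuition behind using stability to estimate the number of clusters.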

Notes

  1. This paper significantly extends [24] with further investigations and experimental results.

  2. Software code for the CPMs may be available on request from sameh.salem@liverpool.ac.uk.

References

  1. Webb AR (2003) Statistical pattern recognition. Wiley, New York

  2. Theodoridis S, Koutroumbas K (2003) Pattern recognition. Academic Press, New York

  3. Halkidi M, Batistakis Y, Vazirgiannis M (2002) Cluster validity methods: Part I. SIGMOD Record 31(2):40–45

  4. Halkidi M, Batistakis Y, Vazirgiannis M (2002) Cluster validity methods: Part II. SIGMOD Record 31(3):19–27

  5. Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1:224–227

  6. Dunn JC (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 3:32–57

  7. Calinski T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat 3:1–27

  8. Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 24(12):1650–1654

  9. Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179

  10. Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41(8):578–588

  11. Bezdek JC, Pal NR (1998) Some new indexes of cluster validity. IEEE Trans Syst Man Cybern Part B 28(3):301–315

  12. Xie XL, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 13:841–847

  13. Chou C, Su M, Lai E (2004) A new cluster validity measure and its application to image compression. Pattern Anal Appl 7:205–220

  14. Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters via the gap statistic. J R Stat Soc B 63(2):411–423

  15. Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs

  16. Law MH, Jain AK (2003) Cluster validity by bootstrapping partitions. Technical report MSU-CSE-03-5, Department of Computer Science and Engineering, Michigan State University

  17. Lange T, Braun M, Buhmann JM (2004) Stability-based validation of clustering solutions. Neural Comput 16:1299–1323

  18. Ben-Hur A, Elisseeff A, Guyon I (2002) A stability based method for discovering structure in clustered data. In: Pacific symposium on biocomputing. World Scientific, Singapore, pp 6–17

  19. Levine E, Domany E (2001) Resampling method for unsupervised estimation of cluster validity. Neural Comput 13:2573–2593

  20. Jain AK, Moreau JV (1987) Bootstrap techniques in cluster analysis. Pattern Recognit 20:547–568

  21. Tibshirani R, Walther G, Botstein D, Brown P (2001) Cluster validation by prediction strength. Technical report, Statistics Department, Stanford University, Stanford, CA

  22. Dudoit S, Fridlyand J (2002) A prediction-based resampling method for estimating the number of clusters in a data set. Genome Biol 3(7). Available online: http://genomebiology.com/2002/3/7/research/0036

  23. Lange T, Braun M, Roth V, Buhmann JM (2002) Stability-based model selection. Adv Neural Inf Process Syst 15:617–624

  24. Salem SA, Nandi AK (2005) New assessment criteria for clustering algorithms. In: Proceedings of the IEEE international workshop on machine learning for signal processing, Mystic, CT, USA, pp 285–290

  25. Proakis JG (2001) Digital communications. McGraw-Hill, Boston

  26. UCI Machine Learning. http://www.ics.uci.edu/~mlearn/MLRepository.html

  27. Ng AY, Jordan MI, Weiss Y (2002) On spectral clustering: analysis and an algorithm. In: Advances in neural information processing systems, vol 14. MIT Press, Cambridge

  28. Cormen TH, Leiserson CE, Rivest LR, Stein C (2001) Introduction to algorithms. ISBN 10:0-262-03293-7. The MIT Press, London

  29. Fischer B, Buhmann JM (2003) Path based clustering for grouping smooth curves and texture segmentation. IEEE Trans Pattern Anal Mach Intell 25:1–6

  30. Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York

  31. Fonseca JRS, Cardoso MGMS (2007) Mixture-model cluster analysis using information theoretical criteria. Intell Data Anal 11:155–173

  32. Kverh B, Leonardis A (2004) A generalisation of model selection criteria. Pattern Anal Appl 7:51–65

  33. Hu T, Sung Y (2005) Clustering spatial data with a hybrid EM approach. Pattern Anal Appl 8:139–148

  34. Hu X, Xu L (2004) Investigation on several model selection criteria for determining the number of clusters. Neural Inf Process Lett Rev 4:1–10

  35. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

  36. Han J, Kamber M (2001) Data mining: concepts and techniques. Morgan Kaufmann, San Francisco

  37. Kohonen T (1997) Self-organizing maps. Springer, Heidelberg

  38. Chen T, Chen L-K, Ma K-K (1999) Colour image indexing using SOM for region-of-interest retrieval. Pattern Anal Appl 2(2):164–171

  39. Zhang S, Ganesan R, Xistris GD (1996) Self-organizing neural networks for automated machinery monitoring systems. Mech Syst Signal Process 10(5):517–532

  40. Chen CW, Luo JB, Parker KJ (1998) Image segmentation via adaptive k means clustering and knowledge-based morphological operations with biomedical applications. IEEE Trans Image Process 7(12):1673–1683

  41. Frigui H (2005) Unsupervised learning of arbitrarily shaped clusters using ensembles of Gaussian models. Pattern Anal Appl 8:32–49

  42. Pelleg D, Moore AW (2000) X-means: extending K-means with efficient estimation of the number of clusters. In: Seventeenth international conference on machine learning. Morgan Kaufmann, San Francisco, pp 727–734

  43. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York

  44. Yang MS (1993) A survey of fuzzy clustering. Math Comput Modell 18:1–16

  45. Baraldi A, Blonda P (1999) A survey of fuzzy clustering algorithms for pattern recognition. IEEE Trans Syst Man Cybern Part B 29(6):778–801

  46. Xu W, Nandi AK, Zhang J (2003) Novel fuzzy reinforcement learning vector quantization algorithm and its application in image compression. IEE Proc Vis Image Signal Process 150(5):292–298

  47. Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining (KDD), Portland, OR, USA, pp 226–231

  48. Ankerst M, Breunig M, Kriegel H, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings of the international conference on management of data (SIGMOD). ACM Press, Philadelphia 28(2):49–60

  49. Jack LB, Nandi AK (2004) Microarray data using the self organising oscillator network. In: Proceedings of EUSIPCO 2004, Vienna, Austria, pp 2183–2186

  50. Von Luxburg U (2006) A tutorial on spectral clustering. Max Planck Institute for Biological Cybernetics. Technical report no. TR-149

Acknowledgments

The authors acknowledge the financial support of the Egyptian Ministry of Higher Education for S. A. Salem, and many fruitful discussions with Dr. L. B. Jack, formerly of the University of Liverpool.

Author information

Corresponding author

Correspondence to Sameh A. Salem.

About this article

Cite this article

Salem, S.A., Nandi, A.K. Development of assessment criteria for clustering algorithms. Pattern Anal Applic 12, 79–98 (2009). https://doi.org/10.1007/s10044-007-0099-1

  • DOI: https://doi.org/10.1007/s10044-007-0099-1
