An unsupervised and robust validity index for clustering analysis

  • Methodologies and Application
Soft Computing

Abstract

The evaluation of clustering results plays an important role in clustering analysis and is usually carried out with one or more validity indexes. However, existing validity indexes are supervised, since they depend heavily on prior information such as a specified clustering algorithm and optimal initializations. When this prior information is unavailable, the evaluation results of these supervised validity indexes are no longer guaranteed, which greatly limits their range of application. In this paper, we first propose an estimate of the lower and upper bounds of the number of within-cluster distances in any dataset, and then present an unsupervised validity index that requires neither a clustering algorithm nor an initialization. A group of typical simulated and real datasets with various characteristics is used to validate the proposed index in an unsupervised way. Experimental results demonstrate that the proposed index achieves higher accuracy on most of the tested datasets and has advantages in robustness and runtime compared with existing validity indexes.

Acknowledgements

This work was supported by the National Science Foundation of China (Grant No. 61573251). The authors thank Yalu Liao for assistance with editing the format of the original manuscript.

Author information

Corresponding author

Correspondence to Shihong Yue.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest. This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

About this article

Cite this article

Wang, Y., Yue, S., Hao, Z. et al. An unsupervised and robust validity index for clustering analysis. Soft Comput 23, 10303–10319 (2019). https://doi.org/10.1007/s00500-018-3582-2

  • DOI: https://doi.org/10.1007/s00500-018-3582-2
