
DStab: estimating clustering quality by distance stability

  • Theoretical Advances
  • Published in: Pattern Analysis and Applications

Abstract

Most commonly, stability analyses are performed using an external validation measure. For example, the Jaccard index is one of the indices of choice for stability measurement. The index is wrapped around a resampling method to sense the model’s stability. Other methods instead use classifiers to look for stable partitions. In these cases, a resampling method is also combined with an external index, a classifier-driven error measure, and a clustering algorithm, aiming to find stable clustering model configurations. In contrast to previous stability-based methods, we propose a novel validation procedure consisting of an internal validation index within a resampling strategy. We propose an index based on the distance between cluster centroids, coupled with a twofold cross-validation resampling approach. Moreover, we use a threshold based on a null hypothesis to detect meaningful clustering partitions. As part of our experimental study, we selected the K-means algorithm for its simplicity, but primarily for its instability compared to other algorithms, such as hierarchical methods. Finally, we compare our approach with several well-known validation indexes and discuss the results. Our findings show that our method can not only find meaningful clustering partitions but is also helpful as an unsupervised data analysis tool.
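The core idea of the abstract (cluster each half of a twofold split with K-means and score stability by the distance between matched centroids) can be sketched roughly as follows. This is a minimal illustration, not the paper's exact procedure: the function name, the repeated random halving, and the use of Hungarian matching via `linear_sum_assignment` to pair centroids across halves are our assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def centroid_stability(X, k, n_splits=10, seed=0):
    """Hypothetical sketch: for each twofold split, run K-means on each
    half and record the mean distance between matched centroids."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(len(X))
        half = len(X) // 2
        A, B = X[idx[:half]], X[idx[half:]]
        ca = KMeans(n_clusters=k, n_init=10, random_state=0).fit(A).cluster_centers_
        cb = KMeans(n_clusters=k, n_init=10, random_state=0).fit(B).cluster_centers_
        D = cdist(ca, cb)                # pairwise centroid distances
        r, c = linear_sum_assignment(D)  # one-to-one centroid matching
        scores.append(D[r, c].mean())    # smaller mean distance = more stable
    return np.array(scores)
```

A well-chosen k should yield nearly identical centroids on both halves and hence a small score; the resulting score distribution is what gets compared against a null reference.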


Data and code availability

The code is available at https://github.com/ar-baya/DStab. The data are available in public repositories.


Funding

This study was funded by CONICET Grant “PIP 11220170100066CO.”

Author information

Contributions

AB: Conceptualization, Methodology, Software, Writing—Original draft preparation. ML: Methodology, Visualization, Investigation, Writing—Original draft preparation—Reviewing and Editing.

Corresponding author

Correspondence to Ariel E. Bayá.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Here we include Tables 4 and 5, which summarize Figs. 7, 8, 9, 10 and 11. Table 4 presents the estimated number of clusters for the artificial datasets, and Table 5 the estimates for the real datasets. Each table has three columns: the dataset name, the estimated number of clusters, and a p-value. To determine candidates, we analyze the relation between DStab and the reference for each k. In our analyses, we view DStab as an observed distribution and the reference as the null hypothesis. We assume that the observed data and the null come from the same distribution but differ in location. Hence, we can use a non-parametric test to estimate whether the median of the DStab distribution is smaller than that of the reference; in this case, a smaller median means better stability. The results presented in Tables 4 and 5 show the numbers of clusters k with the lowest p-values after performing a Mann–Whitney U test. When fewer solutions are listed, the remaining candidates had p-values above the 0.05 threshold, and a hyphen indicates that no candidate fell below that threshold. Solutions inside parentheses are equivalent candidates. For example, the Meter D dataset has three identical solutions (2-3-4), so we report a single p-value for them.
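The candidate-selection step described above can be sketched as a one-sided Mann–Whitney U test per k, keeping only the values of k whose observed DStab distribution sits significantly below the null reference. The function name and the dictionary interface are our assumptions; only the test itself and the 0.05 threshold come from the text.

```python
from scipy.stats import mannwhitneyu

def select_candidates(dstab_scores, ref_scores, alpha=0.05):
    """Hypothetical helper: dstab_scores and ref_scores map each k to an
    array of resampled index values. Returns significant k's by p-value."""
    pvals = {}
    for k in dstab_scores:
        # one-sided test: is the observed DStab median smaller than the null's?
        _, p = mannwhitneyu(dstab_scores[k], ref_scores[k], alternative='less')
        pvals[k] = p
    # keep only candidates below the significance threshold, best first
    return {k: p for k, p in sorted(pvals.items(), key=lambda kv: kv[1]) if p < alpha}
```

An empty result corresponds to the hyphen entries in Tables 4 and 5, where no candidate falls below the 0.05 threshold.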

Fig. 7

We display the individual DStab results on 8 artificial datasets. The blue plot is the observed DStab score, and the orange plot is the DStab reference. “Appendix” discusses candidate selection and Table 2 summarizes validation results for DStab and other methods (color figure online)

Fig. 8

Artificial data visualization. Panel a displays the Unbalance dataset, and panel b the Gauss dataset

Fig. 9

We display the individual DStab results on 8 UCI datasets. The blue plot is the observed DStab score, and the orange plot is the DStab reference. “Appendix” discusses candidate selection and Table 3 summarizes validation results for DStab and other methods (color figure online)

Fig. 10

We display the individual DStab results on 4 genomic datasets. The blue plot is the observed DStab score, and the orange plot is the DStab reference. “Appendix” discusses candidate selection and Table 3 summarizes validation results for DStab and other methods (color figure online)

Fig. 11

Real data discriminant components visualization. Panel a displays the Meter D dataset’s first and second components, and panel b shows its second and third components. Panel c depicts the Leukemia dataset’s first and second components. Panel d displays the Fetal Health dataset’s first and second components

Table 4 We perform a Mann–Whitney U test between DStab and DStab Ref., assuming that significant results have a lower median value than the reference
Table 5 We perform a Mann–Whitney U test between DStab and DStab Ref., assuming that significant results have a lower median value than the reference

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Bayá, A.E., Larese, M.G. DStab: estimating clustering quality by distance stability. Pattern Anal Applic 26, 1463–1479 (2023). https://doi.org/10.1007/s10044-023-01175-7

