
Bootstrapping estimates of stability for clusters, observations and model selection

  • Original Paper
  • Computational Statistics

Abstract

Clustering is a challenging problem in unsupervised learning. In lieu of a gold standard, stability has become a valuable surrogate for performance and robustness. In this work, we propose a non-parametric bootstrapping approach to estimating the stability of a clustering method, which also captures the stability of the individual clusters and observations. This flexible framework enables different types of comparisons between clusterings and can be used in connection with two possible bootstrap approaches for stability. The first approach, scheme 1, can be used to assess confidence (stability) around the clustering of the original dataset based on bootstrap replications. The second approach, scheme 2, searches over the bootstrap clusterings for an optimally stable partitioning of the data. The two schemes accommodate different model assumptions that can be motivated by an investigator’s trust (or lack thereof) in the original data and by additional computational considerations. We propose a hierarchical visualization, extrapolated from the stability profiles, that gives insight into the separation of groups, and projected visualizations for inspecting the stability of individual observations. Our approaches show good performance in simulation and on real data. These approaches can be implemented using the R package bootcluster, which is available on the Comprehensive R Archive Network (CRAN).
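To make the idea behind scheme 1 concrete, the following is a minimal Python sketch written only from the description in the abstract. It is not the bootcluster API and not necessarily the authors' exact estimator. The assumptions are ours: a plain k-means clusterer, assignment of every original observation to its nearest bootstrap centroid so that partitions are comparable, and Jaccard similarity between the original and bootstrap clusters that contain each observation. All function names (kmeans, nearest, jaccard, bootstrap_stability) are hypothetical.

```python
import random

def nearest(p, centroids):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means on a list of tuples; returns k centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty group keeps its previous centroid.
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
               for j, g in enumerate(groups)]
        if new == centroids:
            break
        centroids = new
    return centroids

def jaccard(a, b):
    """Jaccard similarity of two sets of observation indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def bootstrap_stability(points, k, n_boot=50, seed=0):
    """Assumed scheme-1-style estimate: compare the clustering of the
    original data with clusterings fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(points)
    ref_centroids = kmeans(points, k, rng)
    ref_labels = [nearest(p, ref_centroids) for p in points]
    ref_sets = [{i for i in range(n) if ref_labels[i] == j} for j in range(k)]
    obs_total = [0.0] * n
    for _ in range(n_boot):
        # Non-parametric bootstrap: resample observations with replacement.
        sample = [points[rng.randrange(n)] for _ in range(n)]
        boot_centroids = kmeans(sample, k, rng)
        # Label every original observation with the bootstrap model.
        boot_labels = [nearest(p, boot_centroids) for p in points]
        boot_sets = [{i for i in range(n) if boot_labels[i] == j} for j in range(k)]
        for i in range(n):
            obs_total[i] += jaccard(ref_sets[ref_labels[i]],
                                    boot_sets[boot_labels[i]])
    obs = [t / n_boot for t in obs_total]                 # per observation
    clusters = [sum(obs[i] for i in s) / len(s) if s else 0.0
                for s in ref_sets]                        # per cluster
    overall = sum(obs) / n                                # whole clustering
    return obs, clusters, overall

# Toy data: two well-separated Gaussian blobs.
data_rng = random.Random(1)
data = ([(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(30)] +
        [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(30)])
obs, clusters, overall = bootstrap_stability(data, k=2, n_boot=20)
print(round(overall, 3), [round(c, 3) for c in clusters])
```

On well-separated groups like these, all three levels of stability should be close to 1; running the same function over several values of k and comparing the overall score is one simple way such a measure could inform model selection.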



Author information

Corresponding author

Correspondence to Rachael Hageman Blair.

Additional information

This work was supported by the National Science Foundation. HY and RHB were both supported through NSF DMS 1557589, and RHB also through NSF DMS 1312250. BC was supported through NSF DMS 1557576. EE was supported through NSF DMS 1557642. MJ was supported through NSF DMS 1557668. AD and DG were supported through NSF DMS 1557593.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 260 KB)


About this article


Cite this article

Yu, H., Chapman, B., Di Florio, A. et al. Bootstrapping estimates of stability for clusters, observations and model selection. Comput Stat 34, 349–372 (2019). https://doi.org/10.1007/s00180-018-0830-y


  • DOI: https://doi.org/10.1007/s00180-018-0830-y
