Bootstrapping estimates of stability for clusters, observations and model selection

Yu, Han; Chapman, Brian; Di Florio, Arianna; Eischen, Ellen; Gotz, David; Jacob, Mathews; Blair, Rachael Hageman

doi:10.1007/s00180-018-0830-y


Original Paper
Published: 28 August 2018

Computational Statistics, volume 34, pages 349–372 (2019)



Abstract

Clustering is a challenging problem in unsupervised learning. In lieu of a gold standard, stability has become a valuable surrogate for performance and robustness. In this work, we propose a non-parametric bootstrapping approach to estimating the stability of a clustering method, which also captures the stability of individual clusters and observations. This flexible framework enables different types of comparisons between clusterings and can be used with two bootstrap approaches to stability. The first approach, scheme 1, assesses confidence (stability) in the clustering of the original dataset based on bootstrap replications. The second approach, scheme 2, searches over the bootstrap clusterings for an optimally stable partitioning of the data. The two schemes accommodate different model assumptions that can be motivated by an investigator’s trust (or lack thereof) in the original data and by additional computational considerations. We propose a hierarchical visualization, extrapolated from the stability profiles, that gives insight into the separation of groups, and projected visualizations for inspecting the stability of individual observations. Our approaches show good performance in simulation and on real data. They can be implemented using the R package bootcluster, which is available on the Comprehensive R Archive Network (CRAN).




Author information

Arianna Di Florio
Present address: Institute of Psychological Medicine and Clinical Neurosciences, Cardiff University School of Medicine, Hadyn Ellis Building, Maindy Road, Cathays, Cardiff, CF24 4HQ, UK

Authors and Affiliations

Department of Biostatistics, State University of New York at Buffalo, 3435 Main Street, 706 Kimball Tower, Buffalo, NY, 14214, USA
Han Yu
Department of Radiology and Imaging Science, University of Utah, 729 Arapeen Drive, Salt Lake City, UT, 84108, USA
Brian Chapman
Department of Psychiatry, University of North Carolina at Chapel Hill, Campus Box 7160, Chapel Hill, NC, 27599, USA
Arianna Di Florio
Department of Mathematics, University of Oregon, 315 Fenton Hall, Eugene, OR, 97403-1222, USA
Ellen Eischen
School of Information and Library Science, University of North Carolina at Chapel Hill, 216 Lenoir Drive, Campus Box 3360, Chapel Hill, NC, 27599, USA
David Gotz
Department of Electrical and Computer Engineering, University of Iowa, 3314 Seamans Center for the Engineering Arts and Sciences, Iowa City, IA, 52242, USA
Mathews Jacob
Department of Biostatistics, State University of New York at Buffalo, 3435 Main Street, 709 Kimball Tower, Buffalo, NY, 14214, USA
Rachael Hageman Blair


Corresponding author

Correspondence to Rachael Hageman Blair.

Additional information

This work was supported by the National Science Foundation. HY and RHB were both supported through NSF DMS 1557589, and RHB also through NSF DMS 1312250. BC was supported through NSF DMS 1557576. EE was supported through NSF DMS 1557642. MJ was supported through NSF DMS 1557668. AD and DG were supported through NSF DMS 1557593.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 260 KB)


About this article

Cite this article

Yu, H., Chapman, B., Di Florio, A. et al. Bootstrapping estimates of stability for clusters, observations and model selection. Comput Stat 34, 349–372 (2019). https://doi.org/10.1007/s00180-018-0830-y


Received: 16 November 2016
Accepted: 18 August 2018
Published: 28 August 2018
Issue Date: 05 March 2019
DOI: https://doi.org/10.1007/s00180-018-0830-y
