
On the Added Value of Bootstrap Analysis for K-Means Clustering

Journal of Classification

Abstract

Because of its deterministic nature, K-means does not yield confidence information about centroids and estimated cluster memberships, although such information could be useful for inferential purposes. In this paper we propose to obtain it by means of a non-parametric bootstrap procedure, the performance of which is tested in an extensive simulation study. Results show that the coverage of hyper-ellipsoid bootstrap confidence regions for the centroids is in general close to the nominal coverage probability. For the cluster memberships, we found that probabilistic membership information derived from the bootstrap analysis can be used to improve the cluster assignment of individual objects, albeit only when the number of clusters is very large. For smaller numbers of clusters, however, the probabilistic membership information still proved useful because it indicates for which objects the cluster assignment obtained from the original data is likely to be correct; this information can therefore be used to construct a partial clustering in which only those objects are assigned to clusters.
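To make the procedure concrete, the following is a minimal sketch, not the authors' code, of one way a non-parametric bootstrap around a K-means solution might be set up. It assumes scikit-learn's KMeans and uses an illustrative centroid-matching step to handle label switching across replications; the resulting membership proportions and centroid draws are rough analogues of the probabilistic memberships and centroid confidence regions discussed above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def bootstrap_kmeans(X, k, n_boot=500, random_state=0):
    """Illustrative non-parametric bootstrap around a K-means solution."""
    rng = np.random.default_rng(random_state)
    n, p = X.shape

    # K-means solution on the original data; serves as the reference labeling.
    ref = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)

    centroid_draws = np.empty((n_boot, k, p))
    membership_counts = np.zeros((n, k))  # times object i is assigned to cluster j

    for b in range(n_boot):
        # Resample objects with replacement and re-run K-means.
        idx = rng.choice(n, size=n, replace=True)
        km = KMeans(n_clusters=k, n_init=10, random_state=b).fit(X[idx])

        # Relabel clusters: match bootstrap centroids to the reference centroids
        # so that "cluster j" refers to the same cluster in every replication.
        cost = cdist(ref.cluster_centers_, km.cluster_centers_)
        _, perm = linear_sum_assignment(cost)
        centroid_draws[b] = km.cluster_centers_[perm]

        # Assign every original object to its nearest (relabeled) bootstrap centroid.
        labels = cdist(X, centroid_draws[b]).argmin(axis=1)
        membership_counts[np.arange(n), labels] += 1

    # Proportion of replications assigning each object to each cluster: a rough
    # probabilistic-membership summary; the spread of centroid_draws for each
    # cluster can likewise be summarized as a confidence region for that centroid.
    membership_probs = membership_counts / n_boot
    return ref, centroid_draws, membership_probs
```

In the spirit of the partial clustering mentioned above, one could then assign only those objects whose maximum membership proportion exceeds some threshold, leaving the remaining objects unassigned.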



Author information

Corresponding author: Joeri Hofmans.

Additional information

The research reported in this paper was supported by the Fund for Scientific Research – Flanders (Project G.0146.06) and by the Research Fund of K.U.Leuven (GOA/2005/04 and GOA/2010/02).


Cite this article

Hofmans, J., Ceulemans, E., Steinley, D. et al. On the Added Value of Bootstrap Analysis for K-Means Clustering. J Classif 32, 268–284 (2015). https://doi.org/10.1007/s00357-015-9178-y
