Experiments for the Number of Clusters in K-Means

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4874)

Abstract

K-means is one of the most popular data mining and unsupervised learning algorithms for the well-known clustering problem. The procedure partitions a given data set into a pre-specified number of clusters K in a simple and straightforward way, so the problem of determining “the right number of clusters” has attracted considerable interest. However, to the authors’ knowledge, no experimental comparison of the proposed selection methods has been reported so far. This paper presents results of such a comparison involving eight selection options representing four approaches. We generate data according to Gaussian-mixture distributions with varying cluster spreads and spatial sizes. The most consistent results are shown by the least-squares and least-moduli versions of an intelligent version of the method, iK-Means by Mirkin [14]. However, the right K is reproduced best by Hartigan’s [5] method. This leads us to propose an adjusted iK-Means method, which performs well in the current experimental setting.
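
The paper itself provides no code, but as a concrete illustration of the kind of rule being compared, the sketch below applies Hartigan's criterion [5] for choosing K on top of standard K-means, run on synthetic Gaussian-mixture data. It is a minimal sketch, not the authors' implementation: it assumes NumPy and scikit-learn are available, the data generator is only an illustrative stand-in for the NETLAB-based generator of [4], and the threshold of 10 is Hartigan's usual rule of thumb.

```python
# Illustrative sketch (assumption: not the authors' code or data generator).
# Chooses K with Hartigan's rule on top of standard K-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic Gaussian-mixture data with clusters of differing spread -- a simple
# stand-in for the NETLAB-generated mixtures used in the paper [4].
centers = rng.uniform(-10, 10, size=(5, 2))
spreads = rng.uniform(0.5, 2.0, size=5)
X = np.vstack([c + s * rng.standard_normal((200, 2)) for c, s in zip(centers, spreads)])

def within_cluster_ss(X, k):
    """Total within-cluster sum of squares W_K of a K-means fit with k clusters."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

def hartigan_k(X, k_max=15, threshold=10.0):
    """Return the smallest K whose Hartigan statistic drops to the threshold or below.

    H(K) = (W_K / W_{K+1} - 1) * (n - K - 1); clusters are added while H(K) > 10.
    """
    n = X.shape[0]
    w_prev = within_cluster_ss(X, 1)
    for k in range(1, k_max):
        w_next = within_cluster_ss(X, k + 1)
        h = (w_prev / w_next - 1.0) * (n - k - 1)
        if h <= threshold:
            return k
        w_prev = w_next
    return k_max

print("Estimated number of clusters:", hartigan_k(X))
```

The iK-Means method of Mirkin [14] and the adjusted variant proposed in the paper use a different, anomalous-pattern initialisation scheme and are not reproduced here.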

References

  1. Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821 (1993)

  2. Calinski, T., Harabasz, J.: A dendrite method for cluster analysis. Communications in Statistics 3(1), 1–27 (1974)

  3. Chiang, M.M.T., Mirkin, B.: Determining the number of clusters in the straight K-means: Experimental comparison of eight options. In: Proceedings of the 2006 UK Workshop on Computational Intelligence, pp. 119–126 (2006)

  4. Generation of Gaussian mixture distributed data, NETLAB neural network software (2006), http://www.ncrg.aston.ac.uk/netlab

  5. Hartigan, J.A.: Clustering Algorithms. J. Wiley & Sons, New York (1975)

  6. Hubert, L.J., Arabie, P.: Comparing partitions. Journal of Classification 2, 193–218 (1985)

  7. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988)

  8. Kaufman, L., Rousseeuw, P.: Finding Groups in Data: An Introduction to Cluster Analysis. J. Wiley & Sons, New York (1990)

  9. Krzanowski, W., Lai, Y.: A criterion for determining the number of groups in a data set using sum of squares clustering. Biometrics 44, 23–34 (1988)

  10. McLachlan, G., Basford, K.: Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York (1988)

  11. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. II, pp. 281–297 (1967)

  12. Milligan, G.W., Cooper, M.C.: An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159–179 (1985)

  13. Mirkin, B.: Eleven ways to look at the chi-squared coefficient for contingency tables. The American Statistician 55(2), 111–120 (2001)

  14. Mirkin, B.: Clustering for Data Mining: A Data Recovery Approach. Chapman and Hall/CRC, Boca Raton, FL (2005)

  15. Monti, S., Tamayo, P., Mesirov, J., Golub, T.: Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning 52, 91–118 (2003)

  16. Roweis, S.: EM algorithms for PCA and SPCA. In: Jordan, M., Kearns, M., Solla, S. (eds.) Advances in Neural Information Processing Systems, vol. 10, pp. 626–632. MIT Press, Cambridge (1998)

  17. Sugar, C.A., James, G.M.: Finding the number of clusters in a data set: An information-theoretic approach. Journal of the American Statistical Association 98(463), 750–763 (2003)

  18. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B 63, 411–423 (2001)

  19. Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. Journal of the Royal Statistical Society B 61, 611–622 (1999)

  20. Wasito, I., Mirkin, B.: Nearest neighbours in least-squares data imputation algorithms with different missing patterns. Computational Statistics & Data Analysis 50, 926–949 (2006)

  21. Yeung, K.Y., Ruzzo, W.L.: Details of the Adjusted Rand index and clustering algorithms. Bioinformatics 17, 763–774 (2001)

Editor information

José Neves, Manuel Filipe Santos, José Manuel Machado

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chiang, M.M.T., Mirkin, B. (2007). Experiments for the Number of Clusters in K-Means. In: Neves, J., Santos, M.F., Machado, J.M. (eds.) Progress in Artificial Intelligence. EPIA 2007. Lecture Notes in Computer Science (LNAI), vol. 4874. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77002-2_33

  • DOI: https://doi.org/10.1007/978-3-540-77002-2_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-77000-8

  • Online ISBN: 978-3-540-77002-2
