Experiments for the Number of Clusters in K-Means

Chiang, Mark Ming-Tso; Mirkin, Boris

doi:10.1007/978-3-540-77002-2_33

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4874)

Abstract

K-means is one of the most popular data mining and unsupervised learning algorithms for the well-known clustering problem. The procedure classifies a given data set into a pre-specified number of clusters K, so the problem of determining “the right number of clusters” has attracted considerable interest. However, to the authors’ knowledge, no experimental comparison of the methods proposed for choosing K has been reported so far. This paper presents the results of such a comparison, involving eight selection options that represent four approaches. We generate data according to a Gaussian-mixture distribution in which the clusters’ spreads and spatial sizes vary. The most consistent results are shown by the least-squares and least-moduli versions of an intelligent version of the method, iK-Means by Mirkin [14]. However, the right K is reproduced best by Hartigan’s [5] method. This leads us to propose an adjusted iK-Means method, which performs well in the current experimental setting.
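As a rough illustration of the kind of experiment the abstract describes, the sketch below generates data from a small spherical Gaussian mixture and applies Hartigan's rule of thumb: compute H(K) = (W_K / W_{K+1} - 1)(N - K - 1), where W_K is the within-cluster sum of squares of a K-cluster partition, and take the smallest K with H(K) ≤ 10. Everything specific here is an assumption made for the example and not the paper's setup: the dimension, cluster sizes, centres, spread, the deterministic farthest-point seeding, and the function names `maxmin_seeds`, `kmeans_wss` and `hartigan_k`.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def maxmin_seeds(points, k):
    """Farthest-point seeding (an assumption of this sketch, chosen for determinism)."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(sq_dist(p, s) for s in seeds)))
    return seeds

def kmeans_wss(points, k, n_iter=100):
    """Run plain K-means and return the within-cluster sum of squares W_k."""
    centroids = maxmin_seeds(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def hartigan_k(points, k_max, threshold=10.0):
    """Smallest K in 1..k_max with H(K) <= threshold (Hartigan's rule of thumb)."""
    n = len(points)
    wss = [kmeans_wss(points, k) for k in range(1, k_max + 2)]  # wss[k - 1] holds W_k
    for k in range(1, k_max + 1):
        h = (wss[k - 1] / wss[k] - 1.0) * (n - k - 1)
        if h <= threshold:
            return k
    return k_max

# Hypothetical test data: three well-separated spherical Gaussian clusters
# in 20 dimensions, 30 points each, common standard deviation 0.5.
rng = random.Random(0)
DIM, PER_CLUSTER, SIGMA = 20, 30, 0.5
centres = [
    (0.0,) * DIM,
    (20.0,) + (0.0,) * (DIM - 1),
    (0.0, 20.0) + (0.0,) * (DIM - 2),
]
points = [
    tuple(rng.gauss(mu, SIGMA) for mu in centre)
    for centre in centres
    for _ in range(PER_CLUSTER)
]

estimated_k = hartigan_k(points, k_max=6)
print("Estimated number of clusters:", estimated_k)
```

The threshold of 10 is the value conventionally attached to Hartigan's rule. On data this cleanly separated the estimate should match the three generated clusters; overlapping or unequally spread clusters make the task much harder for any selection rule.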

References

1. Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821 (1993)
2. Calinski, T., Harabasz, J.: A Dendrite method for cluster analysis. Communications in Statistics 3(1), 1–27 (1974)
3. Chiang, M.M.T., Mirkin, B.: Determining the number of clusters in the Straight K-means: Experimental comparison of eight options. In: Proceeding of the 2006 UK workshop on Computational Intelligence, pp. 119–126 (2006)
4. Generation of Gaussian mixture distributed data, NETLAB neural network software (2006), http://www.ncrg.aston.ac.uk/netlab
5. Hartigan, J.A.: Clustering Algorithms. J. Wiley & Sons, New York (1975)
6. Hubert, L.J., Arabie, P.: Comparing partitions. Journal of Classification 2, 193–218 (1985)
7. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988)
8. Kaufman, L., Rousseeuw, P.: Finding Groups in Data: An Introduction to Cluster Analysis. J. Wiley & Son, New York (1990)
9. Krzanowski, W., Lai, Y.: A criterion for determining the number of groups in a dataset using sum of squares clustering. Biometrics 44, 23–34 (1988)
10. McLachlan, G., Basford, K.: Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York (1988)
11. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. II, pp. 281–297 (1967)
12. Milligan, G.W., Cooper, M.C.: An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159–179 (1985)
13. Mirkin, B.: Eleven ways to look at the Pearson chi squares coefficient at contingency tables. The American Statistician 55(2), 111–120 (2001)
14. Mirkin, B.: Clustering for Data Mining: A Data Recovery Approach. Chapman and Hall/CRC, Boca Raton Fl (2005)
15. Monti, S., Tamayo, P., Mesirov, J., Golub, T.: Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning 52, 91–118 (2003)
16. Roweis, S.: EM algorithms for PCA and SPCA. In: Jordan, M., Kearns, M., Solla, S. (eds.) Advances in Neural Information Processing Systems, vol. 10, pp. 626–632. MIT Press, Cambridge (1998)
17. Sugar, C.A., James, G.M.: Finding the number of clusters in a data set: An information-theoretic approach. Journal of American Statistical Association 98(463), 750–778 (2003)
18. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a dataset via the Gap statistics. Journal of the Royal Statistical Society B 63, 411–423 (2001)
19. Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. J. Roy. Statist. Soc. Ser. B 61, 611–622 (1999)
20. Wasito, I., Mirkin, B.: Nearest neighbours in least-squares data imputation algorithms with different missing patterns. Computational Statistics & Data Analysis 50, 926–949 (2006)
21. Yeung, K.Y., Ruzzo, W.L.: Details of the Adjusted Rand index and clustering algorithms. Bioinformatics 17, 763–774 (2001)

Author information

Authors and Affiliations

School of Computer Science & Information Systems, Birkbeck University of London, London, UK
Mark Ming-Tso Chiang & Boris Mirkin

Editor information

José Neves, Manuel Filipe Santos, José Manuel Machado

Cite this paper

Chiang, M.MT., Mirkin, B. (2007). Experiments for the Number of Clusters in K-Means. In: Neves, J., Santos, M.F., Machado, J.M. (eds) Progress in Artificial Intelligence. EPIA 2007. Lecture Notes in Computer Science, vol 4874. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77002-2_33

DOI: https://doi.org/10.1007/978-3-540-77002-2_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77000-8
Online ISBN: 978-3-540-77002-2
eBook Packages: Computer Science, Computer Science (R0)
