Comparison of the Performance of Center-Based Clustering Algorithms

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2003)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2637)

Abstract

Center-based clustering algorithms such as K-means and EM form one of the most popular classes of clustering algorithms in use today. The author previously developed another member of this family, K-Harmonic Means (KHM), and demonstrated on a small number of “benchmark” datasets that KHM is more robust than K-means and EM. In this paper we compare the performance of these algorithms statistically: we run K-means, K-Harmonic Means, and EM on each of 3600 (dataset, initialization) pairs and compare the statistical average and variation of their performance. The results show that, for low-dimensional datasets, KHM performs consistently better than KM, and KM performs consistently better than EM, over a large variation in the clustered-ness of the datasets and a large variation of initializations. Some of the reasons contributing to this difference are explained.
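The algorithms compared here differ mainly in how each data point is weighted when the centers are recomputed: K-means uses a hard winner-take-all assignment, while KHM gives every point a soft, distance-decaying weight toward every center. The sketch below illustrates one center-update step of each, following the harmonic-mean performance function Perf = Σᵢ K / (Σₖ d(xᵢ, cₖ)⁻ᵖ) described in the author's earlier KHM papers; the exponent p = 3.5, the epsilon guard, and the function names are illustrative choices, not the paper's code.

```python
# Minimal sketch (assumed details noted above): one center-update step of
# K-Harmonic Means versus K-means on points in R^dim.
import math

def khm_update(points, centers, p=3.5, eps=1e-12):
    """One KHM iteration: every point pulls every center with soft weight
    w_ik = d_ik^-(p+2) / (sum_j d_ij^-p)^2 (illustrative formulation)."""
    dim, K = len(points[0]), len(centers)
    num = [[0.0] * dim for _ in range(K)]   # weighted sums of points
    den = [0.0] * K                          # sums of weights
    for x in points:
        d = [max(math.dist(x, c), eps) for c in centers]
        s = sum(di ** -p for di in d)        # sum_j d_ij^-p
        for k in range(K):
            w = d[k] ** (-p - 2) / (s * s)   # soft membership weight
            den[k] += w
            for j in range(dim):
                num[k][j] += w * x[j]
    return [[num[k][j] / max(den[k], eps) for j in range(dim)]
            for k in range(K)]

def kmeans_update(points, centers):
    """One K-means iteration, for contrast: hard assignment to the
    nearest center, then each center moves to its cluster's mean."""
    dim, K = len(points[0]), len(centers)
    num = [[0.0] * dim for _ in range(K)]
    cnt = [0] * K
    for x in points:
        k = min(range(K), key=lambda k: math.dist(x, centers[k]))
        cnt[k] += 1
        for j in range(dim):
            num[k][j] += x[j]
    return [[num[k][j] / cnt[k] for j in range(dim)] if cnt[k] else list(centers[k])
            for k in range(K)]
```

Because a far-away point still exerts a (small) pull on every KHM center, a badly initialized center is gradually drawn toward data it would never win under K-means' hard assignment, which is one intuition for KHM's reduced sensitivity to initialization.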





Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, B. (2003). Comparison of the Performance of Center-Based Clustering Algorithms. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science, vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_7


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-04760-5

  • Online ISBN: 978-3-540-36175-6

  • eBook Packages: Springer Book Archive
