Abstract
This paper proposes a maximum clustering similarity (MCS) method for determining the number of clusters in a data set by studying the behavior of similarity indices that compare the partitions produced by two (of several) clustering methods. The similarity between the two clusterings is calculated at the same number of clusters, using the indices of Rand (R), Fowlkes and Mallows (FM), and Kulczynski (K), each corrected for chance agreement. The number of clusters at which the index attains its maximum is a candidate for the optimal number of clusters. The proposed method is applied to simulated bivariate normal data and further extended for use with circular data. Its performance is compared to the criteria discussed in Tibshirani, Walther, and Hastie (2001). The proposed method is not based on any distributional or data assumption, which makes it applicable to any type of data that can be clustered using at least two clustering algorithms.
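The selection rule described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: it uses the Hubert and Arabie (1985) adjusted Rand index as the chance-corrected similarity, which may differ from the correction used in the paper, and it omits the FM and K indices. The names `adjusted_rand`, `mcs_candidate`, and the toy labelings are hypothetical; in practice each pair of labelings would come from running two clustering algorithms on the same data at the same number of clusters.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index for two labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same observations")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two observations")
    # Count pairs of points placed together: in both labelings, in A, and in B.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        # Degenerate case: both labelings are one big cluster or all singletons.
        return 1.0
    return (pairs_both - expected) / (max_index - expected)


def mcs_candidate(partitions_by_k):
    """Return (best_k, scores): the k with the largest corrected similarity.

    partitions_by_k maps k -> (labels from method A, labels from method B),
    both obtained with the same number of clusters k.
    """
    scores = {k: adjusted_rand(a, b) for k, (a, b) in partitions_by_k.items()}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Hypothetical toy labelings of six points (made up for illustration only).
toy_partitions = {
    2: ([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]),
    3: ([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2]),
    4: ([0, 0, 1, 1, 2, 3], [0, 1, 1, 2, 2, 3]),
}
best_k, scores = mcs_candidate(toy_partitions)
```

In this toy input the two methods agree completely only at k = 3, so that value is returned as the candidate. Ties are broken by whichever k appears first in the mapping, which is an arbitrary choice made for this sketch.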
References
ALBATINEH, A.N., NIEWIADOMSKA-BUGAJ, M., and MIHALKO, D.P. (2006), ”On Similarity Indices and Correction for Chance Agreement”, Journal of Classification, 23, 301–313.
ALBATINEH, A.N. (2010), ”Means and Variances for a Family of Similarity Indices Used in Cluster Analysis”, Journal of Statistical Planning and Inference, 140, 2828–2838.
ANDREWS, D.F. (1972), ”Plots of High Dimensional Data”, Biometrics, 28, 125–136.
BANFIELD, J.D., and RAFTERY, A.E. (1993), ”Model-based Gaussian and Non-Gaussian Clustering”, Biometrics, 49, 803–821.
BARAGONA, R. (2003), ”Further Results on Lund’s Statistic for Identifying Cluster in a Circular Data Set with Application to Time Series”, Communications in Statistics: Simulation and Computation, 32, 943–952.
BATSCHELET, E. (1981), Circular Statistics in Biology, London: Academic Press.
BOCK, H.H. (1985), ”On Some Significance Tests in Cluster Analysis”, Journal of Classification, 2, 77–108.
BRECKENRIDGE, J.N. (1989), ”Replicating Cluster Analysis: Method, Consistency, and Validity”, Multivariate Behavioral Research, 24, 147–161.
BRUSCO, M.J., and STEINLEY, D. (2007), ”A Comparison of Heuristic Procedures for Minimum Within-Cluster Sums of Squares Partitioning”, Psychometrika, 72, 583–600.
CALINSKI, R.B., and HARABASZ, J. (1974), ”A Dendrite Method for Cluster Analysis”, Communications in Statistics, 3, 1–27.
COHEN, A.J. (1960), ”A Coefficient of Agreement for Nominal Scales”, Educational and Psychological Measurement, 20, 37–46.
DUDOIT, S., and FRIDLYAND, J. (2002), ”A Prediction-based Resampling Method for Estimating the Number of Clusters in a Dataset”, Genome Biology, 3, 1–21.
EVERITT, B.S., LANDAU, S., and LEESE, M. (2001), Cluster Analysis, New York: Oxford University Press.
FISHER, N.I. (1993), Statistical Analysis of Circular Data, Cambridge: Cambridge University Press.
FOWLKES, E.B., and MALLOWS, C.L. (1983), ”A Method for Comparing Two Hierarchical Clusterings”, Journal of the American Statistical Association, 78, 553–569.
FRALEY, C., and RAFTERY, A.E. (1998), ”How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis”, The Computer Journal, 41, 578–588.
GOODMAN, L., and KRUSKAL, W. (1954), ”Measures of Association for Cross Classifications”, Journal of the American Statistical Association, 49, 732–764.
GORDON, A.D. (1999), Classification (2nd ed.), St. Andrews: Chapman & Hall/CRC.
GUTTMAN, L. (1941), ”An Outline of the Statistical Theory of Prediction”, in Prediction of Personal Adjustment, ed. P. Horst, New York: Social Science Research Council.
HARDY, A. (1994), ”An Examination of Procedures for Determining the Number of Clusters in a Data Set”, in New Approaches in Classification and Data Analysis, ed. E. Diday et al., Paris: Springer-Verlag, pp. 178–185.
HARDY, A. (1996), ”On the Number of Clusters”, Computational Statistics and Data Analysis, 23, 83–96.
HUBERT, L., and ARABIE, P. (1985), ”Comparing Partitions”, Journal of Classification, 2, 193–218.
HARTIGAN, J.A. (1975), Clustering Algorithms, New York: Wiley.
JAIN, A.K., and DUBES, R.C. (1988), Algorithms for Clustering Data, New Jersey: Prentice Hall.
KAUFMAN, L., and ROUSSEEUW, P.J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons.
KOZIOL, J.A. (1990), ”Cluster Analysis of Antigenic Profiles of Tumors: Selection of Number of Clusters Using Akaike’s Information Criterion”, Methods of Information in Medicine, 29, 200–204.
KROLAK-SCHWERDT, S., and ECKES, T. (1992), ”A Graph Theoretic Criterion for Determining the Number of Clusters in a Data Set”, Multivariate Behavioral Research, 27, 541–565.
KRZANOWSKI, W.J., and LAI, Y.T. (1988), ”A Criterion for Determining the Number of Groups in a Data Set Using Sum of Squares Clustering”, Biometrics, 44, 23–34.
KULCZYNSKI, S. (1927), ”Die Pflanzenassociationen der Pienenen”, Bulletin of the International Academy of Political Science Letters, Science Mathematics Nature, Series B, Supplement, 2, 57–203.
LANGE, T., ROTH, V., BRAUN, M.L., and BUHMANN, J.M. (2004), ”Stability-Based Validation of Clustering Solutions”, Neural Computation, 16, 1299–1323.
LUND, U. (1999), ”Cluster Analysis for Directional Data”, Communications in Statistics: Simulation and Computation, 28, 1001–1009.
MARDIA, K.V., and JUPP, P.E. (2000), Directional Statistics, England: John Wiley & Sons Ltd.
MARRIOTT, F.H.C. (1971), ”Practical Problems in a Method of Cluster Analysis”, Biometrics, 27, 501–514.
MILLIGAN, G., and COOPER, M. (1985), ”An Examination of Procedures for Determining the Number of Clusters in a Data Set”, Psychometrika, 50, 159–179.
MILLIGAN, G., and COOPER, M. (1986), ”A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis”, Multivariate Behavioral Research, 21, 441–458.
MILLIGAN, G., SOON, S., and SOKOL, L. (1983), ”The Effect of Cluster Size, Dimensionality, and the Number of Clusters on Recovery of True Cluster Structure”, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5, 40–47.
MOREY, L., and AGRESTI, A. (1984), ”The Measurement of Classification Agreement: An Adjustment to the Rand Statistic for Chance Agreement”, Educational and Psychological Measurement, 44, 33–37.
PECK, R., FISHER, L., and NESS, V.J. (1989), ”Approximate Confidence Intervals for the Number of Clusters”, Journal of the American Statistical Association, 84, 184–191.
R Development Core Team (2007), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, URL http://www.R-project.org.
RAND, W. (1971), ”Objective Criteria for the Evaluation of Clustering Methods”, Journal of the American Statistical Association, 66, 846–850.
SAXENA, P.C., and NAVANEERHAM, K. (1991), ”The Effect of Cluster Size, Dimensionality, and Number of Clusters on Recovery of True Cluster Structure Through Chernoff-Type Faces”, The Statistician, 40, 415–425.
SAXENA, P.C., and NAVANEERHAM, K. (1993), ”Comparison of Chernoff-Type Face and Non-Graphical Methods for Clustering Multivariate Observations”, Computational Statistics and Data Analysis, 15, 63–79.
STEINLEY, D., and BRUSCO, M.J. (2007), ”Initializing K-means Batch Clustering: A Critical Evaluation of Several Techniques”, Journal of Classification, 24, 99–121.
STRUYF, A., HUBERT, M., and ROUSSEEUW, P.J. (1997), ”Integrating Robust Clustering Techniques in S-PLUS”, Computational Statistics and Data Analysis, 26, 17–37.
SUGAR, C.A., and JAMES, G.M. (2003), ”Finding the Number of Clusters in a Dataset: An Information-Theoretic Approach”, Journal of the American Statistical Association, 98, 750–763.
TIBSHIRANI, R., WALTHER, G., and HASTIE, T. (2001), ”Estimating the Number of Clusters in a Data Set via the Gap Statistic”, Journal of the Royal Statistical Society B, 63, 411–423.
VASSILLIOU, A., TAMBOURATZIS, D.G., KOUTRAS, M.V., and BERSIMIS, S. (2004), ”A New Similarity Measure and Its Use in Determining the Number of Clusters in a Multivariate Data Set”, Communications in Statistics, Theory and Methods, 33, 1643–1666.
WISHART, D. (1978), CLUSTAN User Manual (3rd ed.), Program Library Unit, University of Edinburgh.
WOLFE, J.H. (1970), ”Pattern Clustering by Multivariate Mixture Analysis”, Multivariate Behavioral Research, 5, 329–350.
YANG, M-S., and PAN, J-A. (1997), ”On Fuzzy Clustering of Directional Data”, Fuzzy Sets and Systems, 91, 319–326.
Additional information
The authors thank Willem Heiser and two anonymous referees for helpful comments and valuable suggestions on an earlier draft of this paper.
Cite this article
Albatineh, A.N., Niewiadomska-Bugaj, M. MCS: A Method for Finding the Number of Clusters. J Classif 28, 184–209 (2011). https://doi.org/10.1007/s00357-010-9069-1
DOI: https://doi.org/10.1007/s00357-010-9069-1