MCS: A Method for Finding the Number of Clusters

Journal of Classification

Abstract

This paper proposes a maximum clustering similarity (MCS) method for determining the number of clusters in a data set. The method studies the behavior of similarity indices that compare the partitions produced by two (of several) clustering methods. The similarity between the two clusterings is calculated at the same number of clusters, using the indices of Rand (R), Fowlkes and Mallows (FM), and Kulczynski (K), each corrected for chance agreement. The number of clusters at which the index attains its maximum is a candidate for the optimal number of clusters. The proposed method is applied to simulated bivariate normal data and extended to circular data. Its performance is compared with the criteria discussed in Tibshirani, Walther, and Hastie (2001). Because the method makes no distributional or other assumptions about the data, it is applicable to any type of data that can be clustered using at least two clustering algorithms.
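The selection rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it covers only the Rand index, it assumes the Hubert and Arabie (1985) form of the chance correction, and the function names `adjusted_rand` and `mcs_candidate` and the dictionary layout (number of clusters mapped to a list of labels) are hypothetical choices made for this illustration. The two clustering methods themselves are assumed to have been run beforehand.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_a, labels_b):
    """Rand index corrected for chance (Hubert and Arabie 1985 form, assumed)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two objects are needed to form a pair")
    # Pairs of objects placed together by both clusterings, and by each one.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (pairs_both - expected) / (max_index - expected)


def mcs_candidate(clusterings_a, clusterings_b):
    """Return (best_k, scores) over the k values present for both methods.

    Each argument is a hypothetical dict mapping a number of clusters k to
    the labels that one clustering method produced at that k.
    """
    shared_k = sorted(set(clusterings_a) & set(clusterings_b))
    scores = {k: adjusted_rand(clusterings_a[k], clusterings_b[k]) for k in shared_k}
    # Ties are resolved in favour of the smallest k (an arbitrary choice here).
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

For example, passing `{2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 1, 2, 2, 2]}` and `{2: [0, 0, 0, 1, 1, 1], 3: [0, 1, 1, 2, 2, 2]}` selects k = 2, because the two methods agree perfectly at two clusters and only partly at three. The trivial case k = 1 should be left out of the inputs, since any two one-cluster partitions agree completely.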

References

  • ALBATINEH, A.N., NIEWIADOMSKA-BUGAJ, M., and MIHALKO, D.P. (2006), “On Similarity Indices and Correction for Chance Agreement”, Journal of Classification, 23, 301–313.

  • ALBATINEH, A.N. (2010), “Means and Variances for a Family of Similarity Indices Used in Cluster Analysis”, Journal of Statistical Planning and Inference, 140, 2828–2838.

  • ANDREWS, D.F. (1972), “Plots of High Dimensional Data”, Biometrics, 28, 125–136.

  • BANFIELD, J.D., and RAFTERY, A.E. (1993), “Model-based Gaussian and Non-Gaussian Clustering”, Biometrics, 49, 803–821.

  • BARAGONA, R. (2003), “Further Results on Lund’s Statistic for Identifying Cluster in a Circular Data Set with Application to Time Series”, Communications in Statistics: Simulation and Computation, 32, 943–952.

  • BATSCHELET, E. (1981), Circular Statistics in Biology, London: Academic Press.

  • BOCK, H.H. (1985), “On Some Significance Tests in Cluster Analysis”, Journal of Classification, 2, 77–108.

  • BRECKENRIDGE, J.N. (1989), “Replicating Cluster Analysis: Method, Consistency, and Validity”, Multivariate Behavioral Research, 24, 147–161.

  • BRUSCO, M.J., and STEINLEY, D. (2007), “A Comparison of Heuristic Procedures for Minimum Within-Cluster Sums of Squares Partitioning”, Psychometrika, 72, 583–600.

  • CALINSKI, T., and HARABASZ, J. (1974), “A Dendrite Method for Cluster Analysis”, Communications in Statistics, 3, 1–27.

  • COHEN, J. (1960), “A Coefficient of Agreement for Nominal Scales”, Educational and Psychological Measurement, 20, 37–46.

  • DUDOIT, S., and FRIDLYAND, J. (2002), “A Prediction-based Resampling Method for Estimating the Number of Clusters in a Dataset”, Genome Biology, 3, 1–21.

  • EVERITT, B.S., LANDAU, S., and LEESE, M. (2001), Cluster Analysis, New York: Oxford University Press.

  • FISHER, N.I. (1993), Statistical Analysis of Circular Data, Cambridge: Cambridge University Press.

  • FOWLKES, E.B., and MALLOWS, C.L. (1983), “A Method for Comparing Two Hierarchical Clusterings”, Journal of the American Statistical Association, 78, 553–569.

  • FRALEY, C., and RAFTERY, A.E. (1998), “How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis”, The Computer Journal, 41, 578–588.

  • GOODMAN, L., and KRUSKAL, W. (1954), “Measures of Association for Cross Classifications”, Journal of the American Statistical Association, 49, 732–764.

  • GORDON, A.D. (1999), Classification (2nd ed.), St. Andrews: Chapman & Hall/CRC.

  • GUTTMAN, L. (1941), “An Outline of the Statistical Theory of Prediction”, in The Prediction of Personal Adjustment, ed. P. Horst, New York: Social Science Research Council.

  • HARDY, A. (1994), “An Examination of Procedures for Determining the Number of Clusters in a Data Set”, in New Approaches in Classification and Data Analysis, ed. E. Diday et al., Paris: Springer-Verlag, pp. 178–185.

  • HARDY, A. (1996), “On the Number of Clusters”, Computational Statistics and Data Analysis, 23, 83–96.

  • HUBERT, L., and ARABIE, P. (1985), “Comparing Partitions”, Journal of Classification, 2, 193–218.

  • HARTIGAN, J.A. (1975), Clustering Algorithms, New York: Wiley.

  • JAIN, A.K., and DUBES, R.C. (1988), Algorithms for Clustering Data, New Jersey: Prentice Hall.

  • KAUFMAN, L., and ROUSSEEUW, P.J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons.

  • KOZIOL, J.A. (1990), “Cluster Analysis of Antigenic Profiles of Tumors: Selection of Number of Clusters Using Akaike’s Information Criterion”, Methods of Information in Medicine, 29, 200–204.

  • KROLAK-SCHWERDT, S., and ECKES, T. (1992), “A Graph Theoretic Criterion for Determining the Number of Clusters in a Data Set”, Multivariate Behavioral Research, 27, 541–565.

  • KRZANOWSKI, W.J., and LAI, Y.T. (1988), “A Criterion for Determining the Number of Groups in a Data Set Using Sum of Squares Clustering”, Biometrics, 44, 23–34.

  • KULCZYNSKI, S. (1927), “Die Pflanzenassociationen der Pienenen”, Bulletin of the International Academy of Political Science Letters, Science Mathematics Nature, Series B, Supplement, 2, 57–203.

  • LANGE, T., ROTH, V., BRAUN, M.L., and BUHMANN, J.M. (2004), “Stability-Based Validation of Clustering Solutions”, Neural Computation, 16, 1299–1323.

  • LUND, U. (1999), “Cluster Analysis for Directional Data”, Communications in Statistics: Simulation and Computation, 4, 1001–1009.

  • MARDIA, K.V., and JUPP, P.E. (2000), Directional Statistics, England: John Wiley & Sons Ltd.

  • MARRIOTT, F.H.C. (1971), “Practical Problems in a Method of Cluster Analysis”, Biometrics, 27, 501–514.

  • MILLIGAN, G., and COOPER, M. (1985), “An Examination of Procedures for Determining the Number of Clusters in a Data Set”, Psychometrika, 50, 159–179.

  • MILLIGAN, G., and COOPER, M. (1986), “A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis”, Multivariate Behavioral Research, 21, 441–458.

  • MILLIGAN, G., SOON, S., and SOKOL, L. (1983), “The Effect of Cluster Size, Dimensionality, and the Number of Clusters on Recovery of True Cluster Structure”, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5, 40–47.

  • MOREY, L., and AGRESTI, A. (1984), “The Measurement of Classification Agreement: An Adjustment to the Rand Statistic for Chance Agreement”, Educational and Psychological Measurement, 44, 33–37.

  • PECK, R., FISHER, L., and NESS, V.J. (1989), “Approximate Confidence Intervals for the Number of Clusters”, Journal of the American Statistical Association, 84, 184–191.

  • R Development Core Team (2007), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, URL http://www.R-project.org.

  • RAND, W. (1971), “Objective Criteria for the Evaluation of Clustering Methods”, Journal of the American Statistical Association, 66, 846–850.

  • SAXENA, P.C., and NAVANEETHAM, K. (1991), “The Effect of Cluster Size, Dimensionality, and Number of Clusters on Recovery of True Cluster Structure Through Chernoff-Type Faces”, The Statistician, 40, 415–425.

  • SAXENA, P.C., and NAVANEETHAM, K. (1993), “Comparison of Chernoff-Type Face and Non-Graphical Methods for Clustering Multivariate Observations”, Computational Statistics and Data Analysis, 15, 63–79.

  • STEINLEY, D., and BRUSCO, M.J. (2007), “Initializing K-means Batch Clustering: A Critical Evaluation of Several Techniques”, Journal of Classification, 24, 99–121.

  • STRUYF, A., HUBERT, M., and ROUSSEEUW, P.J. (1997), “Integrating Robust Clustering Techniques in S-PLUS”, Computational Statistics and Data Analysis, 26, 17–37.

  • SUGAR, C.A., and JAMES, G.M. (2003), “Finding the Number of Clusters in a Dataset: An Information-Theoretic Approach”, Journal of the American Statistical Association, 98, 750–763.

  • TIBSHIRANI, R., WALTHER, G., and HASTIE, T. (2001), “Estimating the Number of Clusters in a Data Set via the Gap Statistic”, Journal of the Royal Statistical Society B, 63, 411–423.

  • VASSILLIOU, A., TAMBOURATZIS, D.G., KOUTRAS, M.V., and BERSIMIS, S. (2004), “A New Similarity Measure and Its Use in Determining the Number of Clusters in a Multivariate Data Set”, Communications in Statistics, Theory and Methods, 33, 1643–1666.

  • WISHART, D. (1978), CLUSTAN User Manual (3rd ed.), Program Library Unit, University of Edinburgh.

  • WOLFE, J.H. (1970), “Pattern Clustering by Multivariate Mixture Analysis”, Multivariate Behavioral Research, 5, 329–350.

  • YANG, M-S., and PAN, J-A. (1997), “On Fuzzy Clustering of Directional Data”, Fuzzy Sets and Systems, 91, 319–326.

Author information

Corresponding author

Correspondence to Ahmed N. Albatineh.

Additional information

The authors thank Willem Heiser and two anonymous referees for helpful comments and valuable suggestions on an earlier draft of this paper.

About this article

Cite this article

Albatineh, A.N., Niewiadomska-Bugaj, M. MCS: A Method for Finding the Number of Clusters. J Classif 28, 184–209 (2011). https://doi.org/10.1007/s00357-010-9069-1

  • DOI: https://doi.org/10.1007/s00357-010-9069-1
