MCS: A Method for Finding the Number of Clusters

Albatineh, Ahmed N.; Niewiadomska-Bugaj, Magdalena

doi:10.1007/s00357-010-9069-1

Published: 17 December 2010

Volume 28, pages 184–209 (2011)

Journal of Classification

Ahmed N. Albatineh¹ &
Magdalena Niewiadomska-Bugaj²

Abstract

This paper proposes a maximum clustering similarity (MCS) method for determining the number of clusters in a data set by studying the behavior of similarity indices that compare two (of several) clustering methods. The similarity between the two clusterings is calculated at the same number of clusters, using the indices of Rand (R), Fowlkes and Mallows (FM), and Kulczynski (K), each corrected for chance agreement. The number of clusters at which the index attains its maximum is a candidate for the optimal number of clusters. The proposed method is applied to simulated bivariate normal data and further extended to circular data. Its performance is compared to the criteria discussed in Tibshirani, Walther, and Hastie (2001). The proposed method is not based on any distributional or data assumption, which makes it widely applicable to any type of data that can be clustered using at least two clustering algorithms.
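
The selection rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the Hubert and Arabie (1985) adjusted Rand index as the chance-corrected index (the paper's own corrections for R, FM, and K may differ in detail), the function names `adjusted_rand_index` and `mcs_candidate` are hypothetical, and the label vectors are made-up toy data standing in for the output of two clustering methods run at each k.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie chance-corrected Rand index for two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two items are needed to form a pair")
    # Number of item pairs placed together by both, by A only, and by B only.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # expected agreement under chance
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings form a single cluster.
        return 1.0
    return (pairs_both - expected) / (max_index - expected)


def mcs_candidate(clusterings_a, clusterings_b):
    """Score each shared k and return the k with the largest corrected index.

    Each argument maps a number of clusters k to the label vector one
    clustering method produced at that k.
    """
    shared_ks = sorted(set(clusterings_a) & set(clusterings_b))
    scores = {k: adjusted_rand_index(clusterings_a[k], clusterings_b[k])
              for k in shared_ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy labelings (assumed for illustration) from two methods at k = 2 and k = 3.
method_a = {2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 1, 2, 2, 2]}
method_b = {2: [0, 0, 0, 1, 1, 1], 3: [0, 1, 1, 2, 2, 2]}
best_k, scores = mcs_candidate(method_a, method_b)
print(best_k, scores)
```

In this toy case the two methods agree perfectly at k = 2 and only partly at k = 3, so k = 2 is returned as the candidate. The search here starts at k = 2 because any two single-cluster labelings agree trivially.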

References

ALBATINEH, A.N., NIEWIADOMSKA-BUGAJ, M., and MIHALKO, D.P. (2006), “On Similarity Indices and Correction for Chance Agreement”, Journal of Classification, 23, 301–313.
ALBATINEH, A.N. (2010), “Means and Variances for a Family of Similarity Indices Used in Cluster Analysis”, Journal of Statistical Planning and Inference, 140, 2828–2838.
ANDREWS, D.F. (1972), “Plots of High Dimensional Data”, Biometrics, 28, 125–136.
BANFIELD, J.D., and RAFTERY, A.E. (1993), “Model-based Gaussian and Non-Gaussian Clustering”, Biometrics, 49, 803–821.
BARAGONA, R. (2003), “Further Results on Lund’s Statistic for Identifying Cluster in a Circular Data Set with Application to Time Series”, Communications in Statistics: Simulation and Computation, 32, 943–952.
BATSCHELET, E. (1981), Circular Statistics in Biology, London: Academic Press.
BOCK, H.H. (1985), “On Some Significance Tests in Cluster Analysis”, Journal of Classification, 2, 77–108.
BRECKENRIDGE, J.N. (1989), “Replicating Cluster Analysis: Method, Consistency, and Validity”, Multivariate Behavioral Research, 24, 147–161.
BRUSCO, M.J., and STEINLEY, D. (2007), “A Comparison of Heuristic Procedures for Minimum Within-Cluster Sums of Squares Partitioning”, Psychometrika, 72, 583–600.
CALINSKI, R.B., and HARABASZ, J. (1974), “A Dendrite Method for Cluster Analysis”, Communications in Statistics, 3, 1–27.
COHEN, A.J. (1960), “A Coefficient of Agreement for Nominal Scales”, Educational and Psychological Measurement, 3, 37–46.
DUDOIT, S., and FRIDLYAND, J. (2002), “A Prediction-based Resampling Method for Estimating the Number of Clusters in a Dataset”, Genome Biology, 3, 1–21.
EVERITT, B.S., LANDAU, S., and LEESE, M. (2001), Cluster Analysis, New York: Oxford University Press.
FISHER, N.I. (1993), Statistical Analysis of Circular Data, Cambridge: Cambridge University Press.
FOWLKES, E.B., and MALLOWS, C.L. (1983), “A Method for Comparing Two Hierarchical Clusterings”, Journal of the American Statistical Association, 78, 553–569.
FRALEY, C., and RAFTERY, A.E. (1998), “How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis”, The Computer Journal, 41, 578–588.
GOODMAN, L., and KRUSKAL, W. (1954), “Measures of Association for Cross Classifications”, Journal of the American Statistical Association, 49, 732–764.
GORDON, A.D. (1999), Classification (2nd ed.), St. Andrews: Chapman & Hall/CRC.
GUTTMAN, L. (1941), “An Outline of the Statistical Theory of Prediction”, in Prediction of Personal Adjustment, ed. P. Horst, New York: Social Science Research Council.
HARDY, A. (1994), “An Examination of Procedures for Determining the Number of Clusters in a Data Set”, in New Approaches in Classification and Data Analysis, ed. E. Diday et al., Paris: Springer-Verlag, pp. 178–185.
HARDY, A. (1996), “On the Number of Clusters”, Computational Statistics and Data Analysis, 23, 83–96.
HUBERT, L., and ARABIE, P. (1985), “Comparing Partitions”, Journal of Classification, 2, 193–218.
HARTIGAN, J.A. (1975), Clustering Algorithms, New York: Wiley.
JAIN, A.K., and DUBES, R.C. (1988), Algorithms for Clustering Data, New Jersey: Prentice Hall.
KAUFMAN, L., and ROUSSEEUW, P.J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons.
KOZIOL, J.A. (1990), “Cluster Analysis of Antigenic Profiles of Tumors: Selection of Number of Clusters Using Akaike’s Information Criterion”, Methods of Information in Medicine, 29, 200–204.
KROLAK-SCHWERDT, S., and ECKES, T. (1992), “A Graph Theoretic Criterion for Determining the Number of Clusters in a Data Set”, Multivariate Behavioral Research, 27, 541–565.
KRZANOWSKI, W.J., and LAI, Y.T. (1985), “A Criterion for Determining the Number of Groups in a Data Set Using Sum of Squares Clustering”, Biometrics, 44, 23–34.
KULCZYNSKI, S. (1927), “Die Pflanzenassociationen der Pienenen”, Bulletin of the International Academy of Political Science Letters, Science Mathematics Nature, Series B, Supplement, 2, 57–203.
LANGE, T., ROTH, V., BRAUN, M.L., and BUHMANN, J.M. (2004), “Stability-Based Validation of Clustering Solutions”, Neural Computation, 16, 1299–1323.
LUND, U. (1999), “Cluster Analysis for Directional Data”, Communications in Statistics: Simulation and Computation, 4, 1001–1009.
MARDIA, K.V., and JUPP, P.E. (2000), Directional Statistics, England: John Wiley & Sons Ltd.
MARRIOTT, F.H.C. (1971), “Practical Problems in a Method of Cluster Analysis”, Biometrics, 27, 501–514.
MILLIGAN, G., and COOPER, M. (1985), “An Examination of Procedures for Determining the Number of Clusters in a Data Set”, Psychometrika, 50, 159–179.
MILLIGAN, G., and COOPER, M. (1986), “A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis”, Multivariate Behavioral Research, 21, 441–458.
MILLIGAN, G., SOON, S., and SOKOL, L. (1983), “The Effect of Cluster Size, Dimensionality, and the Number of Clusters on Recovery of True Cluster Structure”, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5, 40–47.
MOREY, L., and AGRESTI, A. (1984), “The Measurement of Classification Agreement: An Adjustment to the Rand Statistic for Chance Agreement”, Educational and Psychological Measurement, 44, 33–37.
PECK, R., FISHER, L., and NESS, V.J. (1989), “Approximate Confidence Intervals for the Number of Clusters”, Journal of the American Statistical Association, 84, 184–191.
R Development Core Team (2007), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, URL http://www.R-project.org.
RAND, W. (1971), “Objective Criteria for the Evaluation of Clustering Methods”, Journal of the American Statistical Association, 66, 846–850.
SAXENA, P.C., and NAVANEERHAM, K. (1991), “The Effect of Cluster Size, Dimensionality, and Number of Clusters on Recovery of True Cluster Structure Through Chernoff-Type Faces”, The Statistician, 40, 415–425.
SAXENA, P.C., and NAVANEERHAM, K. (1993), “Comparison of Chernoff-Type Face and Non-Graphical Methods for Clustering Multivariate Observations”, Computational Statistics and Data Analysis, 15, 63–79.
STEINLEY, D., and BRUSCO, M.J. (2007), “Initializing K-means Batch Clustering: A Critical Evaluation of Several Techniques”, Journal of Classification, 24, 99–121.
STRUYF, A., HUBERT, M., and ROUSSEEUW, P.J. (1997), “Integrating Robust Clustering Techniques in S-PLUS”, Computational Statistics and Data Analysis, 26, 17–37.
SUGAR, C.A., and JAMES, G.M. (2003), “Finding the Number of Clusters in a Dataset: An Information-Theoretic Approach”, Journal of the American Statistical Association, 98, 750–763.
TIBSHIRANI, R., WALTHER, G., and HASTIE, T. (2001), “Estimating the Number of Clusters in a Data Set via the Gap Statistic”, Journal of the Royal Statistical Society B, 63, 411–423.
VASSILLIOU, A., TAMBOURATZIS, D.G., KOUTRAS, M.V., and BERSIMIS, S. (2004), “A New Similarity Measure and Its Use in Determining the Number of Clusters in a Multivariate Data Set”, Communications in Statistics, Theory and Methods, 33, 1643–1666.
WISHART, D. (1978), CLUSTAN User Manual (3rd ed.), Program Library Unit, University of Edinburgh.
WOLFE, J.H. (1970), “Pattern Clustering by Multivariate Mixture Analysis”, Multivariate Behavioral Research, 5, 329–350.
YANG, M-S., and PAN, J-A. (1997), “On Fuzzy Clustering of Directional Data”, Fuzzy Sets and Systems, 91, 319–326.

Author information

Authors and Affiliations

Department of Epidemiology and Biostatistics, Florida International University, Miami, FL, USA
Ahmed N. Albatineh
Department of Statistics, Western Michigan University, Kalamazoo, MI, USA
Magdalena Niewiadomska-Bugaj

Corresponding author

Correspondence to Ahmed N. Albatineh.

Additional information

The authors thank Willem Heiser and two anonymous referees for helpful comments and valuable suggestions on an earlier draft of this paper.

About this article

Cite this article

Albatineh, A.N., Niewiadomska-Bugaj, M. MCS: A Method for Finding the Number of Clusters. J Classif 28, 184–209 (2011). https://doi.org/10.1007/s00357-010-9069-1

Published: 17 December 2010
Issue Date: July 2011
DOI: https://doi.org/10.1007/s00357-010-9069-1
