Exploring the number of groups in robust model-based clustering

Statistics and Computing

Abstract

Two key questions in clustering problems are how to determine the number of groups properly and how to measure the strength of group assignments. These questions are especially involved when a certain fraction of outlying data is also expected.

Any answer to these two questions should depend on the assumed probabilistic model, the allowed group scatters, and what we understand by noise. With this in mind, this work presents some exploratory “trimming-based” tools together with their justifications. These tools are based on monitoring the optimal values reached when solving a robust clustering criterion and on the use of some “discriminant” factors.
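
As a rough illustration of what monitoring such optimal values can look like, the sketch below records the best objective value found for each pair of group number k and trimming fraction alpha. It is not the procedure of the article: it assumes plain trimmed k-means (spherical groups with equal scatter) as a simplified stand-in for a robust clustering criterion, and the names `trimmed_kmeans` and `monitor` are hypothetical.

```python
import numpy as np


def trimmed_kmeans(X, k, alpha, n_starts=10, n_iter=50, seed=0):
    """Best trimmed k-means objective found for a given (k, alpha).

    Assumption: the criterion is the sum of squared distances to the
    nearest centre over the retained points, after discarding the
    ceil(alpha * n) points farthest from their nearest centre.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_keep = n - int(np.ceil(alpha * n))
    best = np.inf
    for _ in range(n_starts):
        # Random start: k distinct observations as initial centres.
        centres = X[rng.choice(n, size=k, replace=False)]
        for _ in range(n_iter):
            # Squared distances from every point to every centre.
            d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Trimming step: keep the n_keep points closest to a centre.
            keep = np.argsort(d2.min(axis=1))[:n_keep]
            new_centres = centres.copy()
            for j in range(k):
                members = keep[labels[keep] == j]
                if members.size > 0:
                    new_centres[j] = X[members].mean(axis=0)
            if np.allclose(new_centres, centres):
                break
            centres = new_centres
        # Objective of this start, evaluated at the final centres.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        best = min(best, np.sort(d2.min(axis=1))[:n_keep].sum())
    return best


def monitor(X, ks, alphas):
    """Objective values on a grid of group numbers and trimming levels."""
    return {(k, a): trimmed_kmeans(X, k, a) for k in ks for a in alphas}
```

Plotting the values returned by `monitor` against alpha, one curve per k, gives a simple picture of how much is gained by adding a group at each trimming level. Because the search uses random starts, the values are the best found rather than guaranteed optima.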

Author information

Corresponding author

Correspondence to L. A. García-Escudero.

Additional information

Research partially supported by the Spanish Ministerio de Ciencia e Innovación, grants MTM2008-06067-C02-01 and 02, and by the Consejería de Educación y Cultura de la Junta de Castilla y León, GR150.

About this article

Cite this article

García-Escudero, L.A., Gordaliza, A., Matrán, C. et al. Exploring the number of groups in robust model-based clustering. Stat Comput 21, 585–599 (2011). https://doi.org/10.1007/s11222-010-9194-z

  • DOI: https://doi.org/10.1007/s11222-010-9194-z
