Abstract
Two key questions in clustering problems are how to determine the number of groups properly and how to measure the strength of group assignments. These questions are especially involved when a certain fraction of outlying data is also expected.
Any answer to these two questions should depend on the assumed probabilistic model, the allowed group scatters, and what we understand by noise. With this in mind, this work presents some exploratory “trimming-based” tools together with their justifications. These tools are based on monitoring the optimal values reached when solving a robust clustering criterion and on the use of some “discriminant” factors.
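The monitoring idea can be illustrated with a minimal sketch. It assumes trimmed k-means as a simplified stand-in for the robust clustering criterion (spherical groups with equal scatter), not the constrained model-based criterion the article studies. The function names `trimmed_kmeans` and `monitor_objective`, the synthetic data, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def trimmed_kmeans(X, k, alpha, n_starts=10, n_iter=50, seed=0):
    """Heuristic trimmed k-means (assumed stand-in criterion).

    Returns (objective, labels); label -1 marks trimmed observations.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_keep = n - int(np.floor(alpha * n))  # observations that are not trimmed
    best_obj, best_labels = np.inf, None
    for _ in range(n_starts):
        centers = X[rng.choice(n, size=k, replace=False)]
        for _ in range(n_iter):
            # squared distance from every point to every center
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            nearest = d2.argmin(axis=1)
            dmin = d2[np.arange(n), nearest]
            keep = np.argsort(dmin)[:n_keep]  # keep the n_keep closest points
            labels = np.full(n, -1)
            labels[keep] = nearest[keep]
            new_centers = centers.copy()
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:
                    new_centers[j] = members.mean(axis=0)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        obj = dmin[keep].sum()  # trimmed within-group sum of squares
        if obj < best_obj:
            best_obj, best_labels = obj, labels
    return best_obj, best_labels

def monitor_objective(X, ks, alphas):
    """Objective value reached for each (k, alpha) pair."""
    return {(k, a): trimmed_kmeans(X, k, a)[0] for k in ks for a in alphas}

# Synthetic data (assumption): two well-separated groups plus uniform noise.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(100, 2)),
    rng.uniform(low=-10.0, high=20.0, size=(20, 2)),
])
curves = monitor_objective(X, ks=[1, 2, 3], alphas=[0.0, 0.05, 0.1])
for (k, a), value in sorted(curves.items()):
    print(f"k={k} alpha={a:.2f} objective={value:.1f}")
```

Reading the printed values across k for a fixed trimming level, and across trimming levels for a fixed k, is the kind of exploratory comparison the abstract describes: a large drop in the objective when k increases suggests an additional group is needed, while a small drop suggests it is not.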
Additional information
Research partially supported by the Spanish Ministerio de Ciencia e Innovación, grants MTM2008-06067-C02-01 and 02, and by the Consejería de Educación y Cultura de la Junta de Castilla y León, GR150.
Cite this article
García-Escudero, L.A., Gordaliza, A., Matrán, C. et al. Exploring the number of groups in robust model-based clustering. Stat Comput 21, 585–599 (2011). https://doi.org/10.1007/s11222-010-9194-z
DOI: https://doi.org/10.1007/s11222-010-9194-z