A simulation study to compare robust clustering methods based on mixtures

  • Regular Article
Advances in Data Analysis and Classification

Abstract

The following mixture model-based clustering methods are compared in a simulation study with one-dimensional data, a fixed number of clusters, and a focus on outliers and uniform “noise”: an ML-estimator (MLE) for Gaussian mixtures, an MLE for a mixture of Gaussians and a uniform distribution (interpreted as a “noise component” to catch outliers), an MLE for a mixture of Gaussian distributions plus a fixed uniform distribution over the range of the data (Fraley and Raftery in Comput J 41:578–588, 1998), a pseudo-MLE for a Gaussian mixture with an improper fixed constant over the real line to catch “noise” (RIMLE; Hennig in Ann Stat 32(4): 1313–1340, 2004), and MLEs for mixtures of t-distributions with and without estimation of the degrees of freedom (McLachlan and Peel in Stat Comput 10(4):339–348, 2000). The RIMLE (using a method to choose the fixed constant first proposed in Coretto, The noise component in model-based clustering. PhD thesis, Department of Statistical Science, University College London, 2008) is the best method in some, and acceptable in all, simulation setups, and can therefore be recommended.
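
To make the idea of an improper constant “noise” density concrete, the following is a minimal sketch of an EM-type iteration for a one-dimensional Gaussian mixture with a fixed constant c added as a noise component. It is not the authors' implementation: the function names, the starting noise proportion, and the lower bound `sigma_min` on the standard deviations are assumptions made for illustration, and c is simply supplied by the user here rather than chosen by the method from Coretto (2008).

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_mixture_with_constant_noise(x, mu, sigma, c, n_iter=200, sigma_min=1e-3):
    """EM-type fit of a 1-D Gaussian mixture plus a fixed constant noise density c.

    x, mu, sigma : observations, initial means, initial standard deviations
    c            : fixed improper noise density, c > 0 (user-supplied assumption)
    sigma_min    : illustrative floor on the standard deviations (assumption)
    Returns (pi0, pi, mu, sigma, labels); label 0 means noise, j + 1 means cluster j.
    """
    if c <= 0:
        raise ValueError("c must be positive")
    n, k = len(x), len(mu)
    mu, sigma = list(mu), list(sigma)
    pi0 = 0.1                     # starting noise proportion (arbitrary choice)
    pi = [(1.0 - pi0) / k] * k    # starting cluster proportions
    for _ in range(n_iter):
        # E-step: posterior weights for the noise component and each cluster.
        tau0, tau = [], []
        for xi in x:
            parts = [pi[j] * normal_pdf(xi, mu[j], sigma[j]) for j in range(k)]
            denom = pi0 * c + sum(parts)
            tau0.append(pi0 * c / denom)
            tau.append([p / denom for p in parts])
        # M-step: weighted proportions, means and standard deviations.
        pi0 = sum(tau0) / n
        for j in range(k):
            w = sum(t[j] for t in tau)
            pi[j] = w / n
            if w < 1e-12:
                continue  # cluster has lost all weight; keep its parameters
            mu[j] = sum(t[j] * xi for t, xi in zip(tau, x)) / w
            var = sum(t[j] * (xi - mu[j]) ** 2 for t, xi in zip(tau, x)) / w
            sigma[j] = max(math.sqrt(var), sigma_min)
    # Hard assignment: each point goes to the component with the largest weighted density.
    labels = []
    for xi in x:
        scores = [pi0 * c] + [pi[j] * normal_pdf(xi, mu[j], sigma[j]) for j in range(k)]
        labels.append(scores.index(max(scores)))
    return pi0, pi, mu, sigma, labels


# Usage: two Gaussian clusters plus one gross outlier.
random.seed(1)
sample = ([random.gauss(0.0, 1.0) for _ in range(100)]
          + [random.gauss(10.0, 1.0) for _ in range(100)]
          + [1000.0])
result = fit_mixture_with_constant_noise(sample, [1.0, 9.0], [2.0, 2.0], c=0.01)
print("estimated means:", result[2], "label of the outlier:", result[4][-1])
```

Because the constant c does not depend on how far a point lies from the clusters, a gross outlier is absorbed by the noise component instead of distorting a cluster mean or variance. How well this works in practice depends on the choice of c, which is the subject of the method referred to in the abstract.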

References

  • Banfield J, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49: 803–821

  • Coretto P (2008) The noise component in model-based clustering. PhD thesis, Department of Statistical Science, University College London. http://www.ontherubicon.com/pietro/docs/phdthesis.pdf

  • Cuesta-Albertos JA, Gordaliza A, Matrán C (1997) Trimmed k-means: an attempt to robustify quantizers. Ann Stat 25: 553–576

  • Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41: 578–588

  • Fraley C, Raftery AE (2006) MCLUST version 3 for R: normal mixture modeling and model-based clustering. Technical report 504, Department of Statistics, University of Washington

  • Gallegos MT, Ritter G (2005) A robust method for cluster analysis. Ann Stat 33(1): 347–380

  • García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2008) A general trimming approach to robust cluster analysis. Ann Stat 36(3): 1324–1345

  • Hathaway RJ (1985) A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Ann Stat 13: 795–800

  • Hennig C (2004) Breakdown points for maximum likelihood estimators of location-scale mixtures. Ann Stat 32(4): 1313–1340

  • Hennig C (2005) Robustness of ML estimators of location-scale mixtures. In: Baier D, Wernecke KD (eds) Innovations in classification, data science, and information systems. Springer, Heidelberg, pp 128–137

  • Hennig C, Coretto P (2008) The noise component in model-based cluster analysis. In: Preisach C, Burkhardt H, Schmidt-Thieme L, Decker R (eds) Data analysis, machine learning and applications. Springer, Berlin, pp 127–138

  • Hosmer DW (1978) Comment on “Estimating mixtures of normal distributions and switching regressions” by R. Quandt and J.B. Ramsey. J Am Stat Assoc 73(364): 730–752

  • Karlis D, Xekalaki E (2003) Choosing initial values for the EM algorithm for finite mixtures. Comput Stat Data Anal 41(3–4): 577–590

  • Liu C (1997) ML estimation of the multivariate t distribution and the EM algorithms. J Multivar Anal 63: 296–312

  • McLachlan G, Krishnan T (1997) The EM algorithm and extensions. Wiley, New York

  • McLachlan G, Peel D (2000) Robust mixture modelling using the t-distribution. Stat Comput 10(4): 339–348

  • Neykov N, Filzmoser P, Dimova R, Neytchev P (2007) Robust fitting of mixtures using the trimmed likelihood estimator. Comput Stat Data Anal 52(1): 299–308

  • Redner R, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26: 195–239

Author information

Corresponding author

Correspondence to Pietro Coretto.

About this article

Cite this article

Coretto, P., Hennig, C. A simulation study to compare robust clustering methods based on mixtures. Adv Data Anal Classif 4, 111–135 (2010). https://doi.org/10.1007/s11634-010-0065-4

  • DOI: https://doi.org/10.1007/s11634-010-0065-4
