Trimming algorithms for clustering contaminated grouped data and their robustness

  • Regular Article
Advances in Data Analysis and Classification

Abstract

We establish an affine equivariant, constrained heteroscedastic model and a criterion with trimming for clustering contaminated, grouped data. We show the existence of the maximum likelihood estimator, propose a method for determining an appropriate constraint, and design a strategy for finding reasonable clusterings. Finally, we compute breakdown points of the estimated parameters, thereby showing asymptotic robustness of the method.

Author information

Corresponding author

Correspondence to Gunter Ritter.

About this article

Cite this article

Gallegos, M.T., Ritter, G. Trimming algorithms for clustering contaminated grouped data and their robustness. Adv Data Anal Classif 3, 135–167 (2009). https://doi.org/10.1007/s11634-009-0044-9

  • DOI: https://doi.org/10.1007/s11634-009-0044-9
