Finding the Number of Disparate Clusters with Background Contamination

  • Conference paper
Data Science, Learning by Latent Structures, and Knowledge Discovery

Abstract

The Forward Search is used in an exploratory manner, with many random starts, to indicate the number of clusters and their membership in continuous data. The prospective clusters can readily be distinguished from background noise and from other forms of outliers. A confirmatory Forward Search, involving control on the sizes of statistical tests, establishes precise cluster membership. The method performs as well as robust methods such as TCLUST. However, it does not require prior specification of the number of clusters, nor of the level of trimming of outliers. In this way it is “user friendly”.
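The exploratory stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `forward_search` and `random_start_searches`, the initial subset size `p + 1`, and the default of 20 random starts are assumptions made for illustration, and the simulation envelopes and the confirmatory stage are omitted. Each search grows a subset one observation at a time and records the minimum Mahalanobis distance among the observations still outside the subset. In the forward search literature, a peak in that trajectory is usually read as a sign that the subset is about to leave a cluster.

```python
import numpy as np


def forward_search(X, start_idx):
    """Run one forward search from an initial subset (illustrative sketch)."""
    n, _ = X.shape
    subset = np.asarray(start_idx)
    min_dist = []  # minimum Mahalanobis distance of units outside the subset
    while len(subset) < n:
        mu = X[subset].mean(axis=0)
        cov = np.cov(X[subset], rowvar=False)
        diff = X - mu
        # pinv guards against a near-singular covariance in small subsets
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.pinv(cov), diff)
        d2 = np.maximum(d2, 0.0)  # remove tiny negative rounding errors
        outside = np.setdiff1d(np.arange(n), subset)
        min_dist.append(np.sqrt(d2[outside].min()))
        # grow the subset: keep the m + 1 units closest to the current fit
        subset = np.argsort(d2)[: len(subset) + 1]
    return np.array(min_dist)


def random_start_searches(X, n_starts=20, seed=0):
    """Run several searches from randomly chosen starting subsets."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m0 = p + 1  # assumed smallest subset giving a full-rank covariance
    return [
        forward_search(X, rng.choice(n, size=m0, replace=False))
        for _ in range(n_starts)
    ]
```

Plotting the returned trajectories against subset size gives the kind of display from which the number of tentative clusters would be read. Searches that start inside the same cluster tend to follow a common path, which is why many random starts are used.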

References

  • Atkinson, A. C., & Riani, M. (2007). Exploratory tools for clustering multivariate data. Computational Statistics and Data Analysis, 52, 272–285.

  • Atkinson, A. C., Riani, M., & Cerioli, A. (2004). Exploring multivariate data with the forward search. New York: Springer.

  • Atkinson, A. C., Riani, M., & Cerioli, A. (2006). Random start forward searches with envelopes for detecting clusters in multivariate data. In S. Zani, A. Cerioli, M. Riani, & M. Vichi (Eds.), Data Analysis, Classification and the Forward Search (pp. 163–171). Berlin: Springer.

  • Cerioli, A., & Perrotta, D. (2014). Robust clustering around regression lines with high density regions. Advances in Data Analysis and Classification, 8, 5–26.

  • Coretto, P., & Hennig, C. (2010). A simulation study to compare robust clustering methods based on mixtures. Advances in Data Analysis and Classification, 4, 111–135.

  • Fowlkes, E. B., Gnanadesikan, R., & Kettenring, J. R. (1988). Variable selection in clustering. Journal of Classification, 5, 205–228.

  • Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97, 611–631.

  • Fritz, H., García-Escudero, L. A., & Mayo-Iscar, A. (2012). TCLUST: An R package for a trimming approach to cluster analysis. Journal of Statistical Software, 47, 1–26.

  • Gallegos, M. T., & Ritter, G. (2009). Trimming algorithms for clustering contaminated grouped data and their robustness. Advances in Data Analysis and Classification, 3, 135–167.

  • García-Escudero, L. A., Gordaliza, A., Matrán, C., & Mayo-Iscar, A. (2008). A general trimming approach to robust cluster analysis. Annals of Statistics, 36, 1324–1345.

  • García-Escudero, L. A., Gordaliza, A., Matrán, C., & Mayo-Iscar, A. (2010). A review of robust clustering methods. Advances in Data Analysis and Classification, 4, 89–109.

  • García-Escudero, L. A., Gordaliza, A., Matrán, C., & Mayo-Iscar, A. (2011). Exploring the number of groups in model-based clustering. Statistics and Computing, 21, 585–599.

  • Hennig, C., & Christlieb, N. (2002). Validating visual clusters in large datasets: Fixed point clusters of spectral features. Computational Statistics and Data Analysis, 40, 723–739.

  • Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data. An introduction to cluster analysis. New York: Wiley.

  • Lee, S. X., & McLachlan, G. J. (2013). Model-based clustering and classification with non-normal mixture distributions. Statistical Methods and Applications, 22, 427–454.

  • Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.

  • Morelli, G. (2013). A comparison of different classification methods. Ph.D. dissertation, Università di Parma.

  • Riani, M., Atkinson, A. C., & Cerioli, A. (2009). Finding an unknown number of multivariate outliers. Journal of the Royal Statistical Society, Series B, 71, 447–466.

  • Riani, M., Perrotta, D., & Torti, F. (2012). FSDA: A MATLAB toolbox for robust analysis and interactive data exploration. Chemometrics and Intelligent Laboratory Systems, 116, 17–32.

  • Riani, M., Atkinson, A. C., & Perrotta, D. (2014). A parametric framework for the comparison of methods of very robust regression. Statistical Science, 29, 128–143.

Acknowledgements

We are very grateful to Berthold Lausen and Matthias Böhmer for their scientific and organizational support during the European Conference on Data Analysis 2013. We also thank an anonymous reviewer for careful reading of an earlier draft, and for pointing out the reference to Hennig and Christlieb (2002). Our work on this paper was partly supported by the project MIUR PRIN “MISURA—Multivariate Models for Risk Assessment”.

Author information

Corresponding author

Correspondence to Andrea Cerioli.

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Atkinson, A.C., Cerioli, A., Morelli, G., Riani, M. (2015). Finding the Number of Disparate Clusters with Background Contamination. In: Lausen, B., Krolak-Schwerdt, S., Böhmer, M. (eds) Data Science, Learning by Latent Structures, and Knowledge Discovery. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44983-7_3
