Finding the Number of Disparate Clusters with Background Contamination

Atkinson, Anthony C.; Cerioli, Andrea; Morelli, Gianluca; Riani, Marco

doi:10.1007/978-3-662-44983-7_3

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

The Forward Search is used in an exploratory manner, with many random starts, to indicate the number of clusters and their membership in continuous data. The prospective clusters can readily be distinguished from background noise and from other forms of outliers. A confirmatory Forward Search, involving control on the sizes of statistical tests, establishes precise cluster membership. The method performs as well as robust methods such as TCLUST. However, it does not require prior specification of the number of clusters, nor of the level of trimming of outliers. In this way it is “user friendly”.
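
The exploratory step described in the abstract can be illustrated with a minimal sketch. It assumes the standard multivariate Forward Search recipe from the wider literature (for example Atkinson, Riani and Cerioli, 2004, listed in the references), not the exact procedure of this chapter: start from a random subset of p + 1 observations, fit a mean and covariance to the subset, grow the subset one unit at a time by keeping the observations with the smallest Mahalanobis distances, and record at each step the minimum distance among the units still outside the subset. The function names (`make_demo_data`, `forward_search`, `random_start_trajectories`) and the simulated data are hypothetical, and the sketch omits the simulation envelopes and the confirmatory testing stage.

```python
import numpy as np


def make_demo_data(seed=0):
    """Simulated data (an assumption): two Gaussian clusters plus uniform background noise."""
    rng = np.random.default_rng(seed)
    cluster_a = rng.normal([0.0, 0.0], 1.0, size=(60, 2))
    cluster_b = rng.normal([6.0, 6.0], 1.0, size=(60, 2))
    noise = rng.uniform(-4.0, 10.0, size=(20, 2))
    return np.vstack([cluster_a, cluster_b, noise])


def forward_search(X, start_idx):
    """Run one forward search from the initial subset `start_idx`.

    Returns, for each subset size m = len(start_idx), ..., n - 1, the minimum
    Mahalanobis distance among the observations not in the current subset.
    """
    n = X.shape[0]
    subset = np.asarray(start_idx)
    min_dist = []
    for m in range(len(subset), n):
        mu = X[subset].mean(axis=0)
        cov = np.cov(X[subset], rowvar=False)
        diff = X - mu
        # Squared Mahalanobis distances of all n units from the subset fit;
        # the pseudo-inverse guards against a near-singular covariance.
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.pinv(cov), diff)
        d2 = np.maximum(d2, 0.0)
        outside = np.ones(n, dtype=bool)
        outside[subset] = False
        min_dist.append(np.sqrt(d2[outside].min()))
        # The next subset holds the m + 1 units closest to the current fit.
        subset = np.argsort(d2)[: m + 1]
    return np.array(min_dist)


def random_start_trajectories(X, n_starts=20, seed=0):
    """Repeat the search from `n_starts` random subsets of size p + 1."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    runs = []
    for _ in range(n_starts):
        start = rng.choice(n, size=p + 1, replace=False)
        runs.append(forward_search(X, start))
    return np.vstack(runs)  # shape: (n_starts, n - p - 1)
```

Plotting each row of the returned array against subset size gives the kind of display used in this exploratory setting: searches that begin inside the same group tend to follow a common trajectory, and a sharp rise in the minimum distance suggests that the subset is about to absorb units from outside that group.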

References

Atkinson, A. C., & Riani, M. (2007). Exploratory tools for clustering multivariate data. Computational Statistics and Data Analysis, 52, 272–285.
Atkinson, A. C., Riani, M., & Cerioli, A. (2004). Exploring multivariate data with the forward search. New York: Springer.
Atkinson, A. C., Riani, M., & Cerioli, A. (2006). Random start forward searches with envelopes for detecting clusters in multivariate data. In S. Zani, A. Cerioli, M. Riani, & M. Vichi (Eds.), Data Analysis, Classification and the Forward Search (pp. 163–171). Berlin: Springer.
Cerioli, A., & Perrotta, D. (2014). Robust clustering around regression lines with high density regions. Advances in Data Analysis and Classification, 8, 5–26.
Coretto, P., & Hennig, C. (2010). A simulation study to compare robust clustering methods based on mixtures. Advances in Data Analysis and Classification, 4, 111–135.
Fowlkes, E. B., Gnanadesikan, R., & Kettenring, J. R. (1988). Variable selection in clustering. Journal of Classification, 5, 205–228.
Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97, 611–631.
Fritz, H., García-Escudero, L. A., & Mayo-Iscar, A. (2012). TCLUST: An R package for a trimming approach to cluster analysis. Journal of Statistical Software, 47, 1–26.
Gallegos, M. T., & Ritter, G. (2009). Trimming algorithms for clustering contaminated grouped data and their robustness. Advances in Data Analysis and Classification, 3, 135–167.
García-Escudero, L. A., Gordaliza, A., Matrán, C., & Mayo-Iscar, A. (2008). A general trimming approach to robust cluster analysis. Annals of Statistics, 36, 1324–1345.
García-Escudero, L. A., Gordaliza, A., Matrán, C., & Mayo-Iscar, A. (2010). A review of robust clustering methods. Advances in Data Analysis and Classification, 4, 89–109.
García-Escudero, L. A., Gordaliza, A., Matrán, C., & Mayo-Iscar, A. (2011). Exploring the number of groups in model-based clustering. Statistics and Computing, 21, 585–599.
Hennig, C., & Christlieb, N. (2002). Validating visual clusters in large datasets: Fixed point clusters of spectral features. Computational Statistics and Data Analysis, 40, 723–739.
Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data. An introduction to cluster analysis. New York: Wiley.
Lee, S. X., & McLachlan, G. J. (2013). Model-based clustering and classification with non-normal mixture distributions. Statistical Methods and Applications, 22, 427–454.
Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.
Morelli, G. (2013). A comparison of different classification methods. Ph.D. dissertation, Università di Parma.
Riani, M., Atkinson, A. C., & Cerioli, A. (2009). Finding an unknown number of multivariate outliers. Journal of the Royal Statistical Society, Series B, 71, 447–466.
Riani, M., Perrotta, D., & Torti, F. (2012). FSDA: A MATLAB toolbox for robust analysis and interactive data exploration. Chemometrics and Intelligent Laboratory Systems, 116, 17–32.
Riani, M., Atkinson, A. C., & Perrotta, D. (2014). A parametric framework for the comparison of methods of very robust regression. Statistical Science, 29, 128–143.

Acknowledgements

We are very grateful to Berthold Lausen and Matthias Böhmer for their scientific and organizational support during the European Conference on Data Analysis 2013. We also thank an anonymous reviewer for careful reading of an earlier draft, and for pointing out the reference to Hennig and Christlieb (2002). Our work on this paper was partly supported by the project MIUR PRIN “MISURA - Multivariate Models for Risk Assessment”.

Author information

Authors and Affiliations

Department of Statistics, London School of Economics, London, WC2A 2AE, UK
Anthony C. Atkinson
Dipartimento di Economia, Università di Parma, Parma, Italy
Andrea Cerioli, Gianluca Morelli & Marco Riani

Corresponding author

Correspondence to Andrea Cerioli.

Editor information

Editors and Affiliations

University of Essex, Colchester, United Kingdom
Berthold Lausen
University of Luxembourg, Walferdange, Luxembourg
Sabine Krolak-Schwerdt
University of Luxembourg, Walferdange, Luxembourg
Matthias Böhmer

Cite this paper

Atkinson, A.C., Cerioli, A., Morelli, G., Riani, M. (2015). Finding the Number of Disparate Clusters with Background Contamination. In: Lausen, B., Krolak-Schwerdt, S., Böhmer, M. (eds) Data Science, Learning by Latent Structures, and Knowledge Discovery. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44983-7_3

DOI: https://doi.org/10.1007/978-3-662-44983-7_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44982-0
Online ISBN: 978-3-662-44983-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
