Robust clustering around regression lines with high density regions

Cerioli, Andrea; Perrotta, Domenico

doi:10.1007/s11634-013-0151-5

Regular Article
Published: 11 September 2013

Advances in Data Analysis and Classification, volume 8, pages 5–26 (2014)

Abstract

Robust methods are needed to fit regression lines when outliers are present. In a clustering framework, outliers can be extreme observations, high leverage points, but also data points which lie among the groups. Outliers are also of paramount importance in the analysis of international trade data, which motivate our work, because they may provide information about anomalies like fraudulent transactions. In this paper we show that robust techniques can fail when a large proportion of non-contaminated observations fall in a small region, which is a likely occurrence in many international trade data sets. In such instances, the effect of a high-density region is so strong that it can override the benefits of trimming and other robust devices. We propose to solve the problem by sampling a much smaller subset of observations which preserves the cluster structure and retains the main outliers of the original data set. This goal is achieved by defining the retention probability of each point as an inverse function of the estimated density function for the whole data set. We motivate our proposal as a thinning operation on a point pattern generated by different components. We then apply robust clustering methods to the thinned data set for the purposes of classification and outlier detection. We show the advantages of our method both in empirical applications to international trade examples and through a simulation study.
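The thinning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian product kernel, the fixed bandwidth, and the specific retention rule (minimum estimated density divided by the density at each point) are assumptions chosen only to give one concrete "inverse function of the estimated density". The function names `kernel_density`, `retention_probabilities` and `thin` are hypothetical.

```python
import math
import random


def kernel_density(points, bandwidth):
    """Gaussian kernel density estimate evaluated at every data point.

    Assumption: an isotropic bivariate Gaussian kernel with one fixed bandwidth.
    """
    n = len(points)
    norm = n * 2.0 * math.pi * bandwidth ** 2
    densities = []
    for xi, yi in points:
        total = 0.0
        for xj, yj in points:
            d2 = (xi - xj) ** 2 + (yi - yj) ** 2
            total += math.exp(-0.5 * d2 / bandwidth ** 2)
        densities.append(total / norm)
    return densities


def retention_probabilities(densities):
    """Assumed rule: p_i = min(density) / density_i, so 0 < p_i <= 1.

    Points in the sparsest area get probability 1; points in a
    high-density region get a probability close to 0.
    """
    d_min = min(densities)
    return [d_min / d for d in densities]


def thin(points, bandwidth, rng):
    """Keep each point independently with its retention probability."""
    probs = retention_probabilities(kernel_density(points, bandwidth))
    kept = [pt for pt, p in zip(points, probs) if rng.random() < p]
    return kept, probs


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical data: a dense blob plus a sparse group along a line.
    dense = [(1 + rng.gauss(0, 0.05), 1 + rng.gauss(0, 0.05)) for _ in range(200)]
    sparse = [(float(x), 2.0 * x) for x in range(3, 13)]
    kept, probs = thin(dense + sparse, bandwidth=0.5, rng=rng)
    print("original size:", len(dense) + len(sparse), "thinned size:", len(kept))
```

A robust clustering method would then be fitted to the thinned sample `kept` rather than to the full data set. In practice the bandwidth and the form of the retention function would need to be chosen with care, since they determine how strongly the high-density region is down-weighted.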

Notes

The legal basis for the fight against fraud is Article 325 of the Treaty on the Functioning of the European Union. The pillars of the common commercial policy are in the Treaty Establishing the European Community, Part Three (Community policies), Title IX (Common commercial policy), Articles 133, 113 (EC Treaty, Maastricht consolidated version) and 113 (EEC Treaty).
See the web interface to ComExt: http://epp.eurostat.ec.europa.eu/newxtweb/.

Author information

Authors and Affiliations

University of Parma, Parma, Italy
Andrea Cerioli
European Commission, Joint Research Centre, Ispra, Italy
Domenico Perrotta

Corresponding author

Correspondence to Andrea Cerioli.

About this article

Cite this article

Cerioli, A., Perrotta, D. Robust clustering around regression lines with high density regions. Adv Data Anal Classif 8, 5–26 (2014). https://doi.org/10.1007/s11634-013-0151-5

Received: 21 December 2012
Revised: 04 June 2013
Accepted: 18 July 2013
Published: 11 September 2013
Issue Date: March 2014
DOI: https://doi.org/10.1007/s11634-013-0151-5
