Abstract
Robust methods are needed to fit regression lines when outliers are present. In a clustering framework, outliers can be extreme observations, high leverage points, but also data points which lie among the groups. Outliers are also of paramount importance in the analysis of international trade data, which motivate our work, because they may provide information about anomalies like fraudulent transactions. In this paper we show that robust techniques can fail when a large proportion of non-contaminated observations fall in a small region, which is a likely occurrence in many international trade data sets. In such instances, the effect of a high-density region is so strong that it can override the benefits of trimming and other robust devices. We propose to solve the problem by sampling a much smaller subset of observations which preserves the cluster structure and retains the main outliers of the original data set. This goal is achieved by defining the retention probability of each point as an inverse function of the estimated density function for the whole data set. We motivate our proposal as a thinning operation on a point pattern generated by different components. We then apply robust clustering methods to the thinned data set for the purposes of classification and outlier detection. We show the advantages of our method both in empirical applications to international trade examples and through a simulation study.
Similar content being viewed by others
Notes
The legal basis for the fight against fraud is Article 325 of the Treaty on the Functioning of the European Union. The pillars of the common commercial policy are in the Treaty Establishing the European Community, Part Three (Community policies), Title IX (Common commercial policy), Articles 133, 113 (EC Treaty, Maastricht consolidated version) and 113 (EEC Treaty).
See the web interface to ComExt: http://epp.eurostat.ec.europa.eu/newxtweb/.
References
Atkinson AC, Riani M, Cerioli A (2004) Exploring multivariate data with the forward search. Springer, New York
Atkinson AC, Riani M, Cerioli A (2010) The forward search: theory and data analysis. J Korean Stat Soc 39:117–134
Baddeley A, Turner R (2012) Package ‘spatstat’: spatial point pattern analysis, model-fitting, simulation, tests. http://www.cran.r-project.org/web/packages/spatstat/spatstat.pdf
Bai X, Yao W, Boyer JE (2012) Robust fitting of mixture regression models. Comput Stat Data Anal 56:2347–2359
Byers S, Raftery AE (1998) Nearest-neighbor clutter removal for estimating features in spatial point processes. J Am Stat Assoc 93:577–584
Coretto P, Hennig C (2010) A simulation study to compare robust clustering methods based on mixtures. Adv Data Anal Classif 4:111–135
Dasgupta A, Raftery AE (1998) Detecting features in spatial point processes with clutter via model-based clustering. J Am Stat Assoc 93:294–302
De Battisti F, Salini S (2013) Robust analysis of bibliometric data. Stat Methods Appl 22:269–283
Diggle PJ (1985) A kernel method for smoothing point process data. Appl Stat 34:138–147
FATF-OECD, Financial Action Task Force (2006) Trade based money laundering. http://www.fatf-gafi.org/
FATF-OECD, Financial Action Task Force (2008) Best practices on trade based money laundering. http://www.fatf-gafi.org/
Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97:611–631
Fritz H, Garcìa-Escudero LA, Mayo-Iscar A (2012) tclust: an R package for a trimming approach to Cluster Analysis. J Stat Softw 47.
Garcìa-Escudero LA, Gordaliza A, Van Aelst S, Zamar R (2009) Robust linear clustering. J R Stat Soc B 71:301–319
Garcìa-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2010a) A review of robust clustering methods. Adv Data Anal Classif 4:89–109
Garcìa-Escudero LA, Gordaliza A, Mayo-Iscar A (2010b) Robust clusterwise linear regression through trimming. Comput Stat Data Anal 54:3057–3069
Heikkonen J, Perrotta D, Riani M, Torti F (2013) Issues on clustering and data gridding. In: Giusti A, Ritter G, Vichi M (eds) Classification and data mining. Springer, Berlin, pp 37–44
Illian J, Penttinen A, Stoyan H, Stoyan D (2008) Statistical analysis and modelling of spatial point patterns. Wiley, Chichester
Neykov N, Filzmoser P, Dimova R, Neytchev P (2007) Robust fitting of mixtures using the trimmed likelihood estimator. Comput Stat Data Anal 52:299–308
Riani M, Atkinson AC, Cerioli A et al (2012) Problems and challenges in the analysis of complex data: static and dynamic approaches. In: Di Ciaccio A (ed) Advanced statistical methods for the analysis of large data-sets. Springer, Berlin, pp 145–157
Riani M, Cerioli A, Atkinson AC, Perrotta D, Torti F et al (2008) Fitting mixtures of regression lines with the forward search. In: Fogelman-Soulié F (ed) Mining massive data sets for security. IOS Press, Amsterdam, pp 271–286
Rocci R, Gattone SA, Vichi M (2009) A new dimension reduction method: factor discriminant K-means. J Classif 28:210–226
Van Aelst S, Wang X, Zamar R, Zhu R (2006) Linear grouping using orthogonal regression. Comput Stat Data Anal 50:1287–1312
Vichi M, Rocci R, Kiers HAL (2007) Simultaneous component and clustering models for three-way data: within and between approaches. J Classif 24:71–98
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Cerioli, A., Perrotta, D. Robust clustering around regression lines with high density regions. Adv Data Anal Classif 8, 5–26 (2014). https://doi.org/10.1007/s11634-013-0151-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11634-013-0151-5
Keywords
- Anti-fraud
- Concentrated noise
- International trade
- Orthogonal regression
- Outlier detection
- Robust clusterwise regression
- TCLUST
- Thinning
- Trimming