Robust clustering around regression lines with high density regions

  • Regular Article
  • Advances in Data Analysis and Classification

Abstract

Robust methods are needed to fit regression lines when outliers are present. In a clustering framework, outliers can be extreme observations or high-leverage points, but also data points that lie between the groups. Outliers are also of paramount importance in the analysis of international trade data, which motivates our work, because they may provide information about anomalies such as fraudulent transactions. In this paper we show that robust techniques can fail when a large proportion of non-contaminated observations fall in a small region, which is a likely occurrence in many international trade data sets. In such instances, the effect of a high-density region is so strong that it can override the benefits of trimming and other robust devices. We propose to solve the problem by sampling a much smaller subset of observations which preserves the cluster structure and retains the main outliers of the original data set. This goal is achieved by defining the retention probability of each point as an inverse function of the estimated density function for the whole data set. We motivate our proposal as a thinning operation on a point pattern generated by different components. We then apply robust clustering methods to the thinned data set for the purposes of classification and outlier detection. We show the advantages of our method both in empirical applications to international trade examples and through a simulation study.
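
The thinning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact form of the retention function, so the sketch assumes a retention probability proportional to the reciprocal of a Gaussian kernel density estimate, capped at 1. The function names, the bandwidth rule, the target size and the simulated data are all hypothetical.

```python
import numpy as np

def kde_at_points(xy, bandwidth):
    """Gaussian product-kernel density estimate evaluated at each data point."""
    diff = (xy[:, None, :] - xy[None, :, :]) / bandwidth   # shape (n, n, d)
    sq_dist = np.sum(diff ** 2, axis=2)
    d = xy.shape[1]
    norm = (2.0 * np.pi) ** (d / 2) * np.prod(bandwidth)
    return np.exp(-0.5 * sq_dist).sum(axis=1) / (len(xy) * norm)

def retention_probabilities(density, target_size):
    """Assumed rule: p_i proportional to 1 / density_i, capped at 1.

    Before capping the probabilities sum to target_size, so the expected
    size of the thinned sample is at most target_size.
    """
    inverse = 1.0 / density
    return np.minimum(1.0, target_size * inverse / inverse.sum())

def thin(xy, target_size, bandwidth, rng):
    """Keep each point independently with its retention probability."""
    p = retention_probabilities(kde_at_points(xy, bandwidth), target_size)
    keep = rng.random(len(xy)) < p
    return keep, p

# Hypothetical data: 900 points packed near the origin along one line,
# 100 points spread along a second line.
rng = np.random.default_rng(0)
x_dense = rng.uniform(0.0, 0.5, 900)
x_sparse = rng.uniform(0.0, 10.0, 100)
xy = np.column_stack([
    np.concatenate([x_dense, x_sparse]),
    np.concatenate([2.0 * x_dense + rng.normal(0.0, 0.05, 900),
                    0.5 * x_sparse + rng.normal(0.0, 0.05, 100)]),
])
is_dense = np.arange(len(xy)) < 900

# Assumed bandwidth: Scott-type rule for two dimensions, n ** (-1 / 6).
bandwidth = xy.std(axis=0) * len(xy) ** (-1.0 / 6.0)
keep, p = thin(xy, target_size=200, bandwidth=bandwidth, rng=rng)

print("retained points:", int(keep.sum()))
print("mean retention, dense region:", p[is_dense].mean())
print("mean retention, sparse group:", p[~is_dense].mean())
```

Points in the high-density region receive small retention probabilities, while isolated points, including potential outliers, are retained with probability close to 1. A robust clustering method would then be run on `xy[keep]` rather than on the full data set.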

Notes

  1. The legal basis for the fight against fraud is Article 325 of the Treaty on the Functioning of the European Union. The pillars of the common commercial policy are in the Treaty Establishing the European Community, Part Three (Community policies), Title IX (Common commercial policy), Articles 133, 113 (EC Treaty, Maastricht consolidated version) and 113 (EEC Treaty).

  2. See the web interface to ComExt: http://epp.eurostat.ec.europa.eu/newxtweb/.


Corresponding author

Correspondence to Andrea Cerioli.

About this article

Cite this article

Cerioli, A., Perrotta, D. Robust clustering around regression lines with high density regions. Adv Data Anal Classif 8, 5–26 (2014). https://doi.org/10.1007/s11634-013-0151-5

  • DOI: https://doi.org/10.1007/s11634-013-0151-5
