Abstract
We propose an approach that uses the Delaunay triangulation to identify a robust, outlier-free subsample. Provided that the non-outlying points form a convex data structure (e.g. of elliptical shape), this subsample can then be used to estimate location and scatter robustly by applying the classical mean and covariance. The estimators derived from our approach are shown to have a high breakdown point. In addition, we provide a diagnostic plot for expanding the initial subset in a data-driven way, which further increases the estimators’ efficiency.
References
Allard, D., Fraley, C.: Nonparametric maximum likelihood estimation of features in spatial point processes using Voronoi tessellation. J. Am. Stat. Assoc. 92(440), 1485–1493 (1997)
Alqallaf, F., Van Aelst, S., Yohai, V., Zamar, R.: Propagation of outliers in multivariate data. Ann. Stat. 37(1), 311–331 (2009)
Amenta, N., Attali, D., Devillers, O.: Complexity of Delaunay triangulation for points on lower-dimensional polyhedra. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete algorithms, SODA’07, pp. 1106–1113. Society for Industrial and Applied Mathematics, Philadelphia (2007)
Amenta, N., Attali, D., Devillers, O.: A tight bound for the Delaunay triangulation of points on a polyhedron. Rapport de recherche RR-6522, INRIA (2008)
Atkinson, A., Riani, M., Cerioli, A.: Exploring Multivariate Data with the Forward Search. Springer, Berlin (2004)
Attali, D., Boissonnat, J.-D., Lieutier, A.: Complexity of the Delaunay triangulation of points on surfaces: the smooth case. In: Proceedings of the Nineteenth Annual Symposium on Computational Geometry, SCG’03, pp. 201–210. ACM, New York (2003)
Barnett, V., Lewis, T.: Outliers in Statistical Data, 3rd edn. Wiley, New York (2000)
Becker, C., Gather, U.: The masking breakdown point of multivariate outlier identification rules. J. Am. Stat. Assoc. 94, 947–955 (1999)
Becker, C., Paris Scholz, S.: MVE, MCD, and MZE: a simulation study comparing convex body minimizers. Allg. Stat. Arch. 88(2), 155–162 (2004)
Becker, C., Paris Scholz, S.: Deepest points and least deep points: robustness and outliers with MZE. In: Spiliopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W. (eds.) From Data and Information Analysis to Knowledge Engineering, pp. 254–261. Springer, Berlin (2006)
Cignoni, P., Montani, C., Scopigno, R.: DeWall: a fast divide and conquer Delaunay triangulation algorithm in E^d. Comput. Aided Des. 30(5), 333–341 (1998)
Croux, C., Haesbroeck, G.: Principal component analysis based on robust estimators of the covariance or correlation matrix: influence functions and efficiencies. Biometrika 87(3), 603 (2000)
Davies, P.: Asymptotic behaviour of S-estimates of multivariate location parameters and dispersion matrices. Ann. Stat. 15(3), 1269–1292 (1987)
Davies, P.: The asymptotics of Rousseeuw’s minimum volume ellipsoid estimator. Ann. Stat. 20(4), 1828–1843 (1992)
Davies, P., Gather, U.: Breakdown and groups. Ann. Stat. 33(3), 977–988 (2005)
Davies, P., Gather, U.: Addendum to the discussion of “breakdown and groups”. Ann. Stat. 34(3), 1577–1579 (2006)
De Berg, M., Cheong, O., Van Kreveld, M., Overmars, M.: Computational Geometry: Algorithms and Applications. Springer, New York (2008)
Delaunay, B.: Sur la sphere vide. Izv. Akad. Nauk SSSR, Ser. VII, Otd. Mat. Est. Nauk 7, 793–800 (1934)
Donoho, D.: Breakdown properties of multivariate location estimators. Ph.D. thesis (1982)
Donoho, D., Huber, P.: The notion of breakdown point. In: A Festschrift for Erich Lehmann, pp. 157–184 (1983)
Gather, U., Becker, C.: Outlier identification and robust methods. In: Maddala, G., Rao, C. (eds.) Robust Inference. Handbook of Statistics, vol. 15, pp. 123–143 (1997)
Gnanadesikan, R., Kettenring, J.R.: Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28(1), 81–124 (1972)
Gower, J.: Euclidean distance geometry. Math. Sci. 7, 1–14 (1982)
Gower, J.C.: Algorithm AS 78: the mediancentre. J. R. Stat. Soc., Ser. C, Appl. Stat. 23(3), 466–470 (1974)
Gower, J.C.: Properties of Euclidean and non-Euclidean distance matrices. In: Linear Algebra and its Applications, vol. 67, pp. 81–97 (1985)
Hampel, F., Ronchetti, E., Rousseeuw, P., Stahel, W.: Robust Statistics: The Approach Based on Influence Functions. Wiley, New York (2005)
Hubert, M., Rousseeuw, P., Van Aelst, S.: High-breakdown robust multivariate methods. Stat. Sci. 23(1), 92–119 (2008)
Kirschstein, T., Liebscher, S., Becker, C.: Robust Estimation of Location and Scatter by Pruning the Minimum Spanning Tree (2012, submitted for publication)
Leach, G.: Improving worst-case optimal Delaunay triangulation algorithms. In: 4th Canadian Conference on Computational Geometry, p. 15 (1992)
Liebscher, S., Kirschstein, T., Becker, C.: The flood algorithm—a multivariate, self-organizing-map-based, robust location and covariance estimator. Stat. Comput. 22, 325–336 (2012). doi:10.1007/s11222-011-9250-3
Lopuhaä, H.: Asymptotics of reweighted estimators of multivariate location and scatter. Ann. Stat. 27(5), 1638–1665 (1999)
Lopuhaä, H., Rousseeuw, P.: Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Ann. Stat. 19(1), 229–248 (1991)
Maronna, R., Martin, D., Yohai, V.: Robust Statistics: Theory and Methods. John Wiley and Sons, Chichester (2006)
Maronna, R.A., Zamar, R.H.: Robust estimates of location and dispersion for high-dimensional datasets. Technometrics 44(4), 307–317 (2002)
McMullen, P.: The maximum numbers of faces of a convex polytope. Mathematika 17(02), 179–184 (1970)
Paris Scholz, S.: Robustness concepts and investigations for estimators of convex bodies. Ph.D. thesis (2002)
Pison, G., van Aelst, S., Willems, G.: Small sample corrections for LTS and MCD. Metrika 55, 111–123 (2002)
Riani, M., Atkinson, A., Cerioli, A.: Finding an unknown number of multivariate outliers. J. R. Stat. Soc., Ser. B, Stat. Methodol. 71(2), 447–466 (2009)
Rocke, D.: Robustness properties of S-estimators of multivariate location and shape in high dimension. Ann. Stat. 24(3), 1327–1345 (1996)
Rocke, D., Woodruff, D.: Computation of robust estimators of multivariate location and shape. Stat. Neerl. 47(1), 27–42 (1993)
Rousseeuw, P.: Multivariate estimation with high breakdown point. Math. Stat. Appl. 8, 283–297 (1985)
Rousseeuw, P., Leroy, A.: Robust Regression and Outlier Detection. John Wiley and Sons, New York (1987)
Rousseeuw, P., van Driessen, K.: A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212–223 (1999)
Su, P., Drysdale, R.L.S.: A comparison of sequential Delaunay triangulation algorithms. Comput. Geom. 7(5–6), 361–385. 11th ACM Symposium on Computational Geometry (1997)
Tyler, D.E.: A distribution-free M-estimator of multivariate scatter. Ann. Stat. 15(1), 234–251 (1987)
Appendix: Breakdown points of RDELA based estimators
Theorem 1
Let \(\mathbf{X}\) be a collection of \(n\geq d+1\) points \(\mathbf{x}_{1},\ldots,\mathbf{x}_{n}\) in dimension \(d\), \(\mathbf{x}_{i}\in\mathbb{R}^{d}\), \(i=1,\ldots,n\), in general position, and let \(\mathbf{T}(\mathbf{X})=\{\mathbf{t}_{k},\ k=1,\ldots,l\}\) be the corresponding Delaunay triangulation (where \(\mathbf{t}_{k}\) denotes a \(d\)-simplex with \(\mathbf{t}_{k}=\{\mathbf{x}_{i_{1}},\ldots,\mathbf{x}_{i_{d+1}}\}\subset \mathbf{X}\)), and let \(\mu_{n}\) and \(\Sigma_{n}\) be the RDELA estimators of location and covariance (as described in Sects. 2.2 and 2.3). Then the breakdown point \(\varepsilon^{*}\) of these estimators (defined as the smallest fraction of outliers that can carry the estimate beyond all bounds) is \(\varepsilon^{*}(\mu_{n},\mathbf{X})=\lfloor (n+1)/2\rfloor/n\) and \(\varepsilon^{*}(\Sigma_{n},\mathbf{X})\geq\lfloor (n-d+1)/2\rfloor/n\), respectively.
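To make the two bounds of Theorem 1 concrete, the following minimal sketch evaluates them for a few sample sizes. The function names are illustrative and not part of the paper; exact rational arithmetic is used only to avoid rounding.

```python
from fractions import Fraction


def location_breakdown_point(n: int) -> Fraction:
    """Breakdown point floor((n + 1) / 2) / n of the location estimator (Theorem 1)."""
    if n < 1:
        raise ValueError("n must be positive")
    return Fraction((n + 1) // 2, n)


def covariance_breakdown_lower_bound(n: int, d: int) -> Fraction:
    """Lower bound floor((n - d + 1) / 2) / n for the covariance estimator (Theorem 1)."""
    if n < d + 1:
        raise ValueError("Theorem 1 assumes n >= d + 1")
    return Fraction((n - d + 1) // 2, n)


if __name__ == "__main__":
    # Hypothetical sample sizes and dimensions, chosen only for illustration.
    for n, d in [(20, 2), (100, 2), (100, 10)]:
        loc = location_breakdown_point(n)
        cov = covariance_breakdown_lower_bound(n, d)
        print(f"n={n}, d={d}: location {float(loc):.3f}, covariance >= {float(cov):.3f}")
```

The covariance bound falls below the location bound as \(d\) grows, because at least \(d+1\) unaltered points are needed to keep the covariance estimate from imploding.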
Proof
We first show that \(\varepsilon^{*}(\mu_{n},\mathbf{X})\) is at least \(\lfloor (n+1)/2\rfloor/n\).
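For the location estimate to break down, at least one point in the chosen subset has to diverge to infinity. If \(\|\mathbf{x}\|\to\infty\) for some \(\mathbf{x}\in\mathbf{X}\), then \(\|\mathbf{x}-\mathbf{y}\|\to\infty\) for every \(\mathbf{y}\in\mathbf{X}\) that remains bounded. For a triangulation object \(\mathbf{t}_{k}\) (a subset of \(\mathbf{X}\) consisting of \(d+1\) points) with \(\mathbf{x},\mathbf{y}\in\mathbf{t}_{k}\), it follows from the circumcircle property that the radius of \(\mathbf{t}_{k}\) also grows to infinity, \(r(\mathbf{t}_{k})\to\infty\).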
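The role of the circumcircle property can be illustrated in the plane (\(d=2\)), where a triangulation object is a triangle and its radius is the circumradius. The sketch below is an illustration under that assumption, not the authors' implementation: it moves one vertex away from the other two and checks that the circumradius is never smaller than half the longest edge, so it must diverge together with the moving point.

```python
import math


def circumradius(a, b, c):
    """Circumradius of the planar triangle with vertices a, b, c given as (x, y) tuples."""
    ab = math.dist(a, b)
    bc = math.dist(b, c)
    ca = math.dist(c, a)
    # Triangle area from the cross product of two edge vectors.
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    area = abs(cross) / 2.0
    if area == 0.0:
        raise ValueError("points are collinear, so no circumcircle exists")
    # Standard identity: R = (product of edge lengths) / (4 * area).
    return ab * bc * ca / (4.0 * area)


def longest_edge(a, b, c):
    """Length of the longest edge of the triangle a, b, c."""
    return max(math.dist(a, b), math.dist(b, c), math.dist(c, a))


if __name__ == "__main__":
    a, b = (0.0, 0.0), (1.0, 0.0)
    # The third vertex moves away from the two fixed ones.
    for s in (1.0, 10.0, 100.0, 1000.0):
        c = (0.5, s)
        r = circumradius(a, b, c)
        print(f"s={s:7.1f}  radius={r:10.3f}  longest edge / 2={longest_edge(a, b, c) / 2:10.3f}")
```

Since all vertices of a triangle lie on its circumcircle, no two of them can be further apart than the diameter, which is exactly the inequality used in the proof.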
Suppose that \(\mathbf{X}=\mathbf{X}^{\prime}\cup\mathbf{X}^{\prime\prime}\), where \(|\mathbf{X}^{\prime}|>|\mathbf{X}^{\prime\prime}|=m=\lfloor (n-1)/2\rfloor\), and that \(\mathbf{X}^{\prime\prime}\) is altered in an arbitrary way to cause the procedure’s breakdown. Denote by \(\mathbf{Y}\subset\mathbf{X}\) the subset chosen by the RDELA procedure, with \(|\mathbf{Y}|=|\mathbf{X}^{\prime}|=n-m\). Furthermore, let \(\widetilde{\mathbf {T}}_{\mathbf{X}^{\prime}}= \{\mathbf {t}'_{k}=\{\mathbf{x}_{i_{1}},\ldots,\mathbf{x}_{i_{d+1}}\},\mathbf{x}_{i_{j}}\in \mathbf{X}^{\prime} \}\) and \(\widetilde{\mathbf {T}}_{\mathbf {Y}}= \{\mathbf {t}^{y}_{k}=\{\mathbf{x}_{i_{1}},\ldots,\mathbf{x}_{i_{d+1}}\},\mathbf{x}_{i_{j}}\in \mathbf {Y}\}\) be the sets of the corresponding triangulation objects, with \(\widetilde{\mathbf {T}}_{\mathbf{X}^{\prime}},\widetilde{\mathbf {T}}_{\mathbf {Y}}\subseteq \mathbf {T}(\mathbf{X})\). Then the radii of all triangulation objects \(\mathbf {t}'_{k}\) are bounded, i.e. \(\exists\,\alpha\in\mathbb{R}\) with \(r(\mathbf {t}'_{k})<\alpha\).
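Suppose now that there exists \(\mathbf{Y}'\subset\mathbf{Y}\) such that \(\|\mathbf{x}-\mathbf{y}\|\geq 2\alpha\) for \(\mathbf{x}\in\mathbf{Y}'\) and \(\mathbf{y}\in\mathbf{Y}\setminus\mathbf{Y}'\). Then it follows from the circumcircle property that for the triangulation objects \(\mathbf {t}^{y'}_{k} \in\widetilde{\mathbf {T}}_{\mathbf {Y}}\) with \(\mathbf{x},\mathbf {y}\in \mathbf {t}^{y'}_{k}\):
\[ r(\mathbf {t}^{y'}_{k}) \geq \tfrac{1}{2}\,\|\mathbf{x}-\mathbf{y}\| \geq \alpha. \]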
On the other hand, if \(\mathbf{Y}=\mathbf{X}^{\prime}\), then \(r(\mathbf {t}^{y}_{k}) < \alpha\) holds for all triangulation objects \(\mathbf {t}^{y}_{k} \in\widetilde{\mathbf {T}}_{\mathbf {Y}}\), as stated above. Hence, the algorithm described in Sect. 2.3 always terminates in \(\mathbf{X}^{\prime}\). This proves that the breakdown point of the RDELA location estimator is at least \(\lfloor (n+1)/2\rfloor/n\).
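The step from the number \(m\) of altered points to the stated bound can be spelled out as follows. This is a short worked calculation using only the definitions above; the sample size \(n=10\) is an arbitrary illustration.

```latex
\[
m + 1 \;=\; \left\lfloor \frac{n-1}{2} \right\rfloor + 1 \;=\; \left\lfloor \frac{n+1}{2} \right\rfloor ,
\qquad\text{hence}\qquad
\varepsilon^{*}(\mu_{n},\mathbf{X}) \;\geq\; \frac{m+1}{n} \;=\; \frac{\lfloor (n+1)/2 \rfloor}{n}.
\]
% Example: n = 10 gives m = 4 and |X'| = 6 > 4; altering 4 points does not cause
% breakdown, so at least 5 of the 10 points must be altered, i.e. a fraction of 5/10.
```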
The Delaunay triangulation is invariant under orthogonal transformations because its calculation is based purely on Euclidean distances, which are invariant under this type of transformation. Consequently, this is also true for the estimators derived from the Delaunay triangulation.
Since the maximum breakdown point of orthogonal equivariant location estimators is proved to be \(\lfloor (n+1)/2\rfloor/n\) (Lopuhaä and Rousseeuw, 1991), the lower bound shown above is attained, which completes the proof for the location estimator.
Now, we show that \(\varepsilon^{*}(\Sigma_{n},\mathbf{X})\geq\lfloor (n-d+1)/2\rfloor/n\).
Denote by \(0<\lambda_{1}\leq\cdots\leq\lambda_{d}<\infty\) the eigenvalues of \(\Sigma_{n}(\mathbf{Y})\). The covariance estimator may break down by explosion (meaning that \(\lambda_{d}\to\infty\)) or by implosion (if \(\lambda_{1}\to 0\)). For \(\Sigma_{n}\) to explode, at least one observation \(\mathbf{y}\) from the chosen subset \(\mathbf{Y}\) has to grow arbitrarily, \(\|\mathbf{y}\|\to\infty\). This case is covered by the location estimator’s proof.
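For the case of implosion, note that each chosen subset \(\mathbf{Y}\), which is of size \(\lfloor (n+d+1)/2\rfloor\) by construction, contains at least \(d+1\) unaltered points \(\mathbf{x}_{i}\in\mathbf{X}^{\prime}\). Because \(\mathbf{X}^{\prime}\) is in general position, each subset of \(\mathbf{X}^{\prime}\) consisting of \(d+1\) points is also in general position. It follows that \(\Sigma_{n}\) is positive definite (i.e. \(\lambda_{i}>0\ \forall i\)). This proves that \(\varepsilon^{*}(\Sigma_{n},\mathbf{X})\) is at least \(\lfloor (n-d+1)/2\rfloor/n\). □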
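The implosion argument can be checked numerically in the plane (\(d=2\)): three points in general position give a covariance matrix with strictly positive eigenvalues, whereas three collinear points give a zero eigenvalue. The sketch below is only an illustration; the helper names are hypothetical, and the divisor \(n-1\) is an assumption about what "classical covariance" means here.

```python
import math


def covariance_2d(points):
    """Classical sample covariance (divisor n - 1, an assumption) of planar points.

    Returns the entries (s_xx, s_xy, s_yy) of the symmetric 2 x 2 matrix.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    s_xx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    s_yy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    s_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    return s_xx, s_xy, s_yy


def eigenvalues_sym_2x2(s_xx, s_xy, s_yy):
    """Eigenvalues (smallest first) of the symmetric matrix [[s_xx, s_xy], [s_xy, s_yy]]."""
    centre = (s_xx + s_yy) / 2.0
    radius = math.hypot((s_xx - s_yy) / 2.0, s_xy)
    return centre - radius, centre + radius


if __name__ == "__main__":
    general_position = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # d + 1 = 3 points, not collinear
    degenerate = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]        # all three on one line
    print("general position:", eigenvalues_sym_2x2(*covariance_2d(general_position)))
    print("collinear:       ", eigenvalues_sym_2x2(*covariance_2d(degenerate)))
```

As long as the chosen subset retains \(d+1\) unaltered points in general position, the smallest eigenvalue cannot be driven to zero by the altered points.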
We conjecture that the breakdown point of the covariance estimator is at most \(\lfloor (n-d+1)/2\rfloor/n\). However, this has not been proved yet, primarily because a triangulation is not defined for degenerate point configurations, which in turn are required to make the covariance estimate implode.
Cite this article
Liebscher, S., Kirschstein, T. & Becker, C. RDELA—a Delaunay-triangulation-based, location and covariance estimator with high breakdown point. Stat Comput 23, 677–688 (2013). https://doi.org/10.1007/s11222-012-9337-5
DOI: https://doi.org/10.1007/s11222-012-9337-5