A Monte Carlo study of the accuracy and robustness of ten bivariate location estimators
Introduction
Robust alternatives to the arithmetic mean for estimating location have a history going back at least to Laplace (see Stigler, 1986, p. 54). Fisher (1922) drew attention to the inefficiency of the arithmetic mean as an estimator of location for some distributions belonging to the family of Pearson curves near the normal. Using his normal contamination models, Tukey (1960) dramatically demonstrated how inefficient the mean can become as contamination increases. The same paper also shows how alternative location estimators such as the median or trimmed means can achieve higher asymptotic efficiency than the mean. As a result, statisticians have become more wary of making uncritical use of normal theory and more aware of the need for robust procedures, that is, procedures that remain good when the assumed model does not quite fit.
A theory of robust estimation was first developed by Huber (1964) in a paper that also introduced the so-called M-estimators of location, a class of estimators that includes the arithmetic mean, the median and maximum likelihood estimators. Huber defined a robust estimator of location as one that minimizes the asymptotic variance over some neighborhood of a known distribution (such as the normal). In the wake of Huber's work, several statisticians assessed the robustness and efficiency of various classes of location estimators for a variety of parent distributions; these classes include estimators defined by linear combinations of order statistics (L-estimators) and estimators defined by rank tests (R-estimators). This effort culminated in a major Monte Carlo robustness study conducted at Princeton by Andrews et al. (1972), in which the performances of 68 univariate estimators of location were compared on samples from a dozen distributions. For good historical accounts of the development of these ideas, see Huber (1972) and Hampel et al. (1986).
Bickel (1964) was one of the first to study efficiency and robustness for multivariate location estimators. In his paper, the vector of coordinate medians and the vector of coordinate medians of averages of pairs (vector of coordinate Hodges–Lehmann estimators) are compared with the mean with respect to asymptotic efficiency under various distributions, some close to the normal, some not; in addition, Bickel investigates the robustness of the vector of coordinate Hodges–Lehmann estimators when the underlying distribution is contaminated.
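The two coordinatewise estimators compared by Bickel can be illustrated with a short Python sketch. It is an illustration only, and it assumes that the pairwise averages include each observation paired with itself (i ≤ j); other conventions (i < j) exist, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement
from statistics import median


def coordinate_median(points):
    """Vector of coordinate medians of a list of (x, y) pairs."""
    xs, ys = zip(*points)
    return (median(xs), median(ys))


def hodges_lehmann_1d(values):
    """Median of the pairwise averages (v_i + v_j) / 2 over i <= j (assumed convention)."""
    averages = [(a + b) / 2 for a, b in combinations_with_replacement(values, 2)]
    return median(averages)


def coordinate_hodges_lehmann(points):
    """Vector of coordinate Hodges-Lehmann estimates of a list of (x, y) pairs."""
    xs, ys = zip(*points)
    return (hodges_lehmann_1d(xs), hodges_lehmann_1d(ys))


# Small made-up sample with one gross outlier.
sample = [(0.0, 1.0), (1.0, 0.0), (2.0, 2.0), (100.0, -50.0)]
print(coordinate_median(sample))          # (1.5, 0.5)
print(coordinate_hodges_lehmann(sample))  # (1.75, 0.25)
```

Both estimators stay close to the bulk of the data here, whereas the arithmetic mean of the same sample would be pulled to (25.75, -11.75) by the single outlier.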
Beginning in the seventies, a few authors obtained several multivariate versions of typically univariate notions such as medians, L-estimators and R-estimators. Four of the multivariate medians are now known as the spatial median (also called mediancenter or L1-median), the Tukey or half-space median, the Oja median and the Liu or simplicial median; these estimators were, respectively, proposed or discussed by Gini and Galvani (1929) (see also Haldane (1948)), Tukey (1975), Oja (1983) and Liu (1990). Some work has been done on the efficiency of the spatial median and the Oja median relative to the arithmetic mean; a good reference for some of the results is Hettmansperger and McKean (1998).
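As an illustration of the spatial median, which minimizes the sum of Euclidean distances to the data points, the following Python sketch uses a Weiszfeld-type fixed-point iteration. This is a common textbook approach and is not claimed to be the algorithm used in the study; the function name, tolerance and iteration cap are assumptions.

```python
import math


def spatial_median(points, tol=1e-10, max_iter=1000):
    """Approximate the point minimizing the sum of Euclidean distances to the data."""
    n = len(points)
    # Start the iteration from the arithmetic mean.
    x = sum(p[0] for p in points) / n
    y = sum(p[1] for p in points) / n
    for _ in range(max_iter):
        wsum = wx = wy = 0.0
        for px, py in points:
            d = math.hypot(px - x, py - y)
            if d < 1e-12:
                # Simplification: ignore a data point that coincides with the iterate.
                continue
            w = 1.0 / d
            wsum += w
            wx += w * px
            wy += w * py
        if wsum == 0.0:
            break  # every data point coincides with the current iterate
        nx, ny = wx / wsum, wy / wsum
        step = math.hypot(nx - x, ny - y)
        x, y = nx, ny
        if step < tol:
            break
    return (x, y)


# For four points in convex position the minimizer is the intersection of the diagonals.
print(spatial_median([(0, 0), (1, 0), (0, 1), (10, 10)]))  # close to (0.5, 0.5)
```

The example shows the robustness at stake: the distant point (10, 10) moves the arithmetic mean to (2.75, 2.75) but leaves the spatial median near the three clustered points.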
Through his notion of depth, Tukey (1975) initiated a very fruitful approach for defining multivariate location estimators. A depth can be seen as a device for measuring the centrality of a multivariate data point within a given data cloud. Each such function induces a center-outward ranking of data points within a given multivariate data set; this allows a multivariate generalization of univariate location estimators such as the median, trimmed means or, more generally, L-estimators of location. In particular, Tukey's depth is the basis for the so-called half-space median and a notion of trimmed mean used in this paper. Various depth functions have since appeared in the statistical literature (see Zuo and Serfling, 2000a), each one giving rise to a ranking of the data and, therefore, to a family of L-estimators of location. Besides Tukey's depth, the best-known depth function is the simplicial depth of Liu (1990), which is used in this paper through the Liu median and Liu depth-based trimmed means.
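To make the notion of half-space depth concrete, the sketch below computes the depth of a point in a bivariate cloud as the smallest number of data points contained in a closed half-plane whose boundary passes through that point. It is a brute-force illustration of the definition, quadratic in the sample size, and not one of the exact algorithms from the literature; the function name and the numerical tolerance are assumptions.

```python
import math


def halfspace_depth(theta, points):
    """Smallest number of points in a closed half-plane whose boundary contains theta."""
    tx, ty = theta
    diffs = [(px - tx, py - ty) for px, py in points]
    # Points equal to theta lie in every such half-plane.
    at_theta = sum(1 for dx, dy in diffs if dx == 0 and dy == 0)
    others = [(dx, dy) for dx, dy in diffs if dx != 0 or dy != 0]
    if not others:
        return at_theta
    # The count changes only when the normal direction u is perpendicular to some
    # x_i - theta, so it suffices to test one direction between consecutive
    # critical angles.
    crit = sorted(
        (math.atan2(dy, dx) + s * math.pi / 2) % (2 * math.pi)
        for dx, dy in others
        for s in (1, -1)
    )
    best = len(others)
    for i, a in enumerate(crit):
        b = crit[(i + 1) % len(crit)]
        gap = (b - a) % (2 * math.pi)
        if gap < 1e-12:
            continue  # repeated critical angle, no open interval to test
        mid = a + gap / 2
        ux, uy = math.cos(mid), math.sin(mid)
        count = sum(1 for dx, dy in others if ux * dx + uy * dy >= 0)
        best = min(best, count)
    return at_theta + best


cloud = [(1, 1), (1, -1), (-1, 1), (-1, -1), (0, 0)]
print(halfspace_depth((0, 0), cloud))  # 3: the central point is the deepest
print(halfspace_depth((1, 1), cloud))  # 1: a corner point
print(halfspace_depth((5, 5), cloud))  # 0: outside the convex hull
```

Ranking the data points by decreasing depth gives the center-outward ordering described above; a deepest point plays the role of a median, and averaging only the points above a depth threshold gives one form of depth-based trimmed mean.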
For a depth function to be a useful practical tool, ease of computation is a prerequisite. An algorithm was proposed by Niinimaa et al. (1992) to compute the Oja bivariate median. Until recently, no algorithms were available to compute the half-space and simplicial depths, which severely limited the applicability of these functions. An important advance came with Rousseeuw and Ruts (1996), who constructed exact algorithms for computing the half-space and simplicial depths of a point in a two-dimensional cloud. Rousseeuw and Ruts (1998) did the same for the computation of the bivariate Tukey median. Exact or approximate algorithms have also been obtained for higher dimensions: see Rousseeuw and Ruts (1998), Rousseeuw and Struyf (1998) and Struyf and Rousseeuw (2000).
Taking advantage of the algorithms recently proposed to compute depths, this Monte Carlo study aims to make a finite-sample comparison of the accuracy and robustness of ten bivariate estimators of location, six of which are based on Tukey's or Liu's depths. For each estimator, sample size and underlying distribution, performance is assessed through numerical functions of the estimated mean squared error and bias.
Little is known about the efficiency of the location estimators based on Tukey's or Liu's depth that are studied in this paper. Through a small simulation, Rousseeuw and Ruts (1998) empirically studied the efficiency of the Tukey median and the vector of medians for various sample sizes, with the standard bivariate normal as the underlying distribution. Fraiman and Meloche (1999) performed a similar study of six location estimators (including the vector of medians, as well as the spatial and Liu medians) for sample size 20 under various distributions, mostly different from ours. At present, the Monte Carlo method appears to be the only practical means of studying the efficiency of most depth-based location estimators. Indeed, for six of the depth-based location estimators chosen for this study, exact efficiency calculations remain intractable even for a normal distribution.
The paper is arranged as follows. Section 2 describes the location estimators studied in the simulation. Section 3 deals with the bivariate distributions which, along with the sample size, determine the sampling situations applied to the estimators. Section 4 covers some technical aspects of the Monte Carlo study and describes the numerical measures used to assess the estimators. Section 5 reports on the performances of the estimators and interprets the results. Section 6 concludes.
Bivariate location estimators selected for the study
Let F be a probability distribution in ℝ² and X1,X2,…,Xn a random sample from F. A bivariate location estimator can be informally described as an ℝ²-valued function Tn, defined for each sample size n, mapping the set of data points into some point Tn(X1,…,Xn), which we understand as an approximation of the location or center of F.
In this Monte Carlo study, it will always be assumed that 0 is the true location to be estimated. This is the natural center for 14 simulated distributions which are
Bivariate distributions simulated
Twenty-six distributions were investigated. Among these, 14 are centrally symmetric about 0, where central symmetry about 0 means that X and −X have the same distribution. The 12 remaining simulated distributions are asymmetric contaminated normal distributions. One of the distributions has finite support and the others have been chosen such that heavy-tailedness ranges from low to high.
Sample sizes and number of replications
In what follows, each combination of distribution and sample size is called a sampling situation. This Monte Carlo experiment studies the behavior of the 10 bivariate location estimators under 78 sampling situations, obtained by crossing three sample sizes (20, 60 and 200) with 26 distributions. For each sampling situation, 500 replications were used to take sampling variability into account.
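The structure of such an experiment can be sketched in Python for a single hypothetical distribution and two simple estimators. Everything specific here is an assumption made for illustration: the symmetric contaminated normal (10% contamination, scale 10), the seed, and the summary measures (average squared distance to the true location 0, and the norm of the average estimate), which need not coincide with the measures used in the study.

```python
import random
from statistics import fmean, median


def draw_point(rng, eps=0.1, scale=10.0):
    """One draw from an assumed symmetric contaminated bivariate normal centered at 0."""
    s = scale if rng.random() < eps else 1.0
    return (rng.gauss(0.0, s), rng.gauss(0.0, s))


def mean_vector(points):
    xs, ys = zip(*points)
    return (fmean(xs), fmean(ys))


def median_vector(points):
    xs, ys = zip(*points)
    return (median(xs), median(ys))


def simulate(estimator, n, replications=500, seed=0):
    """Return (estimated MSE, norm of estimated bias) when the true location is 0."""
    rng = random.Random(seed)
    estimates = [
        estimator([draw_point(rng) for _ in range(n)])
        for _ in range(replications)
    ]
    mse = fmean(tx * tx + ty * ty for tx, ty in estimates)
    bx = fmean(tx for tx, _ in estimates)
    by = fmean(ty for _, ty in estimates)
    return mse, (bx * bx + by * by) ** 0.5


# One sampling situation per sample size for this single assumed distribution.
for n in (20, 60, 200):
    for name, est in (("mean", mean_vector), ("coordinate median", median_vector)):
        mse, bias = simulate(est, n)
        print(f"n={n:3d}  {name:18s}  MSE={mse:.4f}  |bias|={bias:.4f}")
```

Using the same seed for every estimator means that all estimators see the same simulated samples within a sampling situation, which reduces the noise in their comparison.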
Algorithms
The Ranlib library of Fortran routines for random number generation has been used. Johnson (1987) provides
Results and interpretation
The results are presented in a series of five tables at the end of the paper. In all those tables, the performances of the location estimators are compared through the numerical measures M and B, or some function of these.
Conclusion
The performance of 10 bivariate location estimators has been investigated under 78 sampling situations, the primary concern always being accuracy and robustness. In addition to three sample sizes, 14 centrally symmetric distributions and 12 asymmetric contaminated normals were retained for the study, most of these distributions having some degree of heavy-tailedness.
The study has shown that four bivariate medians and, to a lesser degree, two depth-based trimmed means are very good alternatives
Acknowledgements
The authors greatly appreciate the constructive remarks and suggestions made by the referees, which led to improvement of the paper.
The research of the first author was supported by grants from the Natural Sciences and Engineering Research Council of Canada and the Fonds FCAR de la Province de Québec.
References
- et al. A note on the robustness of multivariate medians. Statist. Probab. Lett. (1999)
- Descriptive statistics for multivariate distributions. Statist. Probab. Lett. (1983)
- et al. Robust Estimates of Location: Survey and Advances (1972)
- et al. AS 143: the mediancentre. Appl. Statist. (1979)
- On some alternative estimates for shift in the p-variate one sample problem. Ann. Math. Statist. (1964)
- On a geometric notion of quantiles for multivariate data. J. Amer. Statist. Assoc. (1996)
- et al. Breakdown properties of location estimates based on half-space depth and projected outlyingness. Ann. Statist. (1992)
- On the mathematical foundations of theoretical statistics. Philos. Trans. Roy. Soc. London Ser. A (1922)
- et al. Multivariate L-estimation. Soc. Estad. Invest. Oper. Test (1999)
- Gini, C., Galvani, L., 1929. Di talune estensioni dei concetti di media ai caratteri qualitativi. Metron 8, 3–209....
- Note on the median of multivariate distribution. Biometrika
- Robust Statistics. The Approach Based on Influence Functions
- Robust Nonparametric Statistical Methods
- Robust estimation of a location parameter. Ann. Math. Statist.
- Robust statistics: a review. Ann. Math. Statist.
- Multivariate Statistical Simulation
- Theory of Point Estimation