A Monte Carlo study of the accuracy and robustness of ten bivariate location estimators

https://doi.org/10.1016/S0167-9473(02)00103-2

Abstract

In a Monte Carlo study, ten bivariate location estimators are compared as regards their accuracy and robustness. In addition to the arithmetic mean, five bivariate medians and four depth-based trimmed means are thus investigated. The behavior of the estimators is examined under various sampling situations determined by three sample sizes and 26 underlying distributions, 14 of which are centrally symmetric and 12 are asymmetric contaminated normals. Performance is assessed through numerical functions of the sample mean squared error and bias matrices.

Introduction

Robust alternatives to the arithmetic mean for estimating location have a history going back at least to Laplace (see Stigler, 1986, p. 54). Fisher (1922) drew attention to the inefficiency of the arithmetic mean as an estimator of location for some distributions belonging to the family of Pearson curves near the normal. Using his normal contamination models, Tukey (1960) dramatically demonstrated how inefficient the mean can become when contamination increases. The same paper also shows how alternative location estimators such as the median or trimmed means can achieve higher asymptotic efficiency than the mean. As a result, statisticians have become more wary of making uncritical use of normal theory and more aware of the need for robust procedures, that is, procedures that remain good when the assumed model does not quite fit.

A theory of robust estimation was first developed by Huber (1964) in a paper that also introduced the so-called M-estimators of location, a class of estimators that includes the arithmetic mean, the median and maximum likelihood estimators. Huber defined a robust estimator of location as one that minimizes the asymptotic variance over some neighborhood of a known distribution (such as the normal). In the wake of Huber's work, several statisticians assessed the robustness and efficiency of various classes of location estimators for a variety of parent distributions; these classes include estimators defined by linear combinations of order statistics (L-estimators) and estimators derived from rank tests (R-estimators). This effort culminated in a major Monte Carlo robustness study conducted at Princeton by Andrews et al. (1972), in which the performances of 68 univariate estimators of location were compared on samples from a dozen distributions. For good historical accounts of the development of these ideas, see Huber (1972) and Hampel et al. (1986).

Bickel (1964) was one of the first to study efficiency and robustness for multivariate location estimators. In his paper, the vector of coordinate medians and the vector of coordinate medians of averages of pairs (vector of coordinate Hodges–Lehmann estimators) are compared with the mean with respect to asymptotic efficiency under various distributions, some close to the normal, some not; in addition, Bickel investigates the robustness of the vector of coordinate Hodges–Lehmann estimators when the underlying distribution is contaminated.
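The two coordinatewise estimators discussed by Bickel can be illustrated with a short sketch. It is only a minimal illustration, not code from the study: the function names are hypothetical, and the Hodges–Lehmann estimate is assumed here to use the pairwise averages (x_i + x_j)/2 over all pairs with i ≤ j, which is one common convention.

```python
import numpy as np

def coordinate_median(x):
    """Vector of coordinatewise medians of an (n, d) sample."""
    return np.median(np.asarray(x, dtype=float), axis=0)

def coordinate_hodges_lehmann(x):
    """Vector of coordinatewise Hodges-Lehmann estimates.

    Each coordinate is the median of the pairwise averages
    (x_i + x_j) / 2 over i <= j (an assumed convention).
    """
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(x.shape[0])  # all index pairs with i <= j
    return np.median((x[i] + x[j]) / 2.0, axis=0)

# Toy sample with one gross outlier (values chosen only for illustration).
sample = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [50.0, -40.0]])
print("mean:            ", sample.mean(axis=0))
print("coord. median:   ", coordinate_median(sample))
print("coord. H-L:      ", coordinate_hodges_lehmann(sample))
```

On this toy sample the single outlier pulls the mean far from the bulk of the data, while both coordinatewise estimators stay close to it.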

Beginning in the seventies, a few authors obtained several multivariate versions of typically univariate notions such as medians, L-estimators and R-estimators. Four of the multivariate medians are now known as the spatial median (also called mediancenter or L1-median), the Tukey or half-space median, the Oja median and the Liu or simplicial median; these estimators were, respectively, proposed or discussed by Gini and Galvani (1929) (see also Haldane (1948)), Tukey (1975), Oja (1983) and Liu (1990). Some work has been done on the efficiency of the spatial median and the Oja median with respect to the arithmetic mean; a good reference for some of the results is Hettmansperger and McKean (1998).

Through his notion of depth, Tukey (1975) initiated a very fruitful approach for defining multivariate location estimators. A depth can be seen as a device for measuring the centrality of a multivariate data point within a given data cloud. Each such function induces a center-outward ranking of data points within a given multivariate data set; this allows a multivariate generalization of univariate location estimators such as the median, trimmed means or, more generally, L-estimators of location. In particular, Tukey's depth is the basis for the so-called half-space median and a notion of trimmed mean used in this paper. Various depth functions have since appeared in the statistical literature (see Zuo and Serfling, 2000a), each one giving rise to a ranking of the data and, therefore, to a family of L-estimators of location. Besides Tukey's depth, the best-known depth function is the simplicial depth of Liu (1990), which is used in this paper through the Liu median and Liu depth-based trimmed means.
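To make the notion of depth concrete, the following sketch computes the half-space (Tukey) depth of a point in a bivariate cloud, counted as the smallest number of sample points in any closed half-plane whose boundary passes through that point. It is a brute-force O(n²) illustration written for this purpose; it is not the algorithm of Rousseeuw and Ruts (1996), and the function name is hypothetical.

```python
import numpy as np

def halfspace_depth(theta, x):
    """Half-space depth of theta in the (n, 2) sample x, as a count.

    Brute force: the minimising half-plane can be taken with its boundary
    rotated slightly away from a line through theta and a sample point,
    so it suffices to inspect those n lines.
    """
    d = np.asarray(x, dtype=float) - np.asarray(theta, dtype=float)
    at_theta = np.all(d == 0.0, axis=1)
    n0 = int(at_theta.sum())          # points equal to theta lie in every half-plane
    d = d[~at_theta]
    if d.shape[0] == 0:
        return n0
    # cross[j, i] > 0: point i is to the left of the line from theta through point j
    cross = d[:, [0]] * d[:, 1] - d[:, [1]] * d[:, 0]
    dot = d @ d.T
    left = (cross > 0).sum(axis=1)
    right = (cross < 0).sum(axis=1)
    on_line = cross == 0
    ahead = (on_line & (dot > 0)).sum(axis=1)   # same ray as point j
    behind = (on_line & (dot < 0)).sum(axis=1)  # opposite ray
    return n0 + int(np.min(np.minimum(left, right) + np.minimum(ahead, behind)))

# Center-outward ranking of a toy cloud: four corners of a square plus its center.
points = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1], [0, 0]], dtype=float)
depths = np.array([halfspace_depth(p, points) for p in points])
print(depths)                 # depth of each sample point
print(np.argsort(-depths))    # indices ordered from deepest to shallowest
```

Sorting the sample by decreasing depth gives the center-outward ranking described above; a depth-based median or trimmed mean would then be built from the deepest points.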

For a depth function to be a useful practical tool, ease of computation is a prerequisite. An algorithm was proposed by Niinimaa et al. (1992) to compute the Oja bivariate median. Until recently, no algorithms were available to compute the half-space and simplicial depths, which severely limited the applicability of these functions. An important advance came with Rousseeuw and Ruts (1996), who constructed exact algorithms for computing the half-space and simplicial depths of a point in a two-dimensional cloud. Rousseeuw and Ruts (1998) did the same for the computation of the bivariate Tukey median. Exact or approximate algorithms have also been obtained for higher dimensions: see Rousseeuw and Ruts (1998), Rousseeuw and Struyf (1998) and Struyf and Rousseeuw (2000).

Taking advantage of the algorithms recently proposed to compute depths, this Monte Carlo study aims to make a finite-sample comparison of the accuracy and robustness of ten bivariate estimators of location, six of which are based on Tukey's or Liu's depth. For each estimator, sample size and underlying distribution, performance is assessed through numerical functions of the estimated mean squared error and bias.

Little is known about the efficiency of the Tukey or Liu depth-based location estimators studied in this paper. Through a small simulation, Rousseeuw and Ruts (1998) studied empirically the efficiency of the Tukey median and the vector of medians for various sample sizes, with the standard bivariate normal as the underlying distribution. Fraiman and Meloche (1999) performed a similar study of six location estimators (including the vector of medians, as well as the spatial and Liu medians) for sample size 20 under various distributions, mostly different from ours. At present, the Monte Carlo method appears to be the only practical means of studying the efficiency of most depth-based location estimators. Indeed, for six of the depth-based location estimators chosen for this study, exact efficiency calculations remain intractable even for a normal distribution.

The paper is arranged as follows. Section 2 describes the location estimators studied in the simulation. Section 3 deals with the bivariate distributions which along with the sample size determine the sampling situations applied to the estimators. Section 4 covers some technical aspects of the Monte Carlo study and describes the numerical measures used to assess the estimators. Section 5 reports on the performances of the estimators and interprets the results. Section 6 is a conclusion.

Bivariate location estimators selected for the study

Let $F$ be a probability distribution on $\mathbb{R}^2$ and $X_1, X_2, \ldots, X_n$ a random sample from $F$. A bivariate location estimator can be informally described as an $\mathbb{R}^2$-valued function $T_n$, defined for each sample size $n$, mapping the set of data points to some point $T_n(X_1, \ldots, X_n)$, which we understand as an approximation of the location or center of $F$.

In this Monte Carlo study, it will always be assumed that 0 is the true location to be estimated. This is the natural center for 14 simulated distributions which are

Bivariate distributions simulated

Twenty-six distributions were investigated. Among these, 14 are centrally symmetric about 0, where central symmetry about 0 means that X and −X have the same distribution. The 12 remaining simulated distributions are asymmetric contaminated normal distributions. One of the distributions has finite support and the others have been chosen such that heavy-tailedness ranges from low to high.
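As an illustration of the contaminated normal family, the sketch below draws from a two-component mixture (1 − ε)·N(0, I) + ε·N(shift, scale²·I). The parameterization, the function name and the numerical values are assumptions made for this example only; the excerpt does not give the contamination settings actually used in the study. With a zero shift the mixture is centrally symmetric about 0, and with a non-zero shift it is asymmetric.

```python
import numpy as np

def sample_contaminated_normal(n, eps, shift, scale, rng):
    """Draw n points from (1 - eps) * N(0, I) + eps * N(shift, scale^2 * I)."""
    shift = np.asarray(shift, dtype=float)
    contaminated = rng.random(n) < eps            # which points come from the contaminant
    points = rng.standard_normal((n, 2))
    points[contaminated] = shift + scale * points[contaminated]
    return points

rng = np.random.default_rng(2003)
# Hypothetical settings: 10% contamination, shifted along the first axis.
x = sample_contaminated_normal(200, eps=0.10, shift=(5.0, 0.0), scale=1.0, rng=rng)
print(x.shape)
print(x.mean(axis=0))          # pulled away from 0 by the asymmetric contamination
print(np.median(x, axis=0))    # typically much less affected
```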

Sample sizes and number of replications

In what follows, each combination of distribution and sample size is called a sampling situation. This Monte Carlo experiment studies the behavior of the 10 bivariate location estimators under 78 sampling situations determined by three sample sizes (20, 60 and 200) and 26 distributions. For each sampling situation, 500 replications were used to take sampling variability into account.
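The structure of such an experiment can be sketched as follows. This is a minimal outline under stated assumptions, not the study's Fortran implementation: the function names are hypothetical, only the standard bivariate normal and two simple estimators are used, and the trace of the mean squared error matrix is taken as one possible scalar summary (the paper's own measures are not reproduced here). Since the true location is 0, the bias is the average of the estimates and the mean squared error matrix is the average of their outer products.

```python
import numpy as np

def monte_carlo_mse(estimator, sampler, n, replications, rng):
    """Estimate the bias vector and MSE matrix of an estimator of location 0."""
    estimates = np.array([estimator(sampler(n, rng)) for _ in range(replications)])
    bias = estimates.mean(axis=0)                  # true location is 0
    mse = estimates.T @ estimates / replications   # average of T T^t
    return bias, mse

def standard_normal_sampler(n, rng):
    """Sample of size n from the standard bivariate normal."""
    return rng.standard_normal((n, 2))

estimators = {
    "mean": lambda x: x.mean(axis=0),
    "coordinate median": lambda x: np.median(x, axis=0),
}

rng = np.random.default_rng(0)
for n in (20, 60, 200):                            # sample sizes used in the study
    for name, est in estimators.items():
        bias, mse = monte_carlo_mse(est, standard_normal_sampler, n, 500, rng)
        print(f"n={n:3d}  {name:18s}  trace(MSE)={np.trace(mse):.4f}")
```

Under the standard bivariate normal the mean has MSE matrix I/n, so its trace should be close to 2/n, which gives a quick sanity check on the loop.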

Algorithms

The Ranlib library of Fortran routines for random number generation has been used. Johnson (1987) provides

Results and interpretation

The results are presented in a series of five tables at the end of the paper. In all those tables, the performances of the location estimators are compared through the numerical measures M and B, or some function of these.

Conclusion

The performance of 10 bivariate location estimators has been investigated under 78 sampling situations, the primary concern always being accuracy and robustness. In addition to three sample sizes, 14 centrally symmetric distributions and 12 asymmetric contaminated normals were retained for the study, most of these distributions having some degree of heavy-tailedness.

The study has shown that four bivariate medians and, to a lesser degree, two depth-based trimmed means are very good alternatives

Acknowledgements

The authors greatly appreciate the constructive remarks and suggestions made by the referees, which led to improvement of the paper.

The research of the first author was supported by grants from the Natural Sciences and Engineering Research Council of Canada and the Fonds FCAR de la Province de Québec.

References (34)

  • B. Chakraborty et al., A note on the robustness of multivariate medians, Statist. Probab. Lett. (1999)
  • H. Oja, Descriptive statistics for multivariate distributions, Statist. Probab. Lett. (1983)
  • D.F. Andrews et al., Robust Estimates of Location: Survey and Advances (1972)
  • F.K. Bedall et al., AS 143: the mediancentre, Appl. Statist. (1979)
  • P.J. Bickel, On some alternative estimates for shift in the p-variate one sample problem, Ann. Math. Statist. (1964)
  • P. Chaudhuri, On a geometric notion of quantiles for multivariate data, J. Amer. Statist. Assoc. (1996)
  • D. Donoho et al., Breakdown properties of location estimates based on half-space depth and projected outlyingness, Ann. Statist. (1992)
  • R.A. Fisher, On the mathematical foundations of theoretical statistics, Philos. Trans. Roy. Soc. London Ser. A (1922)
  • R. Fraiman et al., Multivariate L-estimation, Soc. Estad. Invest. Oper. Test (1999)
  • C. Gini, L. Galvani, Di talune estensioni dei concetti di media ai caratteri qualitativi, Metron 8 (1929) 3–209....
  • J.B.S. Haldane, Note on the median of a multivariate distribution, Biometrika (1948)
  • F.R. Hampel et al., Robust Statistics. The Approach Based on Influence Functions (1986)
  • T.P. Hettmansperger et al., Robust Nonparametric Statistical Methods (1998)
  • P.J. Huber, Robust estimation of a location parameter, Ann. Math. Statist. (1964)
  • P.J. Huber, Robust statistics: a review, Ann. Math. Statist. (1972)
  • M.E. Johnson, Multivariate Statistical Simulation (1987)
  • E.L. Lehmann, Theory of Point Estimation (1983)