Error rates for multivariate outlier detection

https://doi.org/10.1016/j.csda.2010.05.021

Abstract

Multivariate outlier identification requires the choice of reliable cut-off points for the robust distances that measure the discrepancy from the fit provided by high-breakdown estimators of location and scatter. Multiplicity issues affect the identification of the appropriate cut-off points. We describe how a careful choice of the error rate controlled during the outlier detection process can yield a good compromise between high power and low swamping when alternatives to the Family Wise Error Rate are considered. We propose multivariate outlier detection rules based on the False Discovery Rate and the False Discovery Exceedance criteria, evaluate their properties through simulation and apply them to real data examples. We conclude that the proposed approach provides a sensible strategy in many situations of practical interest.

Introduction

With multivariate data, multiple outliers are revealed by their large distances from the robust fit provided by high-breakdown estimators of location and scatter (Hubert et al., 2008). An important issue is the occurrence of multiplicity problems when outlier detection is set up in a statistical testing framework. Multiplicity arises because the candidate outliers are not known in advance and all the observations are tested in sequence, starting from the most remote one. Different error rates may be of interest when performing multiple tests. The multiplicity problem has not been considered thoroughly in the outlier detection literature, although there are notable exceptions, such as Becker and Gather (1999) and Davies and Gather (1993), who define outward testing procedures and use the Šidák correction to guarantee that the level of swamping stays below a threshold.

The goal of this paper is to show how a careful choice of the error rate to be controlled in multiple outlier detection can provide a reasonable compromise between good performance under the null hypothesis of no outliers and high power under contamination. In particular, we focus on multivariate outlier detection rules based on the False Discovery Rate (FDR) of Benjamini and Hochberg (1995) and on the False Discovery eXceedance (FDX) of Lehmann and Romano (2005) and van der Laan et al. (2004), and we compare the power of the resulting outlier tests with that of alternative procedures attaining the same nominal size. We also evaluate the positive FDR (pFDR) of the procedures (Storey, 2002, 2003). We conclude that controlling these error rates, especially the FDR, can be a sensible strategy for outlier identification in many situations of practical interest.

The rest of the paper is organized as follows: in the remainder of this section we briefly review the error rates of interest in multiple testing. In Section 2 we set out our strategies for FDR and FDX control, and for pFDR estimation, in multivariate outlier identification. The merits of these strategies are illustrated through a simulation study in Section 3 and on two motivating examples in Section 4.

Let $y_i$ be a $v$-variate observation with mean vector $\mu$ and covariance matrix $\Sigma$. Our basic model explaining the genesis of $y_i$ is a two-component mixture model of the kind $y_i \mid z_i \sim F_{z_i}$, for some unobserved $z_i \in \{0, 1\}$. The clean observations arise from $F_0 \equiv N(\mu, \Sigma)$, while the contaminated observations are those for which $z_i = 1$, with $F_1$ arbitrary. Outlier detection is stated in terms of testing the $n$ null hypotheses
$$H_{0i}: y_i \sim N(\mu, \Sigma), \qquad i = 1, \dots, n. \tag{1}$$
Each test is performed by computing the squared robust distance
$$d_i^2 = (y_i - \tilde{\mu})' \tilde{\Sigma}^{-1} (y_i - \tilde{\mu}), \tag{2}$$
where $\tilde{\mu}$ and $\tilde{\Sigma}$ are high-breakdown estimators of $\mu$ and $\Sigma$. In this paper we take $\tilde{\mu}$ and $\tilde{\Sigma}$ to be the reweighted MCD (RMCD) estimators of Rousseeuw and Van Driessen (1999).
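As an illustration, this testing setup can be sketched in a few lines of Python. The sketch uses scikit-learn's MinCovDet as a stand-in for the RMCD estimator (its reweighting details may differ from the implementation used in the paper) and the asymptotic $\chi^2_v$ reference distribution, which is cruder than the finite-sample approximations the paper relies on; the sample size and dimension are arbitrary illustrative values.

```python
# Minimal sketch of the testing setup in (1)-(2); MinCovDet is a
# stand-in for the RMCD estimator used in the paper.
import numpy as np
from scipy import stats
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
n, v = 200, 4
y = rng.multivariate_normal(np.zeros(v), np.eye(v), size=n)

mcd = MinCovDet(random_state=0).fit(y)   # robust location and scatter
d2 = mcd.mahalanobis(y)                  # squared robust distances d_i^2

# Asymptotic chi-squared(v) reference for d_i^2 -- a crude choice; the
# paper uses more accurate finite-sample null approximations.
pvals = stats.chi2.sf(d2, df=v)
```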

Suppose that there are $M_0$ clean observations and $M_1$ contaminated ones, and let $R$ be the number of observations declared to be outliers, i.e. those for which (1) is rejected. Table 1 summarizes the outcome of the outlier detection process. The values of $N_{0|1}$ and $N_{1|0}$ measure the amount of masking and swamping, respectively. Furthermore, the quantities in Table 1 are used to define error rates, which are deemed to be under control when they are bound, before the experiment, to be below a threshold $\alpha$.
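The bookkeeping behind Table 1 can be made concrete with a small helper. A minimal sketch, assuming the true contamination indicators $z_i$ are known (as in a simulation) and that `rejected` collects the detection decisions; the function name is ours:

```python
# Counts underlying Table 1: N_{1|0} measures swamping (clean
# observations flagged as outliers), N_{0|1} measures masking
# (contaminated observations that go undetected).
import numpy as np

def outcome_counts(z, rejected):
    z = np.asarray(z, dtype=bool)            # True = contaminated
    rejected = np.asarray(rejected, dtype=bool)
    return {
        "N0|0": int((~z & ~rejected).sum()),  # clean, not flagged
        "N1|0": int((~z & rejected).sum()),   # clean, flagged -> swamping
        "N0|1": int((z & ~rejected).sum()),   # contaminated, missed -> masking
        "N1|1": int((z & rejected).sum()),    # contaminated, flagged
        "R": int(rejected.sum()),             # total rejections
    }
```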

Traditional methods in multiple testing control the Family Wise Error Rate (FWER), defined as the probability of making one or more false rejections. There is a plethora of methods for FWER control, the simplest being the Bonferroni correction, which performs each individual test at level $\alpha/n$. Another simple, but slightly more powerful, one-step procedure is the Šidák correction, where each test is performed at level $\gamma = 1 - (1 - \alpha)^{1/n}$. The observations selected after control of the FWER are all trusted to be outliers. The main drawback of FWER control is its low power; its consequences may thus be close to those of masking.
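A minimal sketch of the two one-step corrections, again under the crude asymptotic $\chi^2_v$ approximation for the squared robust distances (the paper uses more accurate finite-sample cut-offs); $n$, $v$ and $\alpha$ are arbitrary illustrative values:

```python
# Per-test levels for one-step FWER control: Bonferroni tests each
# hypothesis at alpha/n, Sidak at 1 - (1 - alpha)^(1/n).  Under the
# chi-squared(v) approximation these translate into distance cut-offs.
from scipy import stats

alpha, n, v = 0.05, 200, 4
gamma_bonf = alpha / n
gamma_sidak = 1.0 - (1.0 - alpha) ** (1.0 / n)

cutoff_bonf = stats.chi2.ppf(1.0 - gamma_bonf, df=v)
cutoff_sidak = stats.chi2.ppf(1.0 - gamma_sidak, df=v)
# An observation is flagged as an outlier when d_i^2 exceeds the cut-off.
```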

A different approach is proposed by Benjamini and Hochberg (1995), who define the False Discovery Rate (FDR):
$$\mathrm{FDR} = E\left[\frac{N_{1|0}}{R} \,\middle|\, R > 0\right] \Pr(R > 0). \tag{4}$$
The FDR is the expected proportion of erroneously rejected hypotheses, if any. The method developed by Benjamini and Hochberg (1995) (BH) is a stepwise procedure which proceeds by rejecting all tests corresponding to p-values below $\rho_i \alpha / n$, where $\rho_i$ is the rank of the $i$-th p-value. A very similar error rate, the positive FDR (pFDR), is defined by Storey (2002, 2003) as
$$\mathrm{pFDR} = E\left[\frac{N_{1|0}}{R} \,\middle|\, R > 0\right], \tag{5}$$
thus restricting attention to the cases in which there is at least one rejection. The pFDR has a nice Bayesian interpretation (Storey, 2003). It can be controlled directly or, as we do in this paper, estimated in order to further evaluate the performance of any testing procedure. The pFDR is estimated by
$$\widehat{\mathrm{pFDR}} = \frac{\hat{a}\, p_{(r)}}{r \left(1 - (1 - p_{(r)})^n\right)}, \tag{6}$$
where $r > 0$ denotes the observed value of $R$, $p_{(r)}$ is the largest p-value associated with the rejected tests and $\hat{a}$ is an estimator of the number of true null hypotheses. In this paper we use the Schweder and Spjøtvoll (1982) estimator and set $\hat{a} = 2(n - \tau_{0.5})$, where $\tau_{0.5}$ denotes the number of p-values smaller than or equal to 0.5.
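The BH step-up rule and the pFDR estimate in (6) are straightforward to implement. A minimal sketch, assuming a vector of p-values computed from the squared robust distances; the function names are ours:

```python
# BH step-up rule and the Schweder-Spjotvoll plug-in pFDR estimate (6).
import numpy as np

def bh_reject(pvals, alpha):
    """Reject all hypotheses whose p-value is at most the largest sorted
    p-value p_(k) satisfying p_(k) <= k * alpha / n."""
    p = np.asarray(pvals)
    n = p.size
    order = np.argsort(p)
    thresh = alpha * np.arange(1, n + 1) / n
    below = p[order] <= thresh
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest passing sorted index
        reject[order[: k + 1]] = True
    return reject

def pfdr_estimate(pvals, reject):
    """Plug-in pFDR estimate with a_hat = 2 * (n - tau_{0.5})."""
    p = np.asarray(pvals)
    n, r = p.size, int(np.sum(reject))
    if r == 0:
        return 0.0
    a_hat = 2 * (n - np.sum(p <= 0.5))     # estimated number of true nulls
    p_r = p[reject].max()                   # largest rejected p-value
    return a_hat * p_r / (r * (1.0 - (1.0 - p_r) ** n))
```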

Both (4) and (5) are based on an expectation, whereas the actual proportion of false discoveries may be larger than $\alpha$. Therefore, Lehmann and Romano (2005) and van der Laan et al. (2004) independently define the False Discovery eXceedance (FDX) as the probability that the false discovery proportion exceeds a threshold, that is,
$$\mathrm{FDX} = \Pr\left(\frac{N_{1|0}}{\max(R, 1)} > c\right), \tag{7}$$
where typically, and also in this paper, $c = 0.1$. We control the FDX using the Lehmann and Romano (2005) (LR) procedure, which rejects all tests corresponding to p-values below
$$\frac{(\lfloor \rho_i c \rfloor + 1)\,\alpha}{n + \lfloor \rho_i c \rfloor + 1 - \rho_i},$$
where $\lfloor \cdot \rfloor$ denotes the integer part.
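A corresponding sketch of the LR step-down rule, using the critical values above; as before, the function name and inputs are illustrative:

```python
# Lehmann-Romano step-down rule for FDX control at proportion c and
# level alpha, with critical values as in the text.
import numpy as np

def lr_fdx_reject(pvals, alpha, c=0.1):
    p = np.asarray(pvals)
    n = p.size
    order = np.argsort(p)
    ranks = np.arange(1, n + 1)                 # rho_i for sorted p-values
    floor_rc = np.floor(c * ranks).astype(int)  # integer part of rho_i * c
    crit = (floor_rc + 1) * alpha / (n + floor_rc + 1 - ranks)
    below = p[order] <= crit
    # Step-down: reject until the first sorted p-value exceeds its
    # critical value, then stop.
    k = n if below.all() else int(np.argmax(~below))
    reject = np.zeros(n, dtype=bool)
    reject[order[:k]] = True
    return reject
```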

Many other methods and error rates are available; for a review we refer to Farcomeni (2008). An important feature for our purposes is that the Bonferroni and Šidák corrections provide strong control of the FWER, which is bounded no matter the number and configuration of the outliers. In contrast, the procedures based on the FDR and FDX ensure only weak control of the FWER, which is then bounded only under the complete null hypothesis of no outliers, $H_0: \bigcap_{i=1}^{n} H_{0i}$. The main consequence for outlier detection is that FDR (or FDX) control provides a balance between ignoring multiplicity, as in Hardin and Rocke (2005) or in Hubert et al. (2008), and strictly correcting for multiplicity through FWER control, as in Becker and Gather (1999) or in Cerioli et al. (2009). The improvement obtained by controlling (4) or (7) may be particularly advantageous when $n$ is large, or when many samples of moderate size need to be analyzed in sequence. In such instances the total number of hypotheses (1) to be tested will be large, and the loss of power induced by strong FWER control will become more relevant.


FDR and FDX rules for multivariate outlier detection

The performance of any outlier detection method with well-behaved data sets is ruled by two basic elements:

(a) availability of a good approximation to the unknown finite-sample null distribution of the squared robust distances (2);

(b) correction for the multiplicity implied by repeated testing of the $n$ individual hypotheses (1).

Avoiding (b) leads to identifying a proportion α of false outliers in any good data set, a situation that can have negative consequences in many applications like the examples
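To see why point (a) matters, note that even for classical (non-robust) Mahalanobis distances the exact finite-sample null distribution differs from the asymptotic $\chi^2_v$: with the sample mean and the unbiased sample covariance, the within-sample squared distance follows a scaled Beta law. A minimal sketch of the resulting p-value computation (the paper relies on analogous finite-sample results tailored to RMCD distances, not on this classical case):

```python
# Exact null distribution of the classical within-sample squared
# Mahalanobis distance (sample mean, unbiased sample covariance):
#   d_i^2 ~ ((n - 1)^2 / n) * Beta(v / 2, (n - v - 1) / 2).
from scipy import stats

def classical_md_pvalue(d2, n, v):
    scale = (n - 1) ** 2 / n
    return stats.beta.sf(d2 / scale, v / 2, (n - v - 1) / 2)
```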

Enemy brothers: power and swamping

We now show the results of a simulation experiment run under the location-shift contamination model $N(\mu + \lambda e, \Sigma)$, where $\lambda$ is a positive scalar and $e$ is a column vector of ones. In our study, a proportion $\omega$ of the observations comes from the location-shift contamination model, while the remaining $n(1 - \omega)$ observations come from the null $N(\mu, \Sigma)$ model. We call $\omega$ the contamination rate. We also define power to be the proportion of contaminated observations correctly labeled as outliers. Without loss of
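The design just described can be sketched as follows: a minimal simulation of one contaminated sample, with $\mu = 0$, $\Sigma = I$ and illustrative values of $n$, $v$, $\omega$ and $\lambda$. Any of the detection rules sketched above can then be applied to the resulting sample.

```python
# One sample under the location-shift contamination model: a proportion
# omega of observations is shifted by lambda along the vector of ones.
import numpy as np

rng = np.random.default_rng(1)
n, v = 200, 4
omega, lam = 0.1, 2.0                      # contamination rate and shift

n_out = int(round(n * omega))
clean = rng.multivariate_normal(np.zeros(v), np.eye(v), size=n - n_out)
shifted = rng.multivariate_normal(lam * np.ones(v), np.eye(v), size=n_out)
y = np.vstack([clean, shifted])
is_outlier = np.r_[np.zeros(n - n_out, bool), np.ones(n_out, bool)]

# Power = proportion of contaminated observations correctly flagged:
# power = flagged[is_outlier].mean() for any detection rule `flagged`.
```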

Data analysis

We outline two real data examples, on which we demonstrate the usefulness of multiplicity corrections. Both examples might also be seen as classification problems, for which alternative solutions are available. Compared with many classifiers, the outlier detection framework has the disadvantage of requiring distributional assumptions on the sample. Nevertheless, we believe that our approach is worthwhile in these examples for several reasons.

First, with statistical classifiers it may be hard to

Conclusions

In this paper we have explored alternative ways to reconcile the two opposite goals of multivariate outlier detection: achieving high power under contamination and ensuring low swamping with well-behaved data. We have shown that the choice among the alternative methodologies mainly depends on the user's attitude towards swamping. The FSRMCD and IRMCD procedures proposed by Cerioli (2010a) have opposite performances. With FSRMCD the level of swamping is kept under control for any number and

Acknowledgements

The authors are grateful to two anonymous reviewers for many useful comments that helped to sharpen the focus of the article. The authors also thank Anthony C. Atkinson and Marco Riani for helpful discussions on previous drafts of this work. Research of the first author was partially supported by the grant “Nuovi metodi multivariati robusti per la valutazione delle politiche sull’e-government e la società dell’informazione” of the Ministero dell’Università e della Ricerca, PRIN 2008.

References

• A. Farcomeni et al., Nonparametric analysis of infrared spectra for recognition of glass and ceramic glass fragments in recycling plants, Waste Management (2008).
• P. Filzmoser et al., Outlier identification in high dimensions, Computational Statistics and Data Analysis (2008).
• V. Todorov et al., Robust statistic for the one-way MANOVA, Computational Statistics and Data Analysis (2010).
• C. Becker et al., The masking breakdown point of multivariate outlier identification rules, Journal of the American Statistical Association (1999).
• Y. Benjamini et al., Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B (1995).
• Y. Benjamini et al., The control of the false discovery rate in multiple testing under dependency, Annals of Statistics (2001).
• A. Cerioli, Multivariate outlier detection with high-breakdown estimators, Journal of the American Statistical Association (2010).
• A. Cerioli, Diagnostic checking of multivariate normality under contamination (2010b). …
• A. Cerioli et al., Controlling the size of multivariate outlier tests with the MCD estimator of scatter, Statistics and Computing (2009).
• C. Croux et al., Influence functions of the Spearman and Kendall correlation measures, Statistical Methods and Applications (2010).
• L. Davies et al., The identification of multiple outliers, Journal of the American Statistical Association (1993).
• A. Farcomeni, Some results on the control of the false discovery rate under dependence, Scandinavian Journal of Statistics (2007).
• A. Farcomeni, A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion, Statistical Methods in Medical Research (2008).
• A. Farcomeni, Generalized augmentation to control the false discovery exceedance in multiple testing, Scandinavian Journal of Statistics (2009).