Error rates for multivariate outlier detection
Introduction
With multivariate data, multiple outliers are revealed by their large distances from the robust fit provided by high-breakdown estimators of location and scatter (Hubert et al., 2008). An important issue is the occurrence of multiplicity problems when outlier detection is set up in a statistical testing framework. Multiplicity arises because the candidate outliers are not known in advance and all the observations are tested in sequence, starting from the most remote one. Different error rates may be of interest when performing multiple tests. The multiplicity problem has not been considered thoroughly in the literature on outlier detection, although there are notable exceptions, such as Becker and Gather (1999) and Davies and Gather (1993), who define outward testing procedures and use the Sidak correction to guarantee that the level of swamping stays below a threshold.
The goal of this paper is to show how carefully choosing the error rate to be controlled in multiple outlier detection can provide a reasonable compromise between good performance under the null hypothesis of no outliers and high power under contamination. In particular, we focus on multivariate outlier detection rules based on the False Discovery Rate (FDR) of Benjamini and Hochberg (1995) and on the False Discovery eXceedance (FDX) of Lehmann and Romano (2005) and van der Laan et al. (2004), and we compare the power of the resulting outlier tests with that of alternative procedures attaining the same nominal size. We also evaluate the positive FDR (pFDR) of the procedures (Storey, 2002, 2003). We conclude that controlling these error rates, especially the FDR, can be a sensible strategy for outlier identification in many situations of practical interest.
The rest of the paper is organized as follows: in the remainder of this section we briefly review the error rates of interest in multiple testing. In Section 2 we set out our strategies for FDR and FDX control, and for pFDR estimation, in multivariate outlier identification. The merits of these strategies are illustrated with a simulation study in Section 3 and on two motivating examples in Section 4.
Let $y_i$ be a $v$-variate observation with mean vector $\mu$ and covariance matrix $\Sigma$. Our basic model explaining the genesis of $y_i$ is a two-component mixture model of the kind
$$y_i \sim (1 - B_i)\,N(\mu, \Sigma) + B_i\,G,$$
for some unobserved $B_i \in \{0, 1\}$. The clean observations arise from $N(\mu, \Sigma)$, while the contaminated observations are those for which $B_i = 1$, with $G$ arbitrary. Outlier detection is stated in terms of testing the $n$ null hypotheses
$$H_{0i}: B_i = 0, \qquad i = 1, \dots, n. \qquad (1)$$
Each test is performed by computing the squared robust distance
$$d_i^2 = (y_i - \hat{\mu})^{\prime}\,\hat{\Sigma}^{-1}\,(y_i - \hat{\mu}), \qquad (2)$$
where $\hat{\mu}$ and $\hat{\Sigma}$ are high-breakdown estimators of $\mu$ and $\Sigma$. In this paper we take $\hat{\mu}$ and $\hat{\Sigma}$ to be the reweighted MCD (RMCD) estimators of Rousseeuw and Van Driessen (1999).
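As a concrete illustration, the squared robust distances (2) can be computed with off-the-shelf tools. The sketch below uses scikit-learn's MinCovDet (which implements a reweighted MCD fit) as a stand-in for the RMCD estimators; the data, dimension and shift are illustrative choices of ours, not taken from the paper:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
v = 3                            # dimension
n = 200                          # sample size
X = rng.standard_normal((n, v))  # clean observations from N(0, I)
X[:5] += 6.0                     # shift the first 5 rows: contaminated observations

# High-breakdown fit: MinCovDet computes a (reweighted) MCD location and scatter
mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)          # squared robust distances d_i^2 as in (2)

# Per-test p-values from the asymptotic chi-square(v) reference distribution
pvals = chi2.sf(d2, df=v)
```

The chi-square reference used here is only the asymptotic approximation; in finite samples it can be quite inaccurate, which is precisely why the null distribution of the distances deserves care.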
Suppose that there are $n_0$ clean observations and $n_1 = n - n_0$ contaminated ones. Let $R$ be the number of observations declared to be outliers, i.e. those for which (1) is rejected, and let $V$ be the number of clean observations erroneously declared to be outliers. Table 1 summarizes the outcome of the outlier detection process. The number of undetected contaminated observations, $n_1 - (R - V)$, and the number of erroneous rejections, $V$, measure the amount of masking and swamping, respectively. Furthermore, the quantities in Table 1 are used to define error rates, which are deemed to be under control when they are bound, before the experiment, to be below a threshold $\alpha$.
Traditional methods in multiple testing involve control of the Family Wise Error Rate (FWER), defined as the probability of making one or more false rejections. There is a plethora of methods for FWER control, the simplest being the Bonferroni correction, which consists in performing each individual test at level $\alpha / n$. Another simple, but slightly more powerful, one-step procedure is the Sidak correction, where each test is performed at level $1 - (1 - \alpha)^{1/n}$. The observations selected after control of the FWER are all trusted to be outliers. The main drawback of FWER control is its low power. The consequences of FWER control may thus be close to those of masking.
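The two one-step corrections differ only in the per-test level; a quick numerical check (the choice of $n = 100$ tests is arbitrary) confirms that Sidak is the slightly less conservative of the two:

```python
alpha = 0.05   # overall FWER target
n = 100        # number of observations, hence of tests (1)

alpha_bonf = alpha / n                    # Bonferroni per-test level
alpha_sidak = 1 - (1 - alpha) ** (1 / n)  # Sidak per-test level

# Sidak's per-test level is a bit larger than Bonferroni's, and under
# independence it attains the FWER target exactly:
fwer_sidak = 1 - (1 - alpha_sidak) ** n   # equals alpha up to rounding
```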
A different approach is proposed by Benjamini and Hochberg (1995), who define the False Discovery Rate (FDR):
$$\mathrm{FDR} = E\left[\frac{V}{\max(R, 1)}\right]. \qquad (4)$$
The FDR is the expected proportion of erroneously rejected hypotheses, if any. The method developed by Benjamini and Hochberg (1995) (BH) is a stepwise procedure which proceeds by rejecting all tests corresponding to $p$-values not larger than $p_{(i^*)}$, where $p_{(i)}$ denotes the $i$-th ordered $p$-value and $i^*$ is the largest rank $i$ such that $p_{(i)} \le i\alpha/n$. A very similar error rate, the positive FDR (pFDR), is defined by Storey (2002, 2003) as
$$\mathrm{pFDR} = E\left[\frac{V}{R} \,\middle|\, R > 0\right], \qquad (5)$$
thus restricting attention to the cases in which there is at least one rejection. The pFDR has a nice Bayesian interpretation (Storey, 2003). It can be directly controlled or, as we do in this paper, it can be estimated to further evaluate the performance of any testing procedure. The pFDR is estimated by
$$\widehat{\mathrm{pFDR}} = \frac{\hat{n}_0\, p_{(r)}}{r\left\{1 - (1 - p_{(r)})^{n}\right\}}, \qquad (6)$$
where $r$ denotes the observed value of $R$, $p_{(r)}$ is the largest $p$-value associated with rejected tests and $\hat{n}_0$ is an estimator of the number of true null hypotheses. In this paper we use the Schweder and Spjøtvoll (1982) estimator and set $\hat{n}_0 = 2(n - C_{0.5})$, where $C_{0.5}$ denotes the count of $p$-values smaller than or equal to 0.5.
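In code, the BH step-up rule and a plug-in pFDR estimate read as follows. The $p$-values are made up for illustration, and the estimator is written in the Storey (2002) form with a Schweder–Spjøtvoll-type estimate of the number of true nulls at $\lambda = 0.5$; this is our reading of the estimator, sketched under those assumptions:

```python
import numpy as np

def bh_rejections(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: find the largest rank i with
    p_(i) <= i * alpha / n and reject every test with p-value <= p_(i)."""
    p = np.sort(pvals)
    n = len(p)
    ok = np.nonzero(p <= np.arange(1, n + 1) * alpha / n)[0]
    if ok.size == 0:
        return np.zeros(n, dtype=bool)
    return pvals <= p[ok[-1]]

# Illustrative p-values, as if coming from n = 10 tests of the form (1)
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.20, 0.57, 0.68, 0.74, 0.85, 0.93])
rej = bh_rejections(pvals, alpha=0.05)
R = int(rej.sum())                                # number of rejections: here R = 2

# Schweder-Spjotvoll-type estimate of the number of true nulls (lambda = 0.5)
n = len(pvals)
n0_hat = 2 * (n - int((pvals <= 0.5).sum()))

# Plug-in pFDR estimate evaluated at the largest rejected p-value p_(r)
p_r = pvals[rej].max()
pfdr_hat = n0_hat * p_r / (R * (1 - (1 - p_r) ** n))
```

Note the step-up character of BH: 0.039 and 0.041 are not rejected here even though both are below 0.05, because their ranks fail the $p_{(i)} \le i\alpha/n$ comparison.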
Both (4) and (5) are based on an expectation, whereas the actual proportion of false discoveries may be larger than $\alpha$. Therefore, Lehmann and Romano (2005) and van der Laan et al. (2004) independently define the False Discovery eXceedance (FDX) as the probability of the false discovery proportion being above a threshold, that is,
$$\mathrm{FDX} = \Pr\left(\frac{V}{\max(R, 1)} > c\right), \qquad (7)$$
where typically, and also in this paper, $c = 0.1$. We control the FDX using the Lehmann and Romano (2005) (LR) procedure, a step-down rule which rejects all tests corresponding to $p$-values below the critical values
$$\alpha_i = \frac{(\lfloor c\,i \rfloor + 1)\,\alpha}{n + \lfloor c\,i \rfloor + 1 - i},$$
where $\lfloor \cdot \rfloor$ is the integer part.
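A minimal sketch of the LR step-down rule, with illustrative $p$-values and $c = 0.1$ as above (the function name and data are ours):

```python
import math
import numpy as np

def lr_fdx_rejections(pvals, alpha=0.05, c=0.1):
    """Lehmann-Romano step-down procedure bounding P(V/R > c) by alpha.
    Critical values: alpha_i = (floor(c*i) + 1) * alpha / (n + floor(c*i) + 1 - i).
    Reject the ordered hypotheses until the first p-value exceeds its
    critical value, then stop."""
    p = np.sort(pvals)
    n = len(p)
    crit = np.array([(math.floor(c * i) + 1) * alpha / (n + math.floor(c * i) + 1 - i)
                     for i in range(1, n + 1)])
    exceed = np.nonzero(p > crit)[0]
    k = int(exceed[0]) if exceed.size else n   # index of the first exceedance
    if k == 0:
        return np.zeros(len(pvals), dtype=bool)
    return pvals <= p[k - 1]

pvals = np.array([0.0002, 0.001, 0.012, 0.04, 0.25, 0.43, 0.52, 0.61, 0.75, 0.88])
rej = lr_fdx_rejections(pvals, alpha=0.05, c=0.1)
```

With $c = 0.1$ and fewer than ten rejections, $\lfloor c\,i \rfloor = 0$, so the early critical values reduce to the Holm-type levels $\alpha/(n + 1 - i)$; the exceedance parameter only relaxes the thresholds further down the ordered list.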
Many other methods, and error rates, are available; for a review we refer to Farcomeni (2008). An important feature for our purposes is that the Bonferroni and Sidak corrections provide strong control of the FWER, which is bounded no matter the number and the configuration of outliers. Instead, the procedures based on the FDR and FDX ensure weak control of the FWER, which is then bounded only under the complete null hypothesis of no outliers. The main consequence for outlier detection is that FDR (or FDX) control provides a balance between ignoring multiplicity, as in Hardin and Rocke (2005) or in Hubert et al. (2008), and strictly correcting for multiplicity through FWER control, as in Becker and Gather (1999) or in Cerioli et al. (2009). The improvement obtained by controlling (4) or (7) may be particularly advantageous when $n$ is high, or when many samples of moderate size need to be analyzed in sequence. In such instances the total number of hypotheses (1) to be tested will be large and the loss of power induced by strong FWER control will become more relevant.
FDR and FDX rules for multivariate outlier detection
The performance of any outlier detection method with well-behaved data sets is ruled by two basic elements:
- (a) availability of a good approximation to the unknown finite-sample null distribution of the squared robust distances (2);
- (b) correction for the multiplicity implied by repeated testing of the individual hypotheses (1).
Enemy brothers: power and swamping
We now show the results of a simulation experiment run under the location-shift contamination model $N(\mu + \lambda e, \Sigma)$, where $\lambda$ is a positive scalar and $e$ is a column vector of ones. In our study, a proportion $\delta$ of the observations come from the location-shift contamination model, while the remaining $n(1 - \delta)$ observations come from the null model. We call $\delta$ the contamination rate. We also define power to be the proportion of contaminated observations correctly labeled as outliers. Without loss of
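The contamination scheme above can be mimicked in a few lines. The sketch below is an illustrative single replicate, with settings, seed and a Bonferroni cut-off of our own choosing, and with scikit-learn's MinCovDet again standing in for the RMCD fit; it computes the empirical power and the number of swamped clean observations:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
v, n, delta, lam = 2, 150, 0.1, 7.0   # dimension, size, contamination rate, shift
n1 = int(delta * n)                   # number of contaminated observations

X = rng.standard_normal((n, v))
X[:n1] += lam                         # location shift lam * e on the first n1 rows

mcd = MinCovDet(random_state=0).fit(X)
pvals = chi2.sf(mcd.mahalanobis(X), df=v)

rej = pvals < 0.01 / n                # Bonferroni at alpha = 0.01, for illustration
power = rej[:n1].mean()               # proportion of true outliers flagged
swamped = int(rej[n1:].sum())         # clean observations wrongly flagged
```

With a shift this large the high-breakdown fit is barely attracted by the contaminated rows, so the power is essentially one; shrinking $\lambda$ toward zero is what makes the trade-off between power and swamping bite.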
Data analysis
We outline two real data examples, on which we demonstrate the usefulness of multiplicity corrections. Both examples might also be seen as classification situations, for which alternative solutions are available. The outlier detection framework, compared with many classifiers, has the disadvantage of requiring distributional assumptions on the sample. Nevertheless, we believe that our approach is worthwhile in these examples for several reasons.
First, with statistical classifiers it may be hard to
Conclusions
In this paper we have explored alternative ways to reconcile the two opposite goals of multivariate outlier detection: achieving high power under contamination and ensuring low swamping with well behaved data. We have shown that the choice among the alternative methodologies mainly depends on the user attitude towards swamping. The FSRMCD and IRMCD procedures proposed by Cerioli (2010a) have opposite performances. With FSRMCD the level of swamping is kept under control for any number and
Acknowledgements
The authors are grateful to two anonymous reviewers for many useful comments that helped to sharpen the focus of the article. The authors also thank Anthony C. Atkinson and Marco Riani for helpful discussions on previous drafts of this work. Research of the first author was partially supported by the grant “Nuovi metodi multivariati robusti per la valutazione delle politiche sull’e-government e la società dell’informazione” of Ministero dell’Università e della Ricerca–PRIN 2008.
References
- Becker, C., Gather, U., 1999. The masking breakdown point of multivariate outlier identification rules. Journal of the American Statistical Association.
- Benjamini, Y., Hochberg, Y., 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B.
- Benjamini, Y., Yekutieli, D., 2001. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics.
- Cerioli, A., 2010a. Multivariate outlier detection with high-breakdown estimators. Journal of the American Statistical Association.
- Cerioli, A., 2010b. Diagnostic checking of multivariate normality under contamination....
- Cerioli, A., Riani, M., Atkinson, A.C., 2009. Controlling the size of multivariate outlier tests with the MCD estimator of scatter. Statistics and Computing.
- Croux, C., Dehon, C., 2010. Influence functions of the Spearman and Kendall correlation measures. Statistical Methods and Applications.
- Davies, L., Gather, U., 1993. The identification of multiple outliers. Journal of the American Statistical Association.
- Farcomeni, A., 2007. Some results on the control of the false discovery rate under dependence. Scandinavian Journal of Statistics.
- Farcomeni, A., 2008. A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion. Statistical Methods in Medical Research.
- Farcomeni, A., 2009. Generalized augmentation to control the false discovery exceedance in multiple testing. Scandinavian Journal of Statistics.
- Farcomeni, A., Serranti, S., Bonifazi, G., 2008. Nonparametric analysis of infrared spectra for recognition of glass and ceramic glass fragments in recycling plants. Waste Management.
- Filzmoser, P., Maronna, R., Werner, M., 2008. Outlier identification in high dimensions. Computational Statistics and Data Analysis.
- Todorov, V., Filzmoser, P., 2010. Robust statistic for the one-way MANOVA. Computational Statistics and Data Analysis.