Error rates for multivariate outlier detection

https://doi.org/10.1016/j.csda.2010.05.021

Abstract

Multivariate outlier identification requires the choice of reliable cut-off points for the robust distances that measure the discrepancy from the fit provided by high-breakdown estimators of location and scatter. Multiplicity issues affect the identification of the appropriate cut-off points. We describe how a careful choice of the error rate controlled during the outlier detection process can yield a good compromise between high power and low swamping when alternatives to the Family Wise Error Rate are considered. We propose multivariate outlier detection rules based on the False Discovery Rate and the False Discovery Exceedance criteria, evaluate their properties through simulation and apply them to real data examples. We conclude that the proposed approach provides a sensible strategy in many situations of practical interest.

Introduction

With multivariate data, multiple outliers are revealed by their large distances from the robust fit provided by high-breakdown estimators of location and scatter (Hubert et al., 2008). An important issue is the occurrence of multiplicity problems when outlier detection is set up in a statistical testing framework. Multiplicity arises because the candidate outliers are not known in advance and all the observations are tested in sequence, starting from the most remote one. Different error rates may be of interest when performing multiple tests. The multiplicity problem has not been considered thoroughly in the outlier detection literature, although there are notable exceptions, such as Becker and Gather (1999) and Davies and Gather (1993), who define outward testing procedures and use the Šidák correction to guarantee that the level of swamping stays below a threshold.

The goal of this paper is to show how a careful choice of the error rate to be controlled in multiple outlier detection can provide a reasonable compromise between good performance under the null hypothesis of no outliers and high power under contamination. In particular, we focus on multivariate outlier detection rules based on the False Discovery Rate (FDR) of Benjamini and Hochberg (1995) and on the False Discovery eXceedance (FDX) of Lehmann and Romano (2005) and van der Laan et al. (2004), and we compare the power of the resulting outlier tests with that of alternative procedures attaining the same nominal size. We also evaluate the positive FDR (pFDR) of the procedures (Storey, 2002, 2003). We conclude that controlling these error rates, especially the FDR, can be a sensible strategy for outlier identification in many situations of practical interest.

The rest of the paper is organized as follows: in the remainder of this section we briefly review the error rates of interest in multiple testing. In Section 2 we set out our strategies for FDR and FDX control, and for pFDR estimation, in multivariate outlier identification. The merits of these strategies are illustrated through a simulation study in Section 3 and on two motivating examples in Section 4.

Let $y_i$ be a $v$-variate observation with mean vector $\mu$ and covariance matrix $\Sigma$. Our basic model explaining the genesis of $y_i$ is a two-component mixture model of the kind $y_i \mid z_i \sim F_{z_i}$, for some unobserved $z_i \in \{0, 1\}$. The clean observations arise from $F_0 \equiv N(\mu, \Sigma)$, while the contaminated observations are those for which $z_i = 1$, with $F_1$ arbitrary. Outlier detection is stated in terms of testing the $n$ null hypotheses
$$H_{0i}: y_i \sim N(\mu, \Sigma), \qquad i = 1, \dots, n. \tag{1}$$
Each test is performed by computing the squared robust distance
$$d_i^2 = (y_i - \tilde{\mu})' \tilde{\Sigma}^{-1} (y_i - \tilde{\mu}), \tag{2}$$
where $\tilde{\mu}$ and $\tilde{\Sigma}$ are high-breakdown estimators of $\mu$ and $\Sigma$. In this paper we take $\tilde{\mu}$ and $\tilde{\Sigma}$ to be the reweighted MCD (RMCD) estimators of Rousseeuw and Van Driessen (1999).
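As an illustration, this testing setup can be sketched in a few lines of Python. The sketch uses scikit-learn's MinCovDet as a stand-in for the RMCD estimator (its reweighting details may differ from the implementation used in the paper) and the asymptotic $\chi^2_v$ reference distribution, which is cruder than the finite-sample approximations the paper relies on; the sample size and dimension are arbitrary illustrative values.

```python
# Minimal sketch of the testing setup in (1)-(2); MinCovDet is a
# stand-in for the RMCD estimator used in the paper.
import numpy as np
from scipy import stats
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
n, v = 200, 4
y = rng.multivariate_normal(np.zeros(v), np.eye(v), size=n)

mcd = MinCovDet(random_state=0).fit(y)   # robust location and scatter
d2 = mcd.mahalanobis(y)                  # squared robust distances d_i^2

# Asymptotic chi-squared(v) reference for d_i^2 -- a crude choice; the
# paper uses more accurate finite-sample null approximations.
pvals = stats.chi2.sf(d2, df=v)
```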

Suppose that there are $M_0$ clean observations and $M_1$ contaminated ones, and let $R$ be the number of observations declared to be outliers, i.e. those for which (1) is rejected. Table 1 summarizes the outcome of the outlier detection process. The values of $N_{0|1}$ and $N_{1|0}$ measure the amount of masking and swamping, respectively. Furthermore, the quantities in Table 1 are used to define error rates, which are deemed to be under control when they are bound, before the experiment, to be below a threshold $\alpha$.
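The bookkeeping behind Table 1 can be made concrete with a small helper. A minimal sketch, assuming the true contamination indicators $z_i$ are known (as in a simulation) and that `rejected` collects the detection decisions; the function name is ours:

```python
# Counts underlying Table 1: N_{1|0} measures swamping (clean
# observations flagged as outliers), N_{0|1} measures masking
# (contaminated observations that go undetected).
import numpy as np

def outcome_counts(z, rejected):
    z = np.asarray(z, dtype=bool)            # True = contaminated
    rejected = np.asarray(rejected, dtype=bool)
    return {
        "N0|0": int((~z & ~rejected).sum()),  # clean, not flagged
        "N1|0": int((~z & rejected).sum()),   # clean, flagged -> swamping
        "N0|1": int((z & ~rejected).sum()),   # contaminated, missed -> masking
        "N1|1": int((z & rejected).sum()),    # contaminated, flagged
        "R": int(rejected.sum()),             # total rejections
    }
```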

Traditional methods in multiple testing control the Family Wise Error Rate (FWER), defined as the probability of making one or more false rejections. There is a plethora of methods for FWER control, the simplest being the Bonferroni correction, which performs each individual test at level $\alpha/n$. Another simple, but slightly more powerful, one-step procedure is the Šidák correction, where each test is performed at level $\gamma = 1 - (1 - \alpha)^{1/n}$. The observations selected after control of the FWER are all trusted to be outliers. The main drawback of FWER control is its low power; its consequences may thus be close to those of masking.
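A minimal sketch of the two one-step corrections, again under the crude asymptotic $\chi^2_v$ approximation for the squared robust distances (the paper uses more accurate finite-sample cut-offs); $n$, $v$ and $\alpha$ are arbitrary illustrative values:

```python
# Per-test levels for one-step FWER control: Bonferroni tests each
# hypothesis at alpha/n, Sidak at 1 - (1 - alpha)^(1/n).  Under the
# chi-squared(v) approximation these translate into distance cut-offs.
from scipy import stats

alpha, n, v = 0.05, 200, 4
gamma_bonf = alpha / n
gamma_sidak = 1.0 - (1.0 - alpha) ** (1.0 / n)

cutoff_bonf = stats.chi2.ppf(1.0 - gamma_bonf, df=v)
cutoff_sidak = stats.chi2.ppf(1.0 - gamma_sidak, df=v)
# An observation is flagged as an outlier when d_i^2 exceeds the cut-off.
```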

A different approach is proposed by Benjamini and Hochberg (1995), who define the False Discovery Rate (FDR):
$$\mathrm{FDR} = E\left[\frac{N_{1|0}}{R} \,\middle|\, R > 0\right] \Pr(R > 0). \tag{4}$$
The FDR is the expected proportion of erroneously rejected hypotheses, if any. The method developed by Benjamini and Hochberg (1995) (BH) is a stepwise procedure which proceeds by rejecting all tests corresponding to p-values below $\rho_i \alpha / n$, where $\rho_i$ is the rank of the $i$-th p-value. A very similar error rate, the positive FDR (pFDR), is defined by Storey (2002, 2003) as
$$\mathrm{pFDR} = E\left[\frac{N_{1|0}}{R} \,\middle|\, R > 0\right], \tag{5}$$
thus restricting attention to the cases in which there is at least one rejection. The pFDR has a nice Bayesian interpretation (Storey, 2003). It can be controlled directly or, as we do in this paper, estimated in order to further evaluate the performance of any testing procedure. The pFDR is estimated by
$$\widehat{\mathrm{pFDR}} = \frac{\hat{a}\, p_{(r)}}{r \left(1 - (1 - p_{(r)})^n\right)}, \tag{6}$$
where $r > 0$ denotes the observed value of $R$, $p_{(r)}$ is the largest p-value associated with the rejected tests and $\hat{a}$ is an estimator of the number of true null hypotheses. In this paper we use the Schweder and Spjøtvoll (1982) estimator and set $\hat{a} = 2(n - \tau_{0.5})$, where $\tau_{0.5}$ denotes the number of p-values smaller than or equal to 0.5.
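The BH step-up rule and the pFDR estimate in (6) are straightforward to implement. A minimal sketch, assuming a vector of p-values computed from the squared robust distances; the function names are ours:

```python
# BH step-up rule and the Schweder-Spjotvoll plug-in pFDR estimate (6).
import numpy as np

def bh_reject(pvals, alpha):
    """Reject all hypotheses whose p-value is at most the largest sorted
    p-value p_(k) satisfying p_(k) <= k * alpha / n."""
    p = np.asarray(pvals)
    n = p.size
    order = np.argsort(p)
    thresh = alpha * np.arange(1, n + 1) / n
    below = p[order] <= thresh
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest passing sorted index
        reject[order[: k + 1]] = True
    return reject

def pfdr_estimate(pvals, reject):
    """Plug-in pFDR estimate with a_hat = 2 * (n - tau_{0.5})."""
    p = np.asarray(pvals)
    n, r = p.size, int(np.sum(reject))
    if r == 0:
        return 0.0
    a_hat = 2 * (n - np.sum(p <= 0.5))     # estimated number of true nulls
    p_r = p[reject].max()                   # largest rejected p-value
    return a_hat * p_r / (r * (1.0 - (1.0 - p_r) ** n))
```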

Both (4) and (5) are based on an expectation, whereas the actual proportion of false discoveries may be larger than $\alpha$. Therefore, Lehmann and Romano (2005) and van der Laan et al. (2004) independently define the False Discovery eXceedance (FDX) as the probability that the false discovery proportion exceeds a threshold, that is,
$$\mathrm{FDX} = \Pr\left(\frac{N_{1|0}}{\max(R, 1)} > c\right), \tag{7}$$
where typically, and also in this paper, $c = 0.1$. We control the FDX using the Lehmann and Romano (2005) (LR) procedure, which rejects all tests corresponding to p-values below
$$\frac{(\lfloor \rho_i c \rfloor + 1)\,\alpha}{n + \lfloor \rho_i c \rfloor + 1 - \rho_i},$$
where $\lfloor \cdot \rfloor$ denotes the integer part.
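A corresponding sketch of the LR step-down rule, using the critical values above; as before, the function name and inputs are illustrative:

```python
# Lehmann-Romano step-down rule for FDX control at proportion c and
# level alpha, with critical values as in the text.
import numpy as np

def lr_fdx_reject(pvals, alpha, c=0.1):
    p = np.asarray(pvals)
    n = p.size
    order = np.argsort(p)
    ranks = np.arange(1, n + 1)                 # rho_i for sorted p-values
    floor_rc = np.floor(c * ranks).astype(int)  # integer part of rho_i * c
    crit = (floor_rc + 1) * alpha / (n + floor_rc + 1 - ranks)
    below = p[order] <= crit
    # Step-down: reject until the first sorted p-value exceeds its
    # critical value, then stop.
    k = n if below.all() else int(np.argmax(~below))
    reject = np.zeros(n, dtype=bool)
    reject[order[:k]] = True
    return reject
```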

Many other methods and error rates are available; for a review we refer to Farcomeni (2008). An important feature for our purposes is that the Bonferroni and Šidák corrections provide strong control of the FWER, which is bounded no matter the number and configuration of the outliers. In contrast, the procedures based on the FDR and FDX ensure only weak control of the FWER, which is then bounded only under the complete null hypothesis of no outliers, $H_0: \bigcap_{i=1}^{n} H_{0i}$. The main consequence for outlier detection is that FDR (or FDX) control provides a balance between ignoring multiplicity, as in Hardin and Rocke (2005) or in Hubert et al. (2008), and strictly correcting for multiplicity through FWER control, as in Becker and Gather (1999) or in Cerioli et al. (2009). The improvement obtained by controlling (4) or (7) may be particularly advantageous when $n$ is large, or when many samples of moderate size need to be analyzed in sequence. In such instances the total number of hypotheses (1) to be tested will be large, and the loss of power induced by strong FWER control will become more relevant.


FDR and FDX rules for multivariate outlier detection

The performance of any outlier detection method with well-behaved data sets is ruled by two basic elements:

(a) availability of a good approximation to the unknown finite-sample null distribution of the squared robust distances (2);

(b) correction for the multiplicity implied by repeated testing of the $n$ individual hypotheses (1).

Avoiding (b) leads to identifying a proportion α of false outliers in any good data set, a situation that can have negative consequences in many applications like the examples
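To see why point (a) matters, note that even for classical (non-robust) Mahalanobis distances the exact finite-sample null distribution differs from the asymptotic $\chi^2_v$: with the sample mean and the unbiased sample covariance, the within-sample squared distance follows a scaled Beta law. A minimal sketch of the resulting p-value computation (the paper relies on analogous finite-sample results tailored to RMCD distances, not on this classical case):

```python
# Exact null distribution of the classical within-sample squared
# Mahalanobis distance (sample mean, unbiased sample covariance):
#   d_i^2 ~ ((n - 1)^2 / n) * Beta(v / 2, (n - v - 1) / 2).
from scipy import stats

def classical_md_pvalue(d2, n, v):
    scale = (n - 1) ** 2 / n
    return stats.beta.sf(d2 / scale, v / 2, (n - v - 1) / 2)
```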

Enemy brothers: power and swamping

We now show the results of a simulation experiment run under the location-shift contamination model $N(\mu + \lambda e, \Sigma)$, where $\lambda$ is a positive scalar and $e$ is a column vector of ones. In our study, a proportion $\omega$ of the observations comes from the location-shift contamination model, while the remaining $n(1 - \omega)$ observations come from the null $N(\mu, \Sigma)$ model. We call $\omega$ the contamination rate. We also define power to be the proportion of contaminated observations correctly labeled as outliers. Without loss of
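The design just described can be sketched as follows: a minimal simulation of one contaminated sample, with $\mu = 0$, $\Sigma = I$ and illustrative values of $n$, $v$, $\omega$ and $\lambda$. Any of the detection rules sketched above can then be applied to the resulting sample.

```python
# One sample under the location-shift contamination model: a proportion
# omega of observations is shifted by lambda along the vector of ones.
import numpy as np

rng = np.random.default_rng(1)
n, v = 200, 4
omega, lam = 0.1, 2.0                      # contamination rate and shift

n_out = int(round(n * omega))
clean = rng.multivariate_normal(np.zeros(v), np.eye(v), size=n - n_out)
shifted = rng.multivariate_normal(lam * np.ones(v), np.eye(v), size=n_out)
y = np.vstack([clean, shifted])
is_outlier = np.r_[np.zeros(n - n_out, bool), np.ones(n_out, bool)]

# Power = proportion of contaminated observations correctly flagged:
# power = flagged[is_outlier].mean() for any detection rule `flagged`.
```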

Data analysis

We outline two real data examples, on which we demonstrate the usefulness of multiplicity corrections. Both examples might also be seen as classification problems, for which alternative solutions are available. Compared with many classifiers, the outlier detection framework has the disadvantage of requiring distributional assumptions on the sample. Nevertheless, we believe that our approach is worthwhile in these examples for several reasons.

First, with statistical classifiers it may be hard to

Conclusions

In this paper we have explored alternative ways to reconcile the two opposite goals of multivariate outlier detection: achieving high power under contamination and ensuring low swamping with well-behaved data. We have shown that the choice among the alternative methodologies mainly depends on the user's attitude towards swamping. The FSRMCD and IRMCD procedures proposed by Cerioli (2010a) have opposite performances. With FSRMCD the level of swamping is kept under control for any number and

Acknowledgements

The authors are grateful to two anonymous reviewers for many useful comments that helped to sharpen the focus of the article. The authors also thank Anthony C. Atkinson and Marco Riani for helpful discussions on previous drafts of this work. Research of the first author was partially supported by the grant “Nuovi metodi multivariati robusti per la valutazione delle politiche sull’e-government e la società dell’informazione” of the Ministero dell’Università e della Ricerca, PRIN 2008.

References

• A. Farcomeni et al., Nonparametric analysis of infrared spectra for recognition of glass and ceramic glass fragments in recycling plants, Waste Management (2008).
• P. Filzmoser et al., Outlier identification in high dimensions, Computational Statistics and Data Analysis (2008).
• V. Todorov et al., Robust statistic for the one-way MANOVA, Computational Statistics and Data Analysis (2010).
• C. Becker et al., The masking breakdown point of multivariate outlier identification rules, Journal of the American Statistical Association (1999).
• Y. Benjamini et al., Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B (1995).
• Y. Benjamini et al., The control of the false discovery rate in multiple testing under dependency, Annals of Statistics (2001).
• A. Cerioli, Multivariate outlier detection with high-breakdown estimators, Journal of the American Statistical Association (2010).
• A. Cerioli, Diagnostic checking of multivariate normality under contamination (2010b). …
• A. Cerioli et al., Controlling the size of multivariate outlier tests with the MCD estimator of scatter, Statistics and Computing (2009).
• C. Croux et al., Influence functions of the Spearman and Kendall correlation measures, Statistical Methods and Applications (2010).
• L. Davies et al., The identification of multiple outliers, Journal of the American Statistical Association (1993).
• A. Farcomeni, Some results on the control of the false discovery rate under dependence, Scandinavian Journal of Statistics (2007).
• A. Farcomeni, A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion, Statistical Methods in Medical Research (2008).
• A. Farcomeni, Generalized augmentation to control the false discovery exceedance in multiple testing, Scandinavian Journal of Statistics (2009).