1 Introduction

Outlier detection is a prominent data mining task whose goal is to single out anomalous observations, also called outliers [2]. While other data mining approaches treat outliers as noise to be eliminated, as pointed out in [11], “one person’s noise could be another person’s signal”, and outliers themselves are of great interest in many settings (e.g. fraud detection, ecosystem disturbances, intrusion detection, cybersecurity, and medical analysis, to cite a few).

Data mining approaches to outlier detection can be supervised, semi-supervised, or unsupervised [8, 13]. Supervised methods take as input data labeled as normal and abnormal and build a classifier; the challenge here is that abnormal data form a rare class. Semi-supervised methods, also called one-class classifiers or domain description techniques, take as input only normal examples and use them to identify anomalies. Unsupervised methods detect outliers in an input dataset by assigning a score or anomaly degree to each object.

Unsupervised outlier detection methods can be categorized into several approaches, each of which assumes a specific notion of outlier. Among the most popular families are distance-based [4, 5, 16, 23], density-based [7, 15, 20], angle-based [18], isolation-forest [19], subspace methods [1, 14], and others [2, 9, 25].

This work focuses on the unsupervised outlier detection problem in the full feature space. In particular, we introduce a novel notion of outlier, the Concentration Free Outlier Factor (\(\hbox {CFOF}\)), whose peculiarity is to resist the concentration phenomena affecting other measures. Informally, the \(\hbox {CFOF}\) score measures how many neighbors have to be taken into account in order for an object to be considered close by an appreciable fraction of the population. The term distance concentration refers to the tendency of distances to become almost indiscernible as dimensionality increases, and is part of the so-called curse of dimensionality [6, 10]. The concentration problem also affects outlier scores of different families, due to the specific role played by distances in their formulation [17, 25]. Moreover, a special kind of concentration phenomenon, known as hubness, concerns scores based on reverse nearest neighbor counts [12, 22]: the scores concentrate towards the values associated with anomalies, so that almost all the dataset appears to be composed of outliers.

The contributions of the work within this scenario are summarized next:

  • As a major peculiarity, we formally show that, unlike practically all existing outlier scores, the \(\hbox {CFOF}\) score distribution is not affected by the concentration phenomena arising when the dimensionality of the space increases.

  • The \(\hbox {CFOF}\) score is adaptive to different local density levels. Although local methods usually require the exact nearest neighbors in order to compare the neighborhood of each object with the neighborhoods of its neighbors, this is not the case for \(\hbox {CFOF}\), which can be reliably computed through sampling. This characteristic is favored by the separation between inliers and outliers guaranteed by the absence of concentration.

  • We describe the \({\textit{fast-}\hbox {CFOF}}\) technique, which from the computational point of view does not suffer from the problems affecting (reverse) nearest neighbor search techniques. The cost of \({\textit{fast-}\hbox {CFOF}}\) is linear both in the dataset size and in the dimensionality. Moreover, we provide a multi-core (MIMD) vectorized (SIMD) implementation.

  • The applicability of the technique is not limited to Euclidean or vector spaces. It can be applied both in metric and non-metric spaces equipped with a distance function.

  • Experimental results highlight that \(\hbox {CFOF}\) exhibits state-of-the-art detection performance.

The rest of the work is organized as follows. Section 2 introduces the \(\hbox {CFOF}\) score and its properties. Section 3 describes the \({\textit{fast-}\hbox {CFOF}}\) algorithm. Section 4 presents experimental results. Finally, Sect. 5 draws conclusions.

2 The Concentration Free Outlier Factor

2.1 Definition

We assume that a dataset \(\mathbf{DS}= \{x_1,x_2,\ldots ,x_n\}\) of n objects belonging to an object space \(\mathbb {U}\), on which a distance function \(\mathrm{dist}\) is defined, is given in input. We assume that \(\mathbb {U} = \mathbb {D}^d\) (where \(\mathbb {D}\) is usually the set \(\mathbb {R}\) of real numbers), with \(d\in \mathbb {N}^+\), but the method can be applied in any object space equipped with a distance function (not necessarily a metric).

Given an object x and a positive integer k, the k-th nearest neighbor of x is the object \({ nn}_k(x)\) such that there exist exactly \(k-1\) objects lying at distance smaller than \(\mathrm{dist}(x,{ nn}_k(x))\) from x. It always holds that \(x={ nn}_1(x)\). We assume that ties are non-deterministically ordered. The k nearest neighbors set \(\mathrm{NN}_k(x)\) of x, where k is also called the neighborhood width, consists of the objects \(\{ { nn}_i(x) \mid 1\le i\le k \}\).

By \(\mathrm{N}_k(x)\) we denote the number of objects having x among their k nearest neighbors:

$$\begin{aligned} \mathrm{N}_k(x) = |\{ y : x\in \mathrm{NN}_k(y) \}|, \end{aligned}$$

also referred to as reverse k nearest neighbor count or reverse neighborhood size.

Given a parameter \(\varrho \in (0,1)\) (or equivalently a parameter \(k_\varrho \in [1,n]\) such that \(k_\varrho =n\varrho \)), the Concentration Free Outlier Score, also referred to as \(\hbox {CFOF}\), is defined as:

$$\begin{aligned} \hbox {CFOF}(x) = \min \left\{ k/n : \mathrm{N}_k(x) \ge n\varrho \right\} , \end{aligned}$$
(1)

that is to say, the score returns the smallest neighborhood width (normalized with respect to n) for which the object x exhibits a reverse neighborhood of size at least \(n\varrho \) (or, equivalently, \(k_\varrho \)).

Intuitively, the \(\hbox {CFOF}\) score measures how many neighbors have to be taken into account in order for the object to be considered close by an appreciable fraction of the dataset objects. We notice that this way of perceiving the abnormality of an observation is completely different from any other notion introduced so far in the literature.
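To make the definition concrete, the following minimal Python sketch computes the score of Eq. (1) directly from the pairwise distance matrix. It is an illustrative reconstruction only (the function name hard_cfof and the use of NumPy are our own choices, not the authors' implementation), with quadratic cost in the dataset size.

```python
import numpy as np

def hard_cfof(X, rho=0.01):
    """Naive O(n^2 d) computation of the CFOF score of Eq. (1); X is an (n, d) matrix."""
    n = X.shape[0]
    # pairwise Euclidean distances (any other distance function could be plugged in)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # rank[y, x] = position of x in the neighbor ranking of y,
    # 1-based so that rank[y, y] == 1 (each object is its own first neighbor)
    order = np.argsort(D, axis=1)
    rank = np.empty_like(order)
    rank[np.arange(n)[:, None], order] = np.arange(1, n + 1)
    # N_k(x) >= n*rho holds for the first time when k equals the
    # ceil(n*rho)-th smallest rank of x over all objects y, hence:
    k_rho = int(np.ceil(n * rho))
    smallest_ranks = np.sort(rank, axis=0)[k_rho - 1, :]
    return smallest_ranks / n
```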

The \(\hbox {CFOF}\) score is adaptive to different density levels. This characteristic is also favored by the fact that actual distance values are not employed in its computation. Thus, \(\hbox {CFOF}\) is invariant to all transformations that leave the nearest neighbor ranking unchanged, such as translation or scaling. Also, duplicating the data in a way that does not affect the original neighborhood order (e.g. by creating a separate, possibly scaled, cluster from each copy of the original data) preserves the original scores.

Fig. 1. Two normal clusters with different standard deviation.

Consider Fig. 1, showing a dataset consisting of two normally distributed clusters, each of 250 points. The cluster centered in (4, 4) is obtained by translating and scaling (by a factor 0.5) the cluster centered in the origin. The top 25 \(\hbox {CFOF}\) outliers for \(k_\varrho =20\) are highlighted (objects within small circles). It can be seen that the outliers are the “same” objects in the two clusters. A minimal reproduction of this setup is sketched below.
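The sketch relies on the hypothetical hard_cfof function given above; the random seed and the specific library calls are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
c1 = rng.standard_normal((250, 2))               # cluster centered in the origin
c2 = 0.5 * rng.standard_normal((250, 2)) + 4.0   # translated and scaled (factor 0.5) copy
X = np.vstack([c1, c2])

scores = hard_cfof(X, rho=20 / len(X))           # k_rho = 20
top25 = np.argsort(scores)[::-1][:25]            # indices of the top-25 CFOF outliers
```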

2.2 Relationship with the Distance Concentration Phenomenon

The term distance concentration, which is part of the so-called curse of dimensionality problem [6], refers to the tendency of distances to become almost indiscernible as dimensionality increases. More quantitatively, this phenomenon is measured through the ratio between a quantity related to the mean \(\mu \) and a quantity related to the standard deviation \(\sigma \) of the distance distribution of interest. E.g., in [10] the intrinsic dimensionality \(\rho \) of a metric space is defined as \(\rho =\mu _d^2/(2\sigma _d^2)\), where \(\mu _d\) is the mean of the pairwise distance distribution and \(\sigma _d\) the associated standard deviation. The intrinsic dimensionality quantifies the expected difficulty of performing a nearest neighbor search: the smaller the ratio, the larger the difficulty of searching in an arbitrary metric space.

In general, we say that there is concentration when this kind of ratio tends to zero as dimensionality goes to infinity, as is the case for objects with i.i.d. attributes.
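The following small simulation (our own illustration, not taken from the paper) shows the behavior of this ratio for i.i.d. uniform attributes as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, pairs = 1000, 5000
for d in [1, 10, 100, 1000, 10000]:
    X = rng.random((n, d))                          # i.i.d. uniform attributes
    i, j = rng.integers(0, n, (2, pairs))
    mask = i != j                                   # discard self-pairs
    dist = np.linalg.norm(X[i[mask]] - X[j[mask]], axis=1)
    mu, sigma = dist.mean(), dist.std()
    # sigma/mu shrinks as d grows (concentration), while mu^2/(2*sigma^2),
    # the intrinsic dimensionality of [10], grows with d
    print(d, sigma / mu, mu**2 / (2 * sigma**2))
```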

The concentration problem also affects different families of outlier scores, due to the specific role played by distances in their formulation.

Figure 2 reports the sorted scores of different outlier detection techniques, namely aKNN [5], LOF [7], ABOF [18], and \(\hbox {CFOF}\) (the parameter k of aKNN, LOF, and ABOF, and \(k_\varrho \) of \(\hbox {CFOF}\), are held fixed to 50 for all scores), associated with a family of uniformly distributed datasets having fixed size (\(n=1000\)) and increasing dimensionality \(d\in [10^0,10^4]\). The figure highlights that, except for \(\hbox {CFOF}\), all scores exhibit a concentration effect. For aKNN (Fig. 2a) the mean score value rises while the spread stays limited. For LOF (Fig. 2b) all the values tend to 1 as the dimensionality increases. For ABOF (Fig. 2c) both the mean and the standard deviation decrease by several orders of magnitude, with the latter varying at a faster rate than the former. As for \(\hbox {CFOF}\), the score distributions for \(d>100\) are very close and exhibit only slight changes. Notably, the separation between the scores associated with outliers and inliers always remains ample.

Fig. 2. Sorted outlier scores.

2.3 Relationship with the Hubness Phenomenon

\(\hbox {CFOF}\) has connections with the reverse neighborhood size, a tool which has also been used for characterizing outliers. In [12], the authors proposed the use of the reverse neighborhood size \(\mathrm{N}_k(\cdot )\) as an outlier score, which we refer to as the RNN count (RNNc for short). Outliers are the objects associated with the lowest RNN counts. However, RNNc suffers from a peculiar problem known as hubness [21]. As the dimensionality of the space increases, the number of antihubs, that is objects appearing in a much lower number of k nearest neighbor sets (possibly being neighbors only of themselves), overcomes the number of hubs, that is objects appearing in many more k nearest neighbor sets than other points, and, according to the RNNc score, the vast majority of the dataset objects become outliers with identical scores.
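A compact sketch of this skew follows; it is an illustration under our own choices of n, k, and normal data, not the exact experimental setting of the paper.

```python
import numpy as np

def reverse_nn_counts(X, k):
    """N_k(x): number of objects having x among their k nearest neighbors."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.argsort(D, axis=1)[:, :k]      # row i: the k NN of object i (itself included)
    return np.bincount(knn.ravel(), minlength=n)

rng = np.random.default_rng(0)
for d in [10, 100, 1000]:
    N = reverse_nn_counts(rng.standard_normal((1000, d)), k=50)
    # as d grows the counts tend to become more skewed: a few hubs with very
    # large N_k and an increasing number of antihub-like objects with small N_k
    print(d, N.max(), np.mean(N <= 5))
```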

Fig. 3. Distribution of \(\hbox {CFOF}\) and RNN counts.

Fig. 4. Comparison between \(\hbox {CFOF}\) and RNN counts.

We provide evidence that \(\hbox {CFOF}\) does not present the hubness problem. Figure 3 reports the distribution of the \(\mathrm{N}_k(\cdot )\) value and of the \(\hbox {CFOF}\) absolute score for a ten thousand dimensional normal dataset (a very similar behavior has been observed also for uniform data). Notice that \(\hbox {CFOF}\) outliers are associated with the largest score values, hence with the tails of the corresponding distribution, while RNNc outliers are associated with the smallest score values, hence with the largely populated region of the associated score distribution, a completely opposite behavior. To illustrate the impact of the hubness problem as the dimensionality grows, Fig. 4 shows the cumulative frequency associated with the normalized increasing score. The normalization makes the comparison more interpretable: original scores have been mapped to [0, 1], \(\hbox {CFOF}\) scores by dividing them by their maximum value, and \(\mathrm{N}_k(\cdot )\) through the mapping \(1-\frac{\mathrm{N}_k(x)}{\max _y\mathrm{N}_k(y)}\), since outliers are the objects associated with the lowest counts. The plots make evident the deep difference between the two approaches. Here both n and k for RNNc (\(k_\varrho \) for \(\hbox {CFOF}\), resp.) are held fixed, while d is increased. As for RNNc, the hubness problem is already evident for \(d=10\) (where objects with a normalized score \(\ge 0.8\) correspond to about \(40\%\) of the dataset), while the curve for \(d=10^2\) closely resembles that for \(d=10^4\) (where almost all the dataset objects have a normalized score \({\ge }0.8\)). As far as \(\hbox {CFOF}\) is concerned, the curves for \(d=10^2\) and \(d=10^4\) closely resemble each other, and the number of objects associated with a large score value always corresponds to a very small fraction of the dataset population.

2.4 Concentration Free Property of CFOF

In this section we formally prove that the \(\hbox {CFOF}\) score is concentration-free. Specifically, the following theorem shows that the separation between the scores associated with outliers and the rest of the scores is guaranteed for arbitrarily large dimensionality.

Before going into the details, we recall that the concept of intrinsic dimensionality of a space is identified as the minimum number of variables needed to represent the data, which corresponds in a linear space to the number of linearly independent vectors needed to describe each point.

Theorem 1

Let \(\mathbf{DS }^{(d)}\) be a d-dimensional dataset consisting of realizations of a d-dimensional random vector \(\mathbf X \) having independent and (not necessarily) identically distributed components with distribution function f. Then, as \(d\rightarrow \infty \), the \(\hbox {CFOF}\) scores of the points of \(\mathbf{DS }^{(d)}\) do not concentrate.

Proof

Consider the squared norm \(\Vert \mathbf X \Vert ^2 = \sum _{i=1}^d X_i^2\) of the random vector \(\mathbf X \). As \(d\rightarrow \infty \), by the Central Limit Theorem, the standard score of \(\sum _{i=1}^d X_i^2\) tends to a standard normal distribution. This implies that \(\Vert \mathbf X \Vert ^2\) approaches a normal distribution with mean \(\mu _{\Vert \mathbf X \Vert ^2}=d\,\mathbf E [X_i^2]=d\mu _2\) and variance \(\sigma ^2_{\Vert \mathbf X \Vert ^2}=d\big (\mathbf E [(X_i^2)^2]-\mathbf E [X_i^2]^2\big ) = d(\mu _4 - \mu _2^2)\), where \(\mu _2\) and \(\mu _4\) are the 2nd- and 4th-order central moments of the univariate probability distribution f.

In the case that the components \(X_i\) of \(\mathbf X \) are non-identically distributed according to the distributions \(f_i\) (\(1\le i\le d\)), the result still holds by considering the average of the central moments of the \(f_i\) functions.

Let x be an element of \(\mathbf{DS }^{(d)}\) and define the standard score \(z_x\) of the squared norm of x as

$$\begin{aligned} z_{x} = \frac{\Vert x\Vert ^2-\mu _{\Vert \mathbf X \Vert ^2}}{\sigma _{\Vert \mathbf X \Vert ^2}}. \end{aligned}$$

It can be shown [3] that, assuming w.l.o.g. that \(\mathbf {E}[X]=0\), for large values of d the number of k-occurrences of x is given by

$$\begin{aligned} \mathrm{N}_k(x) = n\cdot Pr[x\in \mathrm{NN}_k(\mathbf X )] \approx n \varPhi \left( \frac{\varPhi ^{-1}(\frac{k}{n})\sqrt{\mu _4+3\mu _2^2} - z_x \sqrt{\mu _4-\mu _2^2}}{2\mu _2} \right) . \end{aligned}$$

Let \(t(z_x)\) denote the smallest integer k such that \(\mathrm{N}_k(x)\ge n\varrho \). By exploiting the equation above it can be concluded that

$$\begin{aligned} t(z_x) \approx n \varPhi \left( \frac{z_{x} \sqrt{\mu _4-\mu _2^2} + 2\mu _2\varPhi ^{-1}(\varrho ) }{ \sqrt{\mu _4+3\mu _2^2} } \right) . \end{aligned}$$

Since \(\hbox {CFOF}(x)=k/n\) implies that k is the smallest integer such that \(\mathrm{N}_k(x)\ge n\varrho \), it also follows that \(\hbox {CFOF}(x) \approx t(z_x)/n = \hat{t}(z_x)\).
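As a concrete instance (a direct specialization of the formula above), for i.i.d. standard normal components we have \(\mu _2=1\) and \(\mu _4=3\), hence

$$\begin{aligned} \hat{t}(z_x) \approx \varPhi \left( \frac{\sqrt{2}\, z_{x} + 2\,\varPhi ^{-1}(\varrho ) }{ \sqrt{6} } \right) . \end{aligned}$$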

Moreover, since as stated above the \(\Vert \mathbf X \Vert ^2\) random variable is normally distributed, it also holds that for each \(z\ge 0\)

$$\begin{aligned} Pr\left[ \frac{\Vert \mathbf X \Vert ^2-\mu _{\Vert \mathbf X \Vert ^2}}{\sigma _{\Vert \mathbf X \Vert ^2}} \le z \right] = \varPhi (z), \end{aligned}$$

where \(\varPhi (\cdot )\) denotes the cdf of the normal distribution.

Thus, for arbitrarily large values of d and for any standard score value \(z\ge 0\)

$$\begin{aligned} Pr\left[ CFOF(\mathbf X ) \ge \hat{t}(z) \right] = 1 - \varPhi (z), \end{aligned}$$

irrespective of the actual value d of the data dimensionality. Hence, for every fixed z, a constant fraction \(1-\varPhi (z)\) of the population has a score of at least \(\hat{t}(z)\), that is, the \(\hbox {CFOF}\) scores do not concentrate.

3 Score Computation

\(\hbox {CFOF}\) scores can be computed in time \(O(n^2 d)\), where d denotes the dimensionality of the feature space (or, more generally, the cost of computing a distance), after computing all pairwise dataset distances. Next we introduce a technique, named \({\textit{fast-}\hbox {CFOF}}\), which does not require the computation of the exact nearest neighbor sets and, from the computational point of view, does not suffer from the curse of dimensionality affecting nearest neighbor search techniques.

The technique builds on the following probabilistic formulation of the \(\hbox {CFOF}\) score. Assume that the dataset consists of n i.i.d. samples drawn according to an unknown probability law \(p(\cdot )\). Given a parameter \(\varrho \in (0,1)\), the (Probabilistic) Concentration Free Outlier Factor \(\hbox {CFOF}\) is defined as follows:

$$\begin{aligned} \hbox {CFOF}(x) = \min \left\{ k/n : \mathbf{E}\big [Pr[x\in \mathrm{NN}_{k}(y)]\big ] \ge \varrho \right\} . \end{aligned}$$
(2)

To differentiate the two definitions reported in Eqs. (1) and (2), we also refer to the former as \({\textit{hard-}\hbox {CFOF}}\) and to the latter as \({\textit{soft-}\hbox {CFOF}}\). Intuitively, the \({\textit{soft-}\hbox {CFOF}}\) score measures how many neighbors have to be taken into account in order for the expected number of dataset objects having the given object among their neighbors to correspond to the fraction \(\varrho \) of the overall population.

3.1 The \({\textit{fast-}\hbox {CFOF}}\) Technique

Given a dataset \(\mathbf{DS}\) and two objects x and y from \(\mathbf{DS}\), the building block of the algorithm is the computation of \(Pr[x\in \mathrm{NN}_k(y)]\). Consider the boolean function \(B_{x,y}(z)\), defined on instances z of \(\mathbf{DS}\), such that \(B_{x,y}(z)=1\) if z lies within the region \(\mathcal{I}_{\mathrm{dist}(x,y)}(y)\), i.e. the ball of radius \(\mathrm{dist}(x,y)\) centered in y, and 0 otherwise. We want to estimate the average value \(\overline{B}_{x,y}\) of \(B_{x,y}\) in \(\mathbf{DS}\), which corresponds to the probability p(x, y) that a randomly picked dataset object z is at distance not greater than \(\mathrm{dist}(x,y)\) from y.

It is enough to compute \(\overline{B}_{x,y}\) within a certain error bound. Thus, we resort to batch sampling, which consists of picking s elements of \(\mathbf{DS}\) at random and estimating \(p(x,y) = \overline{B}_{x,y}\) as the fraction \(\hat{p}(x,y)\) of the sampled elements satisfying \(B_{x,y}\) [24]. Given \(\delta >0\) (an error probability) and \(\epsilon \), \(0<\epsilon <1\) (an absolute error), if the sample size s satisfies certain conditions [24] then

$$\begin{aligned} Pr[|\hat{p}(x,y)-p(x,y)|\le \epsilon ] > 1-\delta . \end{aligned}$$
(3)

For large values of n, since the variance of the Binomial distribution becomes negligible with respect to its mean, the cdf \(\textit{binocdf}(k;p,n)\) tends to the step function \(H\big (k-np\big )\), where \(H\big (k\big )=0\) for \(k<0\) and \(H\big (k\big )=1\) for \(k>0\). Thus, we can approximate the value \(Pr[x\in \mathrm{NN}_k(y)] = \textit{binocdf}(k;p(x,y),n)\) with the boolean function \(H\big (k-k_{up}(x,y)\big )\), where \(k_{up}(x,y)=n \widehat{p}(x,y)\). It then follows that \(\mathbf{E}\big [Pr[x\in \mathrm{NN}_k(y)]\big ]\) can be obtained as the average value of the boolean function \(H\big (k-n \widehat{p}(x,y)\big )\), whose estimate can again be obtained by exploiting batch sampling. Specifically, \({\textit{fast-}\hbox {CFOF}}\) exploits one single sample in order to perform both of the estimates described above.
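A quick numerical check of the step-function approximation (a sketch using SciPy; the specific values of n and p are arbitrary):

```python
import numpy as np
from scipy.stats import binom

n, p = 10**6, 0.01                        # mean n*p = 10000, std ~ 99.5: tiny relative to n
ks = np.array([9000, 9900, 10000, 10100, 11000])
print(binom.cdf(ks, n, p))                # ~ [0, 0.16, 0.5, 0.84, 1]: a sharp jump around n*p
print((ks > n * p).astype(float))         # the step approximation H(k - n*p)
```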

The algorithm \({\textit{fast-}\hbox {CFOF}}\) receives in input a list \(\varvec{\varrho } = \varrho _1,\ldots ,\varrho _\ell \) of values for the parameter \(\varrho \), since it is able to perform a multi-resolution analysis, that is, to compute the scores associated with different values of the parameter \(\varrho \) at no additional cost. Both \(\varvec{\varrho }\) and the parameters \(\epsilon ,\delta \) can be conveniently left at their default values (\(\varvec{\varrho }=0.001, 0.005, 0.01, 0.05, 0.1\) and \(\epsilon ,\delta =0.01\); see later for details).

First, the algorithm determines the size \(s = \left\lceil \frac{1}{2\epsilon ^2}\log \left( \frac{1}{\delta }\right) \right\rceil \) of the sample (or partition) of the dataset needed in order to guarantee the bound reported in Eq. (3). We notice that the algorithm does not require the dataset to be entirely loaded in main memory, since only one partition at a time is needed to carry out the computation. Thus, the technique is suitable also for disk-resident datasets. We assume that dataset objects are randomly ordered and, hence, that partitions can be contiguous; otherwise, randomization can be done in linear time and constant space by disk-based shuffling. Each partition, consisting of a group of s consecutive objects, is processed by the partition subroutine (see Algorithm 1), which estimates the \(\hbox {CFOF}\) scores of the objects within the partition through batch sampling.

The matrix hst, consisting of \(s\times B\) counters, is employed by the subroutine. The entry hst(i, k) of hst is used to estimate how many times the sample object \(x'_i\) is the k-th nearest neighbor of a generic dataset object. Values of k, ranging from 1 to n, are partitioned into B log-spaced bins. The function \(k\_bin\) maps original k values to the corresponding bin, while \(k\_bin^{-1}\) implements the reverse mapping (by returning a certain value within the corresponding bin).

Algorithm 1. The \({\textit{fast-}\hbox {CFOF}}\) partition subroutine.

For each sample object \(x'_i\), the distance dst(j) to any other sample object \(x'_j\) is computed (lines 3–4) and, then, distances are sorted (line 5), obtaining the list ord of sample identifiers such that \(dst(ord(1))\le dst(ord(2)) \le \ldots \le dst(ord(s))\).

Moreover, for each element ord(j) of ord, the variable p is set to j/s (line 7), representing the probability \(p(x'_{ord(j)},x'_i)\), estimated through the sample, that a randomly picked dataset object is located within the region of radius \(dst(ord(j)) = dist(x'_i,x'_{ord(j)})\) centered in \(x'_i\). The value \(k_{up}\) (line 8) represents the point of transition from 0 to 1 of the step function \(H\big (k-k_{up}\big )\) employed to approximate the probability \(Pr[x'_{ord(j)}\in \mathrm{NN}_k(y)]\) when \(y=x'_i\). Thus, before concluding each cycle of the inner loop (lines 6–10), the \(k\_bin(k_{up})\)-th entry of hst associated with the sample \(x'_{ord(j)}\) is incremented.

The last step consists in the computation of the scores: for each sample object \(x'_i\), the associated counts are accumulated until their sum reaches the value \(\varrho s\), and the corresponding value of k is employed to obtain the score.
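Putting the steps just described together, a minimal Python reconstruction of the partition subroutine might look as follows. It is a sketch derived from the description above, not the authors' C implementation: the log-spaced bin edges, the ceiling in the computation of \(k_{up}\), and all names are our own assumptions.

```python
import numpy as np

def fast_cfof_partition(S, rhos, n, B=100):
    """Estimate CFOF scores for the s objects of partition S (an (s, d) array),
    for each rho in rhos, with respect to a dataset of n objects."""
    s = S.shape[0]
    edges = np.logspace(0, np.log10(n), B + 1)          # B log-spaced bins for k in [1, n]
    k_bin = lambda k: min(int(np.searchsorted(edges, k, side="right")) - 1, B - 1)
    k_bin_inv = lambda b: edges[b + 1]                  # a value within bin b
    hst = np.zeros((s, B))                              # the s x B counters
    for i in range(s):
        dst = np.linalg.norm(S - S[i], axis=1)          # lines 3-4: distances from x'_i
        ord_ = np.argsort(dst)                          # line 5: neighbor ordering of x'_i
        for j in range(s):                              # lines 6-10
            p = (j + 1) / s                             # line 7: estimate of p(x'_ord(j), x'_i)
            k_up = int(np.ceil(n * p))                  # line 8 (the paper refines this with
                                                        # a small constant c, see Sect. 4)
            hst[ord_[j], k_bin(k_up)] += 1
    scores = np.zeros((len(rhos), s))                   # last step: accumulate the counts
    for i in range(s):
        acc = np.cumsum(hst[i])
        for l, rho in enumerate(rhos):
            b = int(np.searchsorted(acc, rho * s))      # first bin where the sum reaches rho*s
            scores[l, i] = k_bin_inv(b) / n
    return scores
```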

The temporal cost of the technique is \(O\left( s\cdot n\cdot d \right) \), where s does not depend on the number n of dataset objects and can be regarded as a constant, while \(n\cdot d\) is the size of the input; hence the temporal cost is linear in the size of the input. As for the spatial cost, O(Bs) space is needed for storing the counters hst, O(2s) for the distances dst and the ordering ord, \(O(\ell s)\) for storing the scores, and O(sd) for the buffer maintaining the sample; hence the spatial cost is linear in the sample size.

Before concluding, we notice that \({\textit{fast-}\hbox {CFOF}}\) is an embarrassingly parallel algorithm, since partition computations do not need to communicate intermediate results. Thus, it is readily suitable for multi-processor/multi-computer systems. We implemented a version for multi-core processors (using gcc, OpenMP, and the AVX x86-64 instruction set extensions) that processes partitions sequentially, but employs both MIMD (cores) and SIMD (vector registers) parallelism to process each single partition.

4 Experimental Results

Experiments are performed on a PC based on an Intel Core i7 2.40 GHz CPU (having 4 cores with 8 hardware threads, and SIMD registers accommodating 8 single-precision floating-point numbers) with 8 GB of main memory, under the Linux operating system. As for the implementation parameters, the number B of hst bins is set to 100 and the constant c used to compute \(k_{up}\) is set to 2. We assume 0.01 as the default value for the parameters \(\varrho \), \(\epsilon \), and \(\delta \).

Some of the datasets employed are described next. Clust2 is a dataset family (with \(n\in [10^4,10^6]\) and \(d\in [2,10^3]\)) consisting of two normally distributed clusters centered in the origin and in \((4,\ldots ,4)\), with standard deviation 1.0 and 0.5 along each dimension, respectively. MNIST is a dataset of handwritten digits composed of \(n=60000\) vectors with \(d=784\) dimensions.

4.1 Accuracy

The goal of this experiment is to assess the quality of the result of \({\textit{fast-}\hbox {CFOF}}\) for different sample sizes, that is, for different combinations of the parameters \(\epsilon \) and \(\delta \). We notice that the default sample size is \(s = 26624\). With this aim, we first computed the exact dataset scores by setting the sample size s to n.

Fig. 5. Accuracy analysis of \({\textit{fast-}\hbox {CFOF}}\).

Figure 5 compares the exact scores with those obtained for the standard sample size on the Clust2 (for \(n=10^5\) and \(d=100\)) and MNIST datasets. The blue curve shows the exact scores sorted in descending order, with the x-axis representing the outlier rank position of the dataset objects. The red curve shows the approximate scores associated with the objects at each rank position. The curves highlight that the ranking position tends to be preserved and that in both cases the top outliers are associated with the largest scores.

Table 1. Spearman correlation between the exact and approximate outlier rankings computed by \({\textit{fast-}\hbox {CFOF}}\).

We can justify the accuracy of the method by noticing that the larger the \(\hbox {CFOF}\) score of x, the larger, for any y, the probability p(x, y) that a dataset object lies between x and y, and, moreover, the smaller the impact of the error \(\epsilon \) on the estimated value \(\widehat{p}(x,y)\). Intuitively, the objects we are interested in, that is the outliers, are precisely the ones least prone to bad estimates.

We employ Spearman’s rank correlation coefficient to assess the relationship between the two rankings. This coefficient is high (close to 1) when observations have similar ranks. Table 1 reports Spearman’s coefficients for different combinations of \(\epsilon \), \(\delta \), and \(\varrho \). The coefficient improves for increasing sample sizes (very high values are reached for the default sample) and for larger \(\varrho \) values (which exhibit high coefficient values also for small samples).
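For reference, the coefficient can be computed with SciPy as in the following sketch; the two score vectors are hypothetical placeholders, not values from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

exact_scores = np.array([0.31, 0.05, 0.12, 0.44, 0.07])    # hypothetical exact CFOF scores
approx_scores = np.array([0.29, 0.06, 0.10, 0.47, 0.08])   # hypothetical sampled estimates
corr, _ = spearmanr(exact_scores, approx_scores)           # close to 1 when rankings agree
print(corr)                                                # 1.0 here: identical rankings
```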

4.2 Scalability

Figure 6 shows the execution time on the Clust2 and MNIST datasets.

Figure 6a shows the execution time on Clust2 for the default sample size, with \(n\in [10^4,10^6]\) and \(d\in [2,10^3]\). The largest dataset considered (\(n=10^6\) and \(d=10^3\), occupying 4 GB of disk space) required about 44 min. \({\textit{fast-}\hbox {CFOF}}\) exhibits a sub-linear dependence on the dimensionality, due to the exploitation of SIMD parallelism. The dashed curves are obtained by disabling MIMD parallelism; the performance ratio between the two versions is about 7.6, confirming the effectiveness of the parallelization schema.

Figure 6b shows the execution time on Clust2 (\(n=10^6, d=10^3\)) and MNIST (180 MB of disk space) for different sample sizes. As for Clust2, the execution time drops from 44 min, for the default sample, to about 24 min, for \(s=15360\) (\(\epsilon =0.01, \delta =0.1\)). Finally, as for MNIST, the whole dataset (\(s=n\)) required less than 6 min, while about 3 min are required with the default sample.

Fig. 6. Scalability analysis of \({\textit{fast-}\hbox {CFOF}}\).

4.3 Effectiveness

On Clust2, we used the distance to the cluster centers as the ground truth. Specifically, for each dataset object, the distance R from the closest cluster center has been determined, and distances associated with the same cluster have been normalized as \(R'=\frac{R-\mu _R}{\sigma _R}\). Table 2 reports the Spearman correlation between the normalized distances \(R'\) and the \(\hbox {CFOF}\) scores. The high correlation values witness both the meaningfulness of the definition and its behavior as a local outlier measure even in high dimensions.

Table 2. Spearman correlation between the normalized distance to the object’s cluster center and the score computed by \({\textit{fast-}\hbox {CFOF}}\).

Figure 7 shows the eight top outliers of MNIST. These digits appear deformed, quite difficult to recognize, and possibly misaligned within the \(28\times 28\) cell grid.

Fig. 7. Top \(\hbox {CFOF}\) outliers of MNIST.

4.4 Comparison with Other Approaches

We compared \(\hbox {CFOF}\) with aKNN, LOF, and ABOD, by using some labelled datasets as ground truth. The datasets, randomly selected from the UCI ML Repository, are: Breast Cancer Wisconsin Diagnostic (\(n=569\), \(d=32\)), Image Segmentation (\(n=2310\), \(d=19\)), Ozone Level Detection (\(n=2536\), \(d=73\)), Pima Indians Diabetes (\(n=768\), \(d=8\)), QSAR Biodegradation (\(n=1055\), \(d=41\)), and Yeast (\(n=1484\), \(d=8\)). Each class in turn is marked as abnormal, and a dataset composed of all the objects of the other classes plus 10 randomly selected objects of the abnormal class is considered; the protocol is sketched below. Table 3 reports the Area Under the ROC Curve (AUC) obtained by \(\hbox {CFOF}\) (\({\textit{hard-}\hbox {CFOF}}\) has been used), aKNN, LOF, and ABOD. As for the parameters \(k_\varrho \) and k, for all the methods the corresponding parameter has been varied between 2 and 100, and the best result is reported in the table. Notice that the wins are 16 for \(\hbox {CFOF}\), 4 for aKNN, 2 for LOF, and 4 for ABOD. The comparison points out that \(\hbox {CFOF}\) represents an outlier detection definition with its own peculiarities, since the other methods behaved differently, and that it achieves state-of-the-art detection performance.
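A sketch of this evaluation protocol follows, using scikit-learn and the hypothetical hard_cfof of Sect. 2.1; data loading and the sweep over the parameter \(k_\varrho \) are omitted.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_for_abnormal_class(X, y, abnormal, rho, rng):
    """Keep all objects of the other classes plus 10 random objects of the
    abnormal class, then compute the AUC of the CFOF scores."""
    normal_idx = np.flatnonzero(y != abnormal)
    outlier_idx = rng.choice(np.flatnonzero(y == abnormal), size=10, replace=False)
    idx = np.concatenate([normal_idx, outlier_idx])
    scores = hard_cfof(X[idx], rho=rho)              # sketch from Sect. 2.1
    labels = (y[idx] == abnormal).astype(int)        # 1 marks the abnormal (outlier) class
    return roc_auc_score(labels, scores)
```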

5 Conclusions

We presented the Concentration Free Outlier Factor, a novel outlier detection measure whose main characteristics are to resist the concentration phenomena usually arising in high dimensional spaces and to allow very efficient and reliable outlier detection through the use of sampling. We are extending the study of the theoretical properties of the definition, assessing guarantees of the \({\textit{fast-}\hbox {CFOF}}\) algorithm, and extending the experimental activity. We believe that the \(\hbox {CFOF}\) score can offer insights also in the context of other data mining tasks, and we are currently investigating its application in other classification scenarios.

Table 3. AUCs for the labelled datasets.