1 Introduction

In this paper, we propose a novel and efficient method for outlier detection, which is an important task in data mining and has been applied to many problems such as fraud detection, intrusion detection, and data cleaning (Chandola et al. 2009). The goal of outlier detection is to find unusual data (outliers) in a given dataset, and many kinds of measures have been proposed to detect outliers. In this paper, we regard a datum as an outlier if the leave-one-out density is lower than a given threshold for a set of regions around the datum. The leave-one-out density is the ratio of the number of data elements inside a region to the volume of the region, where the focused datum is removed from the dataset before counting. In general, a leave-one-out method is time consuming because a learning procedure is repeated N times, where N denotes the cardinality of the dataset. However, we propose a method that evaluates the leave-one-out density efficiently without repeating a learning procedure N times.

We extend the work of Kutsuna (2010) to outlier detection by introducing the notion of leave-one-out density and developing an efficient algorithm to evaluate it. We employ the initial region method proposed by Kutsuna, in which a dataset is encoded into a Boolean formula and represented as a binary decision diagram. Kutsuna proposed a method that overapproximates the training dataset, that is, estimates a region that encloses the training dataset, by manipulating the binary decision diagram, and thereby learns a one-class classifier. Although one-class classification and outlier detection are very similar tasks, the method proposed by Kutsuna is not applicable to outlier detection, because the classifier is estimated as an overapproximation of the training dataset and therefore never classifies a datum in the training dataset as an outlier.

The proposed method is compared with other well-known outlier detection methods, namely the one-class support vector machine (Schölkopf et al. 2001), the local outlier factor (Breunig et al. 2000), the Orca algorithm (Bay and Schwabacher 2003), and the isolation forest (Liu et al. 2008), on benchmark datasets. The experimental results show that both the computation time and the outlier detection accuracy of the proposed method are better than those of the other methods for datasets that consist of both continuous and categorical attributes and have a large number of samples.

The rest of the paper is organized as follows: The outlier detection problem addressed in this paper is formally defined in Sect. 2. We review the initial region method in Sect. 3. In Sect. 4, the outline of the proposed method is stated, and then its efficient implementation based on binary decision diagrams is proposed. Section 5 shows experimental results. In Sect. 6, we conclude this paper by discussing the limitations of this study and our future work.

This paper extends the results reported in a conference paper (Kutsuna and Yamamoto 2014a) in the following ways: We describe how to deal with categorical attributes in the proposed method, which was not detailed in the conference version. We first modify the problem setting in Sect. 2 to include categorical attributes, and then explain the proposed method with both continuous and categorical attributes throughout the paper, including the proofs in Sect. 4.2.2. In Sect. 4.1, we introduce a new parameter \(\lambda \) to improve the performance of the proposed method. The impact of this parameter is experimentally examined in Sect. 5.2.2. In Sects. 5.2.3 and 5.2.4, the experiments comparing the proposed method with other methods are re-conducted with additional existing methods and datasets. We have also enriched Sects. 1.1, 3, 4.2.1, 5.2.1, and 6 with extra explanations and illustrations.

1.1 Related work

Kutsuna (2010) proposed a one-class classifier that overapproximates a training dataset. The approximation is done quite efficiently by manipulating a binary decision diagram that is obtained by encoding the training dataset. The situation considered in Kutsuna (2010) is that both a training dataset and a test dataset are given: A classifier is first learned from the training dataset, and then the test dataset is classified by the classifier. It may seem that we can detect outliers in a given dataset by using the dataset as both the training dataset and the test dataset simultaneously. However, no datum is detected as an outlier in such a setting, because the classifier is estimated as an overapproximation of the training dataset. Therefore, the method in Kutsuna (2010) cannot be applied to outlier detection directly.

For outlier detection, the Mahalanobis distance (Mahalanobis 1936) between the center of a dataset and each datum is often used as a score of the datum being an outlier. The Mahalanobis distance is a classical measure of distance between data points in which the correlation of the dataset is taken into account. The Mahalanobis distance works well if the dataset fits a normal distribution well, but this condition seldom holds in practice.

Schölkopf et al. (2001) extended the support vector machine (SVM), which was originally invented for binary classification, to outlier detection. Their method estimates a hyperplane that separates the origin and the dataset with maximum margin, and the hyperplane can become nonlinear by introducing kernel functions. The data that are classified to the origin side are detected as outliers. The SVM has the advantage that various nonlinear decision boundaries can be estimated by changing kernel parameters. A cross-validation method is often used to tune parameters in the SVM. Unfortunately, such a tuning method is not available in the case of outlier detection, because there is no labeled training dataset. Therefore, parameter tuning is a critical issue in applying the SVM to outlier detection. Some heuristics have been proposed to tune kernel parameters, such as Karatzoglou et al. (2004).

Breunig et al. (2000) proposed the local outlier factor (LOF), which is calculated on the basis of the distances to the k-nearest neighbors of each datum and has the advantage that it can detect local outliers, that is, data that are outlying relative to their local neighborhoods. The LOF has been shown to perform very well in realistic problems (Lazarevic et al. 2003). An efficient calculation of the k-nearest neighbors is essential in the LOF. Some techniques have been proposed to accelerate the k-nearest neighbor calculation, such as Beckmann et al. (1990).

Bay and Schwabacher (2003) proposed Orca, which finds outliers based on the distance to the nearest neighbors of each datum. A straightforward algorithm for the distance computations requires quadratic time with respect to the number of data points. Orca computes scores only for the top t outliers and reduces the calculation time by pruning the search space based on the cutoff, which is the lowest score of the top t outliers obtained thus far. Bay and Schwabacher (2003) experimentally showed that Orca works in nearly linear time on real datasets. Ghoting et al. (2008) proposed RBRP, which introduces a pre-processing phase before the nearest neighbor calculation. In the pre-processing phase, the dataset is analyzed using a clustering algorithm and principal component analysis to efficiently find data points that are likely to be near a focused data point. Using the information obtained in the pre-processing phase, RBRP facilitates the pruning of the search space based on the cutoff. Ghoting et al. (2008) report experimental results in which RBRP outperforms Orca in terms of computational efficiency.

Liu et al. (2008) proposed the isolation forest (iForest), which consists of a set of isolation trees. An isolation tree is constructed by recursively and randomly partitioning a subset sampled from the original dataset. A datum is regarded as an outlier if it is isolated closer to the root of the isolation trees on average. iForest has two parameters: the number of isolation trees t and the sampling size \(\psi \). Liu et al. (2008) experimentally observed that iForest works well with relatively small parameter values, e.g., \(t=100\) and \(\psi =256\). Recently, Aryal et al. (2014) proposed a modification of iForest to detect local outliers, in which the score of being an outlier is calculated on the basis of the relative mass, which takes the local data distribution into account.

2 Problem setting

Let (x, y) be a datum where \(x=(x_1,\ldots ,x_u)\) is a vector of continuous attributes and \(y=(y_1,\ldots ,y_v)\) is a vector of categorical attributes. Let D be a dataset that includes N data points, in which the ith datum is denoted by \((x^{(i)}, y^{(i)})\) \((i=1,\ldots ,N)\). We assume that there are no missing values in D and that all the data in D are unlabeled. In this paper, we regard a datum as an outlier if the leave-one-out density is lower than a given threshold for a set of regions around the datum. The leave-one-out density \(\rho _\mathrm{{LOO}}\) of the ith datum is defined as

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, \mathcal {D}\right) = \frac{\left| \left\{ \left( x^{(j)}, y^{(j)}\right) \ \bigl | \ x^{(j)} \in \mathcal {D},\ y^{(j)} = y^{(i)}, \ j \in \mathcal {I} {\setminus } i\right\} \right| }{vol\left( \mathcal {D}\right) }, \end{aligned}$$

where \(\mathcal {D}\) denotes a u-dimensional region such that \(x^{(i)} \in \mathcal {D}\), \(|\cdot |\) represents the cardinality of a set, \(\mathcal {I} = \{1,\ldots ,N\}\), and \(vol\left( \cdot \right) \) indicates the volume of the region. The outlier score S of the ith datum is defined as

$$\begin{aligned} S\left( i\right) = \max _{\mathcal {D} \in \tilde{\mathcal {D}}\left( i\right) } \rho _\mathrm{{LOO}}\left( i, \mathcal {D}\right) , \end{aligned}$$

where \(\tilde{\mathcal {D}}\left( i\right) \) represents a set of regions around \(x^{(i)}\) defined as

$$\begin{aligned} \tilde{\mathcal {D}}\left( i\right) = \left\{ u\text {-dimensional region } \mathcal {D} \ \bigl | \ x^{(i)} \in \mathcal {D}\right\} . \end{aligned}$$

A datum is detected as an outlier if its outlier score is less than the given threshold. Note that, in practice, we do not take \(\tilde{\mathcal {D}}\left( i\right) \) to be the set of all possible regions containing \(x^{(i)}\), but restrict it to a fixed family of regions, which is defined in Sect. 4.1.

The idea behind the proposed method is that, if the ith datum is not an outlier, there must be a region around the datum in which the other data exist with high density under the condition that categorical attributes are identical to the focused datum. On the other hand, if the ith datum is an outlier, it is expected that there is no region around the datum that contains the other data densely.
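For concreteness, the definitions above can be evaluated by brute force once a concrete family of regions is fixed. The following Python sketch does so for axis-aligned boxes; the helper `regions_around` is a user-supplied placeholder standing in for the family \(\tilde{\mathcal {D}}(i)\) of Sect. 4.1, and all names are illustrative rather than part of the proposed method.

```python
import numpy as np

def rho_loo(i, lo, hi, X, Y):
    """Leave-one-out density of datum i for the axis-aligned box [lo, hi)."""
    in_box = np.all((X >= lo) & (X < hi), axis=1)     # continuous attributes inside the box
    same_cat = np.all(Y == Y[i], axis=1)              # categorical attributes identical to datum i
    count = np.count_nonzero(in_box & same_cat) - 1   # leave the focused datum out
    return count / np.prod(hi - lo)

def outlier_score(i, X, Y, regions_around):
    """S(i): the maximum leave-one-out density over the regions around x^(i)."""
    return max(rho_loo(i, lo, hi, X, Y) for lo, hi in regions_around(X[i]))
```

Here X is an (N, u) array of continuous attributes and Y an (N, v) array of categorical codes (v may be 0).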

In contrast to some existing methods, such as LOF and Orca, the proposed method does not require a distance function that defines the distance between a pair of samples in the dataset. Such a feature is preferable because it is generally difficult to design a distance function properly for a dataset that consists of both continuous and categorical attributes.

The choice of the set of regions \(\tilde{\mathcal {D}}(i)\) determines the performance of the proposed method. In this paper, we assume \(\tilde{\mathcal {D}}(i)\) to be a set of u-dimensional hypercubes around the ith datum, which will be described in Sect. 4.1. Experimental results in Sect. 5 show that the proposed method can find outliers efficiently and effectively for a dataset that consists of both continuous and categorical attributes and has a large number of samples.

3 Preliminaries

In this section, we briefly review the initial region method (Kutsuna 2010) and define notation. Let \(\mathcal {H}\) be a u-dimensional hypercube defined as \(\mathcal {H} = [0, 2^m)^{u}\), where m is an arbitrary positive integer. Further, let \(\xi \) be an example normalizer such that \(\xi (x^{(i)}) \in \mathcal {H}\) for every \(x^{(i)}\) in D. An example of \(\xi \) is given in Kutsuna (2010) as the following simple scaling function:

$$\begin{aligned} \xi (x) = \left( \frac{scale(x_1)-x_{min}}{x_{max}-x_{min}} \left( 2^m - \epsilon \right) , \ldots , \frac{scale(x_u)-x_{min}}{x_{max}-x_{min}} \left( 2^m - \epsilon \right) \right) , \end{aligned}$$
(1)

where \(\epsilon \) denotes an arbitrary small positive number and

$$\begin{aligned} scale(x_j)&= \frac{x_j - \mu _j}{\sigma _j} \quad (j=1, \ldots , u), \\ x_{max}&= \max _{i=1, \ldots , N;\ j=1, \ldots , u} \left( scale\left( x^{(i)}_j\right) \right) , \\ x_{min}&= \min _{i=1, \ldots , N;\ j=1, \ldots , u} \left( scale\left( x^{(i)}_j\right) \right) , \end{aligned}$$

where \(x_j\) represents the jth element of x, and \(\mu _j\) and \(\sigma _j\) denote the mean and standard deviation of \(x_j\), respectively, which are calculated from D. We denote the projected datum \(\xi (x)\) as z.

The neighborhood function \(\nu \), which is defined as \(\nu (z) := \left[ \lfloor z_1 \rfloor , \lfloor z_1 \rfloor + 1 \right) \times \cdots \times \left[ \lfloor z_u \rfloor , \lfloor z_u \rfloor + 1 \right) \), where \(\lfloor \cdot \rfloor \) is the floor function, returns a u-dimensional unit hypercube that subsumes \(z =\xi (x)\).
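The scaling normalizer (1) and the neighborhood function \(\nu \) translate directly into code. The following sketch assumes an (N, u) data matrix X and is meant only to mirror the formulas above, not the authors' implementation.

```python
import numpy as np

def make_normalizer(X, m, eps=1e-9):
    """Build the scaling normalizer xi of Eq. (1) from the dataset X (an N x u array)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    scaled = (X - mu) / sigma                      # scale(), applied column-wise
    x_min, x_max = scaled.min(), scaled.max()      # global extrema over all i and j

    def xi(x):
        s = (x - mu) / sigma
        return (s - x_min) / (x_max - x_min) * (2 ** m - eps)   # lands in [0, 2^m)
    return xi

def nu(z):
    """Neighborhood nu(z): the unit hypercube [floor(z_j), floor(z_j) + 1) per dimension."""
    lo = np.floor(z)
    return lo, lo + 1.0
```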

The number of elements included in the domain of \(y_j\), which is the jth element of y, is denoted by \(M_j\ (j=1,\ldots ,v)\). Let \(\phi \) be a perfect hash function that receives y as a key and returns a hash value. Without loss of generality, the range of \(\phi \) is assumed to be \(\left\{ 1, \ldots , M\right\} \), where \(M=\prod _{j=1}^v M_j\), because y can take at most M distinct values by definition. If a dataset does not contain categorical attributes, it is assumed that \(M=1\) and \(\phi (\cdot )=1\).

3.1 Initial region vector

The initial region vector G is a set of u-dimensional regions inside \(\mathcal {H}\) and the jth region \(G_j\) subsumes all the projected data \(z^{(i)}=\xi \left( x^{(i)}\right) \) such that \(\phi \left( y^{(i)}\right) =j\) for \(i=1,\ldots ,N\), defined as

$$\begin{aligned} G = \left\{ G_1, \ldots , G_M\right\} , \quad G_j = \bigcup _{i=1, \ldots , N; \phi \left( y^{(i)}\right) =j} \nu \left( z^{(i)}\right) . \nonumber \end{aligned}$$

For example, consider a dataset that has \(u=2\) continuous attributes and no categorical attribute. We set \(m=3\); then the dataset is projected into \(\mathcal {H}=[0, 2^3)^2\) by \(\xi \). The projected data z are denoted by “\(\times \)”-marks and the initial region G is shown as the gray region in Fig. 1. Note that we refer to the initial region vector G simply as the initial region when the dataset has no categorical attribute, because \(G=G_1\) in this case.

Fig. 1

“\(\times \)”-marks denote a normalized dataset and the gray region is the initial region G
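Before turning to the Boolean encoding, note that the initial region vector can be represented naively as one set of occupied unit cells per hash value. The following sketch, in which the perfect hash is built ad hoc by enumerating the distinct categorical tuples, is meant only to illustrate the definition, not the BDD-based representation used by the method.

```python
import numpy as np
from collections import defaultdict

def initial_region_vector(Z, Y):
    """G[j]: set of occupied unit cells (lower corners) for hash value j = phi(y)."""
    keys = sorted({tuple(y) for y in Y}) or [()]      # distinct categorical tuples
    phi = {key: j + 1 for j, key in enumerate(keys)}  # ad hoc perfect hash into {1, ..., M}
    G = defaultdict(set)
    for z, y in zip(Z, Y):
        cell = tuple(np.floor(z).astype(int))         # nu(z), identified by its lower corner
        G[phi[tuple(y)]].add(cell)
    return G, phi
```

For the dataset of Fig. 1, which has no categorical attributes, G contains the single entry \(G_1\), whose cells are the gray unit squares; in general, \(vol(G_j)\) equals the number of cells stored for j.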

3.2 Initial Boolean function

In the initial region vector method, the initial region vector G is expressed as a Boolean function. Special coding functions CodeZ and CodeY are introduced to convert every datum in D into a logical formula.

Definition 1

CodeZ is a function that codes a datum \(z = (z_1, \ldots , z_u) \in \mathcal {H}\) into a logical formula. Each \(z_j\) is first truncated to an integer, and then coded in the manner of an unsigned-integer-type coding by using m Boolean variables \(\{b_{j1}, \ldots , b_{jm}\}\), where \(b_{j1}\) is the most significant bit and \(b_{jm}\) is the least significant bit.

For example, the Boolean variables \(b_{11}, \ldots , b_{23}\) that are used to code \(z_1\) and \(z_2\) are shown in Fig. 1. The datum \(z^{(i)}\) in the region \([5, 6) \times [2, 3)\) in Fig. 1 is first truncated to (5, 2), and then coded by CodeZ as follows:

$$\begin{aligned} \mathtt{CodeZ}(z^{(i)}) = \left( b_{11} \wedge \bar{b}_{12} \wedge b_{13} \right) \wedge \left( \bar{b}_{21} \wedge b_{22} \wedge \bar{b}_{23} \right) , \end{aligned}$$
(2)

where \(\bar{b}\) denotes the negation of b.

Definition 2

CodeY is a function that codes categorical data y into a logical formula, which is defined on L Boolean variables \(\{d_1, \ldots , d_L\}\). For convenience of explanation, we define \(\mathtt{CodeY}(y)=1\) and \(L=0\) if the given dataset has no categorical attribute.

For example, suppose that \(y=(y_1, y_2)\) and that the domains of \(y_1\) and \(y_2\) are \(\{\text {blue}, \text {red}, \text {yellow}\}\) and \(\{\text {ON}, \text {OFF}\}\), respectively. By setting \(L=3\), the elements of the domains of \(y_1\) and \(y_2\) can be coded as \(\{d_1 \wedge d_2,\ d_1 \wedge \overline{d_2},\ \overline{d_1} \wedge d_2\}\) and \(\{d_3, \overline{d_3}\}\), respectively. Then, \(\mathtt{CodeY}(y) = d_1 \wedge \overline{d_2} \wedge d_3\) for \(y=(\text {red}, \text {ON})\). The sets of Boolean variables used to code z and y are denoted by \(\mathbb {B}\) and \(\mathbb {D}\), respectively:

$$\begin{aligned} \mathbb {B}=\left\{ b_{jk}\ | j=1, \ldots , u;\ k=1, \ldots , m\right\} , \quad \mathbb {D}=\left\{ d_{j}\ | j=1, \ldots , L\right\} . \end{aligned}$$

We assume \(\mathbb {D}=\emptyset \) if the dataset has no categorical attribute.
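Definitions 1 and 2 amount to fixed-width binary codes. The following sketch, whose variable names and codebook layout are illustrative only, returns each code as a list of (variable, value) pairs, i.e., as a conjunction of literals.

```python
def code_z(z, m):
    """CodeZ: truncate each z_j and code it with m bits, b_{j1} being the most significant."""
    literals = []
    for j, zj in enumerate(z, start=1):
        bits = format(int(zj), f"0{m}b")                        # unsigned integer coding
        literals += [(f"b{j}{k}", bit == "1") for k, bit in enumerate(bits, start=1)]
    return literals

def code_y(y, codebooks):
    """CodeY: concatenate a fixed code (list of (d variable, value) pairs) per attribute."""
    return [lit for j, yj in enumerate(y) for lit in codebooks[j][yj]]

# The datum of Eq. (2): z in [5, 6) x [2, 3) with m = 3.
print(code_z((5.4, 2.7), 3))
# [('b11', True), ('b12', False), ('b13', True), ('b21', False), ('b22', True), ('b23', False)]

# The example above: y = (red, ON) with L = 3.
codebooks = [{"blue": [("d1", True), ("d2", True)],
              "red": [("d1", True), ("d2", False)],
              "yellow": [("d1", False), ("d2", True)]},
             {"ON": [("d3", True)], "OFF": [("d3", False)]}]
print(code_y(("red", "ON"), codebooks))    # [('d1', True), ('d2', False), ('d3', True)]
```

The first call reproduces the literals of Eq. (2); the second reproduces \(\mathtt{CodeY}(y)\) for \(y=(\text {red}, \text {ON})\).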

The initial Boolean function F is given as a disjunction of logical formulas that are calculated from the dataset D with CodeZ and CodeY as follows:

$$\begin{aligned} F = \bigvee _{i=1, \ldots , N} \left( \mathtt{CodeY}(y^{(i)}) \wedge \mathtt{CodeZ}(z^{(i)})\right) . \end{aligned}$$
(3)

The initial Boolean function F is informationally equivalent to the initial region vector G (Kutsuna and Yamamoto 2014b). We introduce the decoding function R defined as follows.

Definition 3

The decoding function R is a function that decodes a Boolean formula defined on \(\mathbb {B}\) into the corresponding u-dimensional region in \(\mathcal {H}\).

In particular, \(R(1) = \mathcal {H}\) and \(R\left( F \wedge \mathtt{CodeY}(y^{(i)})\right) = G_{\phi (y^{(i)})}\). Note that \(F \wedge \mathtt{CodeY}(y^{(i)})\) can be viewed as a Boolean formula defined on \(\mathbb {B}\), because variables in \(\mathbb {D}\) are fixed to \(\mathtt{CodeY}(y^{(i)})\), although F is defined on \(\left\{ \mathbb {B}, \mathbb {D}\right\} \).

The initial Boolean function can be explained from an alternative point of view as follows: In Eq. (3), \(\mathtt{CodeY}(y^{(i)}) \wedge \mathtt{CodeZ}(z^{(i)})\) can be viewed as a bit string that encodes the ith data point. The initial Boolean function F itself is then just a list of the bit strings describing the points in the dataset. Furthermore, \(vol(R(F \wedge \mathtt{CodeY}(y^{(i)})))\) can be interpreted as the number of bit strings in F whose categorical attributes are encoded by \(\mathtt{CodeY}(y^{(i)})\).
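Under this view, the following minimal sketch (using Python bit strings rather than a BDD, and assuming that distinct data points receive distinct codes) makes both interpretations explicit.

```python
def initial_boolean_function(coded_data):
    """F viewed as the set of bit strings CodeY(y) followed by CodeZ(z), one per datum."""
    return {y_bits + z_bits for y_bits, z_bits in coded_data}

def vol_for_category(F, y_bits):
    """vol(R(F & CodeY(y))): number of stored bit strings whose categorical prefix is y_bits."""
    return sum(bits.startswith(y_bits) for bits in F)
```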

3.3 BDD representation of the initial Boolean function

A binary decision diagram (BDD) is a compressed representation of a Boolean function and has the feature that most logical operations between BDDs can be implemented by polynomial-time algorithms (Bryant 1986). In the initial region vector method, BDDs are employed to efficiently construct and represent the initial Boolean function F. When constructing F as a BDD, the order of the Boolean variables is set so as to satisfy the following constraint:

$$\begin{aligned} \left[ d_1, \ldots , d_L\right] \prec \left[ b_{11},\ldots ,b_{u1}\right] \prec \ldots \prec \left[ b_{1m},\ldots ,b_{um}\right] , \end{aligned}$$

where the variables inside the square brackets can be in an arbitrary order.

Fig. 2

A BDD that represents the initial Boolean function F

For example, Fig. 2 shows a BDD that represents the initial Boolean formula F obtained from the dataset in Fig. 1. In Fig. 2, square nodes, ellipsoidal nodes, and double-squared nodes are referred to as terminal nodes, variable nodes, and function nodes, respectively. Variable nodes are labeled with "a", "b", ..., "s", and their corresponding Boolean variables are shown on the left side. Solid lines, dashed lines, and dotted lines indicate true edges, false edges, and complement edges, respectively. A path from a function node to the terminal 1 corresponds to a conjunction of literals. If the path contains an even number of complement edges, the conjunction is included in the function. For example, the path that goes from the function node F to the terminal 1 through nodes {a, b, d, h, p} corresponds to the following conjunction:

$$\begin{aligned} b_{11} \wedge b_{21} \wedge \bar{b}_{12} \wedge \bar{b}_{22} \wedge \bar{b}_{13}. \end{aligned}$$
(4)

Because the path contains two complement edges, the conjunction in (4) is included in F. Moreover, the conjunction in (4) corresponds to the region

$$\begin{aligned} R\left( b_{11} \wedge b_{21} \wedge \bar{b}_{12} \wedge \bar{b}_{22} \wedge \bar{b}_{13}\right) = [4,5) \times [4, 6), \end{aligned}$$

which is included in the initial region G as shown in Fig. 1.

A subgraph of a BDD, which is rooted from a node in the BDD, also represents a Boolean function. Let \(\alpha \) be a node in the BDD. We denote the Boolean function represented by the subgraph rooted from \(\alpha \) as \(\mathcal {N}_{\alpha }^+\). Further, we denote the complement of \(\mathcal {N}_{\alpha }^+\) as \(\mathcal {N}_{\alpha }^-\). For example, the BDD representations of \(\mathcal {N}_\text {e}^-\) and \(\mathcal {N}_\text {g}^+\) for the BDD nodes "e" and "g" in Fig. 2 are shown in Fig. 3. From Fig. 3, we can see that F is also expressed as \(\mathcal {N}_\text {a}^-\). Both \(\mathcal {N}_\mathrm{{e}}^-\) and \(\mathcal {N}_\mathrm{{g}}^+\) represent regions inside \(\mathcal {H}\), which are expressed as \(R\left( \mathcal {N}_\mathrm{{e}}^-\right) \) and \(R\left( \mathcal {N}_\mathrm{{g}}^+\right) \), respectively. Figure 4 shows \(R\left( \mathcal {N}_\mathrm{{e}}^-\right) \) and \(R\left( \mathcal {N}_\mathrm{{g}}^+\right) \). An important observation is that \(R\left( \mathcal {N}_\mathrm{{e}}^-\right) \) is identical to the initial region G within the region \([4,8) \times [0, 4)\). Furthermore, node "e" is reached from the function node F through the path {F, a, b, e} in Fig. 2, which indicates \(b_{11} \wedge \bar{b}_{21}\) and corresponds to the region \([4,8) \times [0, 4)\). Likewise, \(R\left( \mathcal {N}_\mathrm{{g}}^+\right) \) is identical to the initial region G within \([0,4) \times [0, 4)\).

Fig. 3

BDD representation of \(\mathcal {N}_\text {e}^-\) and \(\mathcal {N}_\text {g}^+\)

Fig. 4

Regions represented by \(\mathcal {N}_\text {e}^-\) and \(\mathcal {N}_\text {g}^+\). a \(R\left( \mathcal {N}_\text {e}^-\right) \), b \(R\left( \mathcal {N}_\text {g}^+\right) \)

4 Proposed method

4.1 Outline

In the proposed method, we calculate the leave-one-out density with the normalized data \(z^{(i)} = \xi \left( x^{(i)}\right) \) as

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, \mathcal {C}\right) = \frac{\left| \left\{ \left( z^{(j)}, y^{(j)}\right) \ \bigl | \ z^{(j)} \in \mathcal {C},\ y^{(j)} = y^{(i)},\ j \in \mathcal {I} {\setminus } i\right\} \right| }{vol\left( \mathcal {C}\right) }, \end{aligned}$$
(5)

where \(\mathcal {C}\) represents a region such that \(z^{(i)} \in \mathcal {C}\). The outlier score is given as

$$\begin{aligned} S\left( i\right) = \max _{\mathcal {C} \in \tilde{\mathcal {C}}\left( i\right) } \rho _\mathrm{{LOO}}\left( i, \mathcal {C}\right) , \end{aligned}$$
(6)

where \(\tilde{\mathcal {C}}(i)\) is a set of hypercubes defined as

$$\begin{aligned} \begin{array}{l} \displaystyle {\tilde{\mathcal {C}}\left( i\right) = \left\{ R\left( \mathcal {L}_l(i)\right) \ |\ l=0, \ldots , \lambda \right\} }, \\ \displaystyle {\mathcal {L}_0(i) = 1,\quad \mathcal {L}_l(i) = \bigwedge _{j=1, \ldots , u; k=1, \ldots , l} \tilde{b}_{jk}^{(i)} \quad \left( l=1, \ldots , \lambda \right) } \end{array}, \end{aligned}$$
(7)

where \(\lambda \) is an integer parameter such that \(1\le \lambda \le m\), which controls the number of hypercubes in \(\tilde{\mathcal {C}}(i)\), and \(\tilde{b}_{jk}^{(i)}\) denote the literals in the Boolean encoding of \(z^{(i)}\) obtained by CodeZ. For example,

$$\begin{aligned} \left( \tilde{b}_{11}^{(i)}, \tilde{b}_{21}^{(i)}, \tilde{b}_{12}^{(i)}, \tilde{b}_{22}^{(i)}, \tilde{b}_{13}^{(i)}, \tilde{b}_{23}^{(i)}\right) = \left( b_{11}, \bar{b}_{21}, \bar{b}_{12}, b_{22}, b_{13}, \bar{b}_{23}\right) \end{aligned}$$
(8)

for \(z^{(i)}\) in the region \([5, 6) \times [2, 3)\) shown in Fig. 1. From (7), the set \(\tilde{\mathcal {C}}(i)\) for this datum is derived as

$$\begin{aligned} \tilde{\mathcal {C}}(i) = \left\{ [0, 8) \times [0, 8), \ [4, 8) \times [0, 4), \ [4, 6) \times [2, 4), \ [5, 6) \times [2, 3)\right\} \end{aligned}$$

for \(\lambda =3\), which are shown as bold squares in Fig. 6. If we set \(\lambda =1\), then we have

$$\begin{aligned} \tilde{\mathcal {C}}(i) = \left\{ [0, 8) \times [0, 8), \ [4, 8) \times [0, 4)\right\} . \end{aligned}$$

Note that smaller hypercubes are excluded from \(\tilde{\mathcal {C}}(i)\) as we set \(\lambda \) smaller.
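In other words, \(\mathcal {L}_l(i)\) fixes the l most significant bits of every coordinate, so \(R(\mathcal {L}_l(i))\) is the axis-aligned hypercube of side length \(2^{m-l}\) that contains \(z^{(i)}\). The following sketch enumerates these hypercubes directly from the integer coordinates; it is an illustration only, with illustrative names.

```python
import numpy as np

def candidate_hypercubes(z, m, lam):
    """R(L_0(i)), ..., R(L_lambda(i)) as (lower, upper) corner pairs for a normalized datum z."""
    cell = np.floor(np.asarray(z, dtype=float)).astype(int)   # integer cell of z
    boxes = []
    for l in range(lam + 1):
        side = 2 ** (m - l)                # fixing the l leading bits of each coordinate
        lo = (cell // side) * side
        boxes.append((lo, lo + side))
    return boxes

# The datum in [5, 6) x [2, 3) of Fig. 1, with m = 3 and lambda = 3 (cf. Fig. 6):
for lo, hi in candidate_hypercubes((5.0, 2.0), 3, 3):
    print(lo, hi)   # [0 0] [8 8], then [4 0] [8 4], [4 2] [6 4], [5 2] [6 3]
```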

We assume that there are no duplicates in D; that is, at least one of the attributes takes a different value for \(\left( x^{(i)}, y^{(i)}\right) \) and \(\left( x^{(j)}, y^{(j)}\right) \) whenever \(i \ne j\). Then, we can construct the initial region vector G so that each datum in D is allocated to a distinct unit hypercube by setting m to a sufficiently large value and using an appropriate example normalizer. In this case, the volume of \(G_j \in G\) is equal to the number of data elements in D such that \(\phi \left( y^{(i)}\right) =j\) for \(j=1, \ldots , M\), because each datum in D is encoded as a unit hypercube in \(G_j\). The leave-one-out density defined in (5) can then be calculated on the basis of the initial region vector G as follows for \(\mathcal {C} \in \tilde{\mathcal {C}}(i)\):

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, \mathcal {C}\right) = \frac{vol\left( G_\mathrm{{LOO}}\left( i\right) \cap \mathcal {C} \right) }{vol\left( \mathcal {C}\right) }, \end{aligned}$$
(9)

where \(G_\mathrm{{LOO}}\) denotes the leave-one-out region defined as

$$\begin{aligned} G_\mathrm{{LOO}}\left( i\right) = G_{\phi \left( y^{(i)}\right) } {\setminus } \nu \left( z^{(i)}\right) . \end{aligned}$$
(10)

For example, the left side of Fig. 5 illustrates the initial region and the right side illustrates the leave-one-out region for the datum denoted with the “\(\times \)”-mark. Further, Fig. 6 illustrates the calculation of the leave-one-out density for this datum.

Fig. 5

The initial region G (left) and the leave-one-out region \(G_\mathrm{{LOO}}(i)\) for the datum denoted with the “\(\times \)”-mark (right)

Fig. 6

Bold squares show \(R\left( \mathcal {L}_l(i)\right) \) for \(l=0, \ldots , 3\), in which the “\(\times \)”-mark denotes the datum i. The leave-one-out density of the datum is \(\frac{18}{64}\), \(\frac{6}{16}\), \(\frac{1}{4}\), and \(\frac{0}{1}\) for \(l=0,1,2\), and 3, respectively. The outlier score of the datum is \(\max \left( \frac{18}{64}, \frac{6}{16}, \frac{1}{4}, \frac{0}{1}\right) \approx 0.38\) when \(\lambda =3\)

4.2 BDD-based implementation

4.2.1 Calculating the number of minterms

A minterm is a conjunction of literals in which each Boolean variable in the domain appears once. Let \(\#_{A}\) be a function that returns the number of minterms of a Boolean function on the assumption that the domain of the function is A. For example, \(\#_{\left\{ a\right\} }(1)=2\) and \(\#_{\left\{ a, b\right\} }(1)=4\) where a and b are Boolean variables. In the initial region method, the following lemma holds.

Lemma 1

For a Boolean function F that is defined on \(\mathbb {B}\), it holds that

$$\begin{aligned} vol\left( R\left( F\right) \right) = \#_{\mathbb {B}} \left( F\right) . \end{aligned}$$

Proof

Since a minterm represents a unit hypercube in the initial region method, the number of minterms is equal to the volume of the region that F represents. \(\square \)

Assume that a BDD that represents the initial Boolean function F is given. It is possible to efficiently calculate the number of minterms of \(\mathcal {N}_{\alpha }^+\) and \(\mathcal {N}_{\alpha }^-\) for each node \(\alpha \) in the BDD by using the following equations in a depth-first manner (Somenzi 1999):

$$\begin{aligned} \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\alpha }^+\right)&= {\left\{ \begin{array}{@{}l@{\quad }l@{}} 2^{mu+L} &{} \text {if}\,\alpha \,\text {is the terminal~1,} \\ \left\{ \begin{array}{@{}l@{}} \left( \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\beta }^+\right) + \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\gamma }^-\right) \right) /2 \\ \left( \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\beta }^+\right) + \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\gamma }^+\right) \right) /2 \end{array}\right. &{} \begin{array}{@{}l@{}} \text {if } e=1, \\ \text {if } e=0, \end{array} \end{array}\right. } \\ \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\alpha }^-\right)&= {\left\{ \begin{array}{@{}l@{\quad }l@{}} 0 &{} \text {if}\,\alpha \,\text {is the terminal~1,} \\ \left\{ \begin{array}{@{}l@{}} \left( \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\beta }^-\right) + \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\gamma }^+\right) \right) /2 \\ \left( \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\beta }^-\right) + \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\gamma }^-\right) \right) /2 \end{array}\right. &{} \begin{array}{@{}l@{}} \text {if } e=1, \\ \text {if } e=0, \end{array} \end{array}\right. } \end{aligned}$$

where \(\beta \) and \(\gamma \) are the true and false children of \(\alpha \), respectively, and \(e=1\) if the edge between \(\alpha \) and \(\gamma \) is complemented and \(e=0\) otherwise. For example, Table 1 shows the number of minterms of \(\mathcal {N}_{\alpha }^+\) and \(\mathcal {N}_{\alpha }^-\) for each node in Fig. 2.

Table 1 The number of minterms of \(\mathcal {N}_{\alpha }^+\) and \(\mathcal {N}_{\alpha }^-\) for each node \(\alpha \) in Fig. 2
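As a sanity check of the recursion above, the following sketch computes the pair \((\#(\mathcal {N}_{\alpha }^+), \#(\mathcal {N}_{\alpha }^-))\) on a toy BDD written as nested tuples ('name', then_child, else_child, else_edge_complemented), with the single terminal 'ONE' and n Boolean variables in the domain. The tuple encoding is an illustrative stand-in; a package such as CUDD performs the same computation with memoization over shared nodes, which is what makes it linear in the number of BDD nodes.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def minterm_counts(node, n):
    """Return (#minterms of N_alpha^+, #minterms of N_alpha^-) over n Boolean variables."""
    if node == 'ONE':                                 # the only terminal; 0 is its complement
        return 2 ** n, 0
    _, then_child, else_child, e = node
    tp, tn = minterm_counts(then_child, n)            # true child  (beta)
    ep, en = minterm_counts(else_child, n)            # false child (gamma)
    if e:                                             # complemented else edge (e = 1)
        return (tp + en) // 2, (tn + ep) // 2
    return (tp + ep) // 2, (tn + en) // 2             # regular else edge (e = 0)

# Toy check over the single variable a: the node below represents f = a.
f = ('a', 'ONE', 'ONE', True)
print(minterm_counts(f, 1))    # (1, 1): a has one minterm, and so does its complement
```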

4.2.2 Calculating the leave-one-out density

Because \(\nu \left( z^{(i)}\right) \subseteq G_{\phi \left( y^{(i)}\right) }\) and \(\nu \left( z^{(i)}\right) \subseteq \mathcal {C}\) for every \(\mathcal {C} \in \tilde{\mathcal {C}}(i)\), it follows from (9) and (10) that the following equation holds for \(\mathcal {C} \in \tilde{\mathcal {C}}(i)\):

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, \mathcal {C}\right) = \frac{vol\left( G_{\phi \left( y^{(i)}\right) } \cap \mathcal {C} \right) - 1}{vol\left( \mathcal {C}\right) } \end{aligned}$$
(11)

By replacing \(\mathcal {C}\) in (11) with \(R\left( \mathcal {L}_l\left( i\right) \right) \), we derive the following equation:

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, R\left( \mathcal {L}_l\left( i\right) \right) \right)&= \frac{vol\left( G_{\phi \left( y^{(i)}\right) } \cap R\left( \mathcal {L}_l\left( i\right) \right) \right) - 1}{vol\left( R\left( \mathcal {L}_l\left( i\right) \right) \right) } \nonumber \\&= \frac{vol\left( R\left( F \wedge \mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \right) \right) - 1}{vol\left( R\left( \mathcal {L}_l\left( i\right) \right) \right) } \end{aligned}$$
(12)

From Lemma 1, (12) is transformed as follows:

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, R\left( \mathcal {L}_l\left( i\right) \right) \right)&= \frac{\#_{\mathbb {B}}\left( F \wedge \mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \right) - 1}{\#_{\mathbb {B}}\left( \mathcal {L}_l\left( i\right) \right) } \nonumber \\&= \frac{\#_{\mathbb {B}}\left( F \wedge \mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \right) - 1}{2^{(m-l)u}} \end{aligned}$$
(13)

Lemma 2

In the BDD that represents F, let \(\alpha _{i, l}\) be the node that can be reached from the function node F through the path defined by \(\mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \). Let c be the number of complement edges on the path. Then, it holds that

$$\begin{aligned} \#_{\mathbb {B}}\left( F \wedge \mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \right) = {\left\{ \begin{array}{ll} \displaystyle {\frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\alpha _{i, l}}^+\right) }{2^{L+lu}}} &{} \mathrm{{if}}\,c\, \mathrm{{is\,even}}, \\ \displaystyle {\frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\alpha _{i, l}}^-\right) }{2^{L+lu}}} &{} \mathrm{{if}}\,c\, \mathrm{{is\,odd}}. \end{array}\right. } \end{aligned}$$

Proof

\(F \wedge \mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \) means that the \(L+lu\) Boolean variables that appear in \(\mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \) are fixed to specific values in F. On the other hand, \(\mathcal {N}_{\alpha _{i, l}}^+\) (\(\mathcal {N}_{\alpha _{i, l}}^-\)) means that the \(L + lu\) Boolean variables in \(\mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \) are smoothed (see Footnote 1) in F if c is even (odd). Therefore, the number of minterms of \(\mathcal {N}_{\alpha _{i, l}}^+\) (\(\mathcal {N}_{\alpha _{i, l}}^-\)) is \(2^{L+lu}\) times larger than that of \(F \wedge \mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \). \(\square \)

Theorem 1

Leave-one-out density \(\rho _\mathrm{{LOO}}\) is evaluated on the basis of the BDD that represents the initial Boolean function F as follows:

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, R\left( \mathcal {L}_l\left( i\right) \right) \right) = {\left\{ \begin{array}{ll} \displaystyle {\frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\alpha _{i, l}}^+\right) - 2^{L+lu}}{2^{L+mu}}} &{} \mathrm{{if}}\,c\,\mathrm{{is\,even}}, \\ \displaystyle {\frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\alpha _{i, l}}^-\right) - 2^{L+lu}}{2^{L+mu}}} &{} \mathrm{{if}}\,c\,\mathrm{{is\,odd}}. \end{array}\right. } \end{aligned}$$

Proof

It follows from (13) and Lemma 2 immediately. \(\square \)

It is worth mentioning that, by using Theorem 1, we can evaluate the leave-one-out density directly from the initial Boolean function F without any leave-one-out operation. For example, we consider the dataset shown in Fig. 1 and the datum in \([5, 6) \times [2, 3)\). The datum is coded as (8), and the corresponding path in Fig. 2 is \(F\bullet \text {a}\circ \text {b}\circ \text {e}\circ \text {k}\circ \text {q}\circ \text {s}\bullet 1\), where \(\circ \) and \(\bullet \) denote non-complement and complement edges, respectively. The leave-one-out density of the datum is calculated from Theorem 1 and Table 1 as follows:

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, R\left( \mathcal {L}_0\left( i\right) \right) \right)&= \frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_\mathrm{{a}}^-\right) - 2^{0}}{2^{6}} = \frac{19 - 2^0}{2^6} = \frac{18}{64}, \\ \rho _\mathrm{{LOO}}\left( i, R\left( \mathcal {L}_1\left( i\right) \right) \right)&= \frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_\mathrm{{e}}^-\right) - 2^{2}}{2^{6}} = \frac{28 - 2^2}{2^6} = \frac{6}{16}, \\ \rho _\mathrm{{LOO}}\left( i, R\left( \mathcal {L}_2\left( i\right) \right) \right)&= \frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_\mathrm{{q}}^-\right) - 2^{4}}{2^{6}} = \frac{32 - 2^4}{2^6} = \frac{1}{4}, \\ \rho _\mathrm{{LOO}}\left( i, R\left( \mathcal {L}_3\left( i\right) \right) \right)&= \frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{1}^+\right) - 2^{6}}{2^{6}} = \frac{64 - 2^6}{2^6} = 0. \end{aligned}$$

We can see that these values are equal to those shown in Fig. 6.

4.3 The proposed algorithm and computational complexity

We propose Algorithm 1, which calculates the outlier score of each datum in D.

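Algorithm 1 essentially (i) constructs the initial Boolean function F as a BDD, (ii) computes the minterm counts of Sect. 4.2.1, and (iii) obtains each outlier score by following the corresponding path as in Theorem 1. As an illustration only, the Python sketch below reproduces the same scores by evaluating Eqs. (9)–(11) naively on the occupied unit cells instead of on a BDD; it mirrors the logic but not the efficiency of the algorithm, assumes that duplicates, if any, collapse into the same unit cell, and uses illustrative names throughout.

```python
import numpy as np
from collections import defaultdict

def outlier_scores(X, Y, m=12, lam=8):
    """Scores of Eq. (6): S(i) = max_l rho_LOO(i, R(L_l(i))), via Eqs. (9)-(11)."""
    # 1. Normalize into [0, 2^m)^u (the scaling normalizer of Eq. (1)) and take unit cells.
    scaled = (X - X.mean(axis=0)) / X.std(axis=0)
    Z = (scaled - scaled.min()) / (scaled.max() - scaled.min()) * (2 ** m - 1e-9)
    cells = np.floor(Z).astype(np.int64)

    # 2. Initial region vector: occupied cells grouped by the categorical tuple.
    G = defaultdict(set)
    for c, y in zip(cells, Y):
        G[tuple(y)].add(tuple(c))

    # 3. Score each datum by counting cells of G_{phi(y)} inside each prefix hypercube.
    u = X.shape[1]
    scores = np.empty(len(X))
    for i, (c, y) in enumerate(zip(cells, Y)):
        cells_y = G[tuple(y)]
        best = 0.0
        for l in range(lam + 1):
            side = 2 ** (m - l)
            lo = (c // side) * side
            inside = sum(all(lo[j] <= g[j] < lo[j] + side for j in range(u)) for g in cells_y)
            best = max(best, (inside - 1) / float(side ** u))   # leave one out, Eq. (11)
        scores[i] = best
    return scores
```

On large datasets this sketch is quadratic in N, whereas the BDD-based Algorithm 1 avoids this, as discussed below.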

In Algorithm 1, the time complexity of constructing the initial Boolean function F is approximately \(O(\Phi N)\), where \(\Phi \) denotes the number of nodes of the created BDD, because logical operations between BDDs are, in practice, almost linear in the size of the BDDs (Brace et al. 1990). The size of the created BDD depends on m, u, N, and the characteristics of the dataset; however, it is difficult to specify this relationship analytically. The size of the BDD can be exponentially large in the worst case, but it is compact for the realistic datasets used in our experiments. The time complexity of calculating the number of minterms is \(O(\Phi )\), as mentioned in Sect. 4.2.1. The time complexity of calculating the outlier scores is \(O((L+mu)N)\) because the depth of the BDD is \(L+mu\). Consequently, the time complexity of Algorithm 1 is \(O((\Phi +L+mu)N)\). Therefore, the proposed method can deal with a large dataset efficiently unless the number of Boolean variables and the created BDD become intractably large.

5 Experimental results

We first examine the basic behavior of the proposed method by using synthetic datasets whose data are distributed nonlinearly in a two-dimensional space. Then we investigate the impact of the parameter settings of the proposed method and compare the proposed method with existing methods by using realistic datasets. We implemented the proposed method as a C program with the help of the CUDD package (Somenzi 2012). We refer to the proposed method as ODBDD in this section. In ODBDD, we used \(\xi \) in (1) as the example normalizer. The accuracy of detecting outliers is evaluated in terms of the area under the ROC curve (AUC) (Fawcett 2006). The AUC value ranges from 0 to 1, and a higher AUC value implies better accuracy. The experiment was performed on an Intel Core i7 CPU (3.20 GHz) machine with 64 GB RAM, running Microsoft Windows 7.

5.1 Evaluation with synthetic datasets

We generated a synthetic dataset called the Ten dataset, which consists of two continuous attributes and no categorical attribute. 95% of the data in the Ten dataset are distributed randomly inside the shape of “10”. The remaining 5% are distributed randomly outside the shape and are regarded as outliers. The number of data elements is set to \(10^3\), \(10^4\), \(10^5\), or \(10^6\), and we generated ten datasets randomly for each setting. The computation time and the AUC values are averaged over these datasets in the evaluation. An example of the Ten-\(10^3\) dataset is shown on the left side of Fig. 7, in which the “+”-marks denote the generated outliers.

Fig. 7

An example of the Ten-\(10^3\) dataset in which true outliers are shown as “+”-marks (left). The “\(\times \)”-marks denote the outliers detected by ODBDD (right)

We calculated the outlier score of each datum in the Ten dataset by ODBDD, in which we set \(m=16\) and \(\lambda =16\). Table 2 shows the mean and standard deviation of the computation time and AUC values. We can see that the computation time of ODBDD increases almost linearly with respect to the size of the dataset. Further, ODBDD achieves good AUC values for these datasets. The right side of Fig. 7 shows the outliers detected by ODBDD, in which the 5% of the data with the lowest outlier scores are detected as outliers and plotted as “\(\times \)”-marks.

Table 2 The mean and standard deviation of the computation time (in seconds) and the AUC values for the Ten datasets

5.2 Evaluation with realistic datasets

5.2.1 Datasets

We used 9 datasets from the UCI machine learning repository (Bache and Lichman 2013): Ionosphere, Yeast, Abalone, Magic, Adult, Shuttle, Bank (Moro et al. 2014), Covertype (Blackard and Dean 1999), and Kddcup99. All of these datasets were originally arranged for classification tasks. In order to utilize them for the evaluation of outlier detection algorithms, we carried out the following procedure.

For each dataset except the Kddcup99 dataset, we apply the following steps (a code sketch of the procedure is given after the list):

1. Select the class of the maximum size as the target class. The size of the target class is denoted by \(N_T\).

2. Pick out all the data that belong to the target class from the dataset.

3. Pick out \(N_o = \lceil 0.01N_T \rceil \) data randomly from the classes other than the target class.

4. Combine the data obtained in Step 2 and Step 3 to generate a new dataset, in which the data from Step 3 are considered to be outliers.

5. Repeat Step 3 and Step 4 50 times to generate 50 different datasets.

6. Evaluate the computation time and the AUC value with each generated dataset. The averages over the 50 datasets are used for evaluation.
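A sketch of Steps 1–4 for a single draw, assuming a numeric feature matrix X and a label vector (the function and variable names are illustrative only), might look as follows.

```python
import numpy as np

def make_outlier_dataset(X, labels, rng):
    """One evaluation dataset: all data of the largest class plus ~1% sampled 'outliers'."""
    classes, counts = np.unique(labels, return_counts=True)
    target = classes[np.argmax(counts)]                    # Step 1: class of maximum size
    inliers = X[labels == target]                          # Step 2
    others = X[labels != target]
    n_o = int(np.ceil(0.01 * len(inliers)))                # Step 3: N_o = ceil(0.01 * N_T)
    outliers = others[rng.choice(len(others), size=n_o, replace=False)]
    data = np.vstack([inliers, outliers])                  # Step 4
    truth = np.r_[np.zeros(len(inliers)), np.ones(n_o)]    # 1 marks a generated outlier
    return data, truth

# Steps 5 and 6: repeat the sampling 50 times and average the evaluation metrics.
# rng = np.random.default_rng(0)
# datasets = [make_outlier_dataset(X, labels, rng) for _ in range(50)]
```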

Table 3 shows the target class, \(N_T\), and \(N_o\) for each dataset. The numbers of categorical attributes and continuous attributes included in each dataset are also shown in Table 3.

Table 3 An overview of the UCI datasets used in the experiment

The Kddcup99 dataset consists of network access data, which include both normal accesses and several types of attacks. We set the normal class as the target class in Step 1 of the procedure for the Kddcup99 dataset. The task can then be regarded as a practical problem of discovering attacks in network access log data, in which most of the logs come from normal accesses, whereas some attacks are also included as outliers. Following Yamanishi et al. (2004), we transformed two continuous attributes, src_bytes and dst_bytes, by \(y=\log (x+0.01)\), because they were concentrated around 0. Moreover, we regarded continuous attributes that have fewer than 10 unique values as categorical attributes and removed the attribute num_outbound_cmds since it has only one unique value. In addition to the setting in which all attributes are used, we consider the setting in which only the four attributes {service, duration, src_bytes, dst_bytes} are used, because they are thought to be the most essential attributes (Yamanishi et al. 2004). In Table 3, Kddcup99_4a indicates the setting with these four attributes.

5.2.2 Investigating the impact of the parameter setting

The proposed method has parameters m, which defines the bit length used to code each continuous attribute, and \(\lambda \ (\le m)\), which determines the hypercubes included in \(\tilde{\mathcal {C}}(i)\). As mentioned in Sect. 4.1, the set \(\tilde{\mathcal {C}}(i)\) includes smaller hypercubes as \(\lambda \) is set larger. To examine the impact of these parameters, we evaluated the AUC values and the computation time of ODBDD with various parameter settings.

Figure 8 shows the AUC values obtained with ODBDD, in which we set \(m=8, 12, 16\) and \(\lambda =1,\ldots ,m\) for each m. The first observation from Fig. 8 is that, for most datasets, the AUC value gets larger and converges to a specific value as \(\lambda \) increases. However, the AUC values get worse as \(\lambda \) increases for the Kddcup99 dataset. Such a case can happen when the outliers in a dataset lie close to each other and form small, locally dense clusters. The leave-one-out density of such outliers grows as the hypercube used to evaluate it gets smaller. The outliers generated from the Kddcup99 dataset indeed have very small variance in the original 40 dimensions, since the attacks included in the Kddcup99 dataset are mostly from either the smurf or the neptune class. The second observation from Fig. 8 is that the AUC values mainly depend on \(\lambda \). The setting of m has little impact on the AUC values for these datasets, except for the Shuttle dataset and the Kddcup99_4a dataset, for which the AUC values become relatively lower with \(m=8\) than with \(m=12, 16\). These results suggest that \(m \ge 12\) is a reasonable choice for the proposed method in terms of the AUC values.

Fig. 8

AUC values obtained by ODBDD with \(m=8, 12, 16\) and \(\lambda =1,\ldots ,m\) for each m

Figure 9 shows the computation time of ODBDD with \(m=8, 12, 16\) and \(\lambda =1,\ldots ,m\) for each m. From Fig. 9, we see that the choice of \(\lambda \) has little impact on the computation time of ODBDD. Accordingly, the choice of \(\lambda \) mainly affects the AUC values and not the computation time. On the other hand, from Fig. 9, we find that the computation time of ODBDD increases as m is set larger for most datasets, except for the Ionosphere, Yeast, and Abalone datasets, whose sizes are relatively small. The reason why the computation time increases as m gets larger is that the size of the BDD representing the initial Boolean function F tends to grow as the number of Boolean variables on which F is defined increases. To further investigate the impact of m on the computation time, we conducted experiments with \(m=1,\ldots ,16\), in which \(\lambda \) is fixed to 1. The results are shown in Fig. 10, where the computation time is scaled by the result at \(m=1\) for each dataset. From Fig. 10, we see that the rate of increase of the computation time is moderate with respect to m. The computation time for the Covertype dataset increases rapidly compared to that for the Kddcup99 dataset, even though both datasets include hundreds of thousands of data points with dozens of attributes. A reason for this can be seen in Fig. 11, which illustrates the size of the BDD with respect to the number of data points processed in the construction of the initial Boolean formula F. The size of the BDD for the Covertype dataset grows more rapidly than that for the Kddcup99 dataset. As the BDD becomes large, processing each data point for constructing F gets slower. The growth rate of the BDD depends on the characteristics of the dataset being processed.

Fig. 9

Computation time of ODBDD with \(m=8, 12, 16\) and \(\lambda =1,\ldots ,m\) for each m

Fig. 10

Computation time of ODBDD with \(m=1,\ldots ,16\), in which \(\lambda \) is fixed to 1. Time is scaled by the result at \(m=1\) on each dataset

Fig. 11

The size of the BDD versus the number of processed data points in the construction of the initial Boolean formula F

From the viewpoint of computational efficiency, it is better to set m as small as possible. However, the AUC value gets worse if m is set too small as mentioned above. From Fig. 8 and Fig. 9, it seems that \(m = 12\) is a reasonable choice for the proposed method.

5.2.3 Settings of methods to compare

We compared the proposed method with existing outlier detection methods: the one-class support vector machine (OCSVM), the local outlier factor (LOF), the Orca algorithm, and the isolation forest (iForest). The experiment was conducted in the R environment (R Core Team 2014). The ksvm function in the kernlab package (Karatzoglou et al. 2004) was used for OCSVM, the lofactor function in the DMwR package (Torgo 2010) was used for LOF, the binary implementation provided by Bay (2013) was used for Orca, and the IsolationTrees function in the IsolationForest package (Liu 2009) was used for iForest. In OCSVM, the Gaussian kernel was used and the kernel parameter \(\gamma \) was set to one of the 10%, 50%, and 90% quantiles of the distance between samples (Karatzoglou et al. 2004), which are referred to as \(\gamma _{0.1}\), \(\gamma _{0.5}\), and \(\gamma _{0.9}\), respectively. The parameter \(\nu \) was fixed to \(\nu =0.1\) in OCSVM. In LOF, the number of neighbors was set to either \(k=10\) or \(k=50\). In the execution of Orca, we used the default parameter settings of the binary implementation provided by Bay (2013), including the nearest neighbor parameter \(k=5\), except for the parameter n, which we set to 5% of the size of the dataset to be analyzed. The iForest was executed with the default settings of the IsolationTrees function. In OCSVM, LOF, and iForest, continuous attributes were scaled and categorical attributes were coded by using dummy variables. Moreover, the time required to load each dataset into the R environment using the read.delim function was added to the computation time of OCSVM, LOF, and iForest, because these methods are implemented as R functions. In ODBDD, we set \(m=12\) and \(\lambda =8\).

5.2.4 Comparing the proposed method with other methods

Table 4 shows the computation time for the UCI datasets in Table 3. Note that the numbers in parentheses are results evaluated with only one of the 50 datasets generated by the procedure in Sect. 5.2.1, since the computation time was relatively long in these settings. TO means that the execution on a dataset was not completed within the timeout limit of 48 h. From Table 4, we see that the computation time of the methods other than ODBDD and iForest increases rapidly with respect to the size of the dataset. There is not much difference between ODBDD and iForest in terms of computation time; however, ODBDD seems to work faster than iForest when the dataset is large and has only a few attributes, as with the Kddcup99_4a dataset.

Table 4 Computation time for UCI datasets (in seconds)

Table 5 shows the AUC values for UCI datasets. Parentheses and TO mean the same as in Table 4. In Table 5, bold values indicate the top two highest scores for each dataset.

Table 5 AUC values for UCI datasets

From Table 5, we cannot find a single method that works best for all datasets. LOF achieves better results than the other methods for datasets that are not so large. LOF might work better for large datasets if the parameter k were tuned carefully; however, it is generally very difficult to choose a good parameter value in advance. Note that this is also true for the other methods used in the experiment. ODBDD performs better than the others for datasets that are relatively large. On the other hand, the performance of ODBDD is not as good as that of the others for small datasets. This result suggests that the leave-one-out density used in ODBDD works well as a measure for outlier detection if the dataset is sufficiently large.

As mentioned in Sect. 5.2.1, the experiments with the Kddcup99 dataset correspond to a practical problem of analyzing access logs to find attacks. Table 4 and Table 5 show that we can efficiently and accurately find attacks in the dataset with the proposed method. A valuable finding is that we can achieve better AUC values by using only the 4 attributes than by using the original 40 attributes. Moreover, we can reduce the computation time by using the 4 attributes. It follows from these results that it is very important to carefully select the attributes to be used for outlier detection, in terms of both accuracy and efficiency.

The computation time of Orca in Table 4 might seem much worse than that reported in the literature (Bay and Schwabacher 2003; Ghoting et al. 2008). This is mainly caused by the setting of the parameter n of Orca, which is often set to \(n=30\) in the literature. We instead set n to 5% of the size of the dataset in the experiment, because the AUC values of Orca would be unreasonably low if n were set too small. For reference, we executed Orca with \(n=30\) for the datasets in Table 3. The computation time of Orca was then, for example, 176.2, 96.4, and 492.6 s for the Covertype, Kddcup99_4a, and Kddcup99 datasets, respectively. This shows that Orca is comparable to ODBDD or iForest in terms of computational efficiency when n is small enough, although the AUC values of Orca were then almost 0.5 for all these datasets. This is inevitable because in this case only 30 data points have high outlier scores and the others are assumed to have the same low score.

6 Conclusions and future work

In this work we proposed a novel approach for outlier detection in which the score of being an outlier is defined on the basis of the leave-one-out density. We proposed an algorithm to evaluate the leave-one-out density very efficiently by processing a BDD that represents the given dataset as a logical formula. The proposed method can deal with a large dataset efficiently, because the time complexity is nearly linear unless the created BDD gets intractably huge.

This work can be extended in several ways. The accuracy of the outlier detection may be improved by modifying or enriching the region set \(\tilde{\mathcal {D}}(i)\). It is possible to modify \(\tilde{\mathcal {D}}(i)\) by changing the example normalizer used in the proposed method. If we employ a nonlinear normalizer, a hypercube in a projected space corresponds to a nonlinear region in the original space, which may lead to more precise outlier detection. Further, we can enrich \(\tilde{\mathcal {D}}(i)\) by using various normalizers in parallel and integrating the results. A simple way to generate various normalizers is to incorporate a random rotation into the normalizer used in this work.

The proposed method may suffer from the curse of dimensionality when a dataset has many continuous attributes and the number of data elements is not sufficient, because the leave-one-out density is then zero in almost every subregion of the whole hypercube. A simple solution is to embed some dimension reduction method into the normalizer. Another possible solution is to repeat outlier detection with small subsets of attributes and to combine the results, as proposed in Lazarevic and Kumar (2005).

The leave-one-out density is proposed in this paper as a measure for detecting outliers, in which the data samples around the focused datum are enumerated under the condition that their categorical attributes are identical to those of the focused datum. However, such a condition can be too strong to detect outliers effectively for a dataset that has many categorical attributes, because most of the data points have a very low leave-one-out density if few data points have exactly the same categorical attributes. Another problem arises when we deal with a dataset that is completely categorical: the leave-one-out density is then essentially equivalent to merely examining the cell frequencies in the contingency table. To overcome these limitations, we need to consider how to relax the condition that the categorical attributes be identical in the definition of the leave-one-out density.

The proposed method outputs an outlier score for each datum in the dataset, but it does not provide a systematic way to determine the threshold to which the score is compared when judging whether the datum is an outlier or not. As far as we know, there are few outlier detection methods that provide a systematic way to determine such a threshold. For example, the one-class SVM generally assumes that the threshold is zero when learning a hyperplane. We need to develop a method to determine the threshold if we want to use the proposed method in a completely automatic manner.