1 Introduction

In this paper, we propose a novel and efficient method for outlier detection, which is an important task in data mining and has been applied to many problems such as fraud detection, intrusion detection, and data cleaning (Chandola et al. 2009). The goal of outlier detection is to find unusual data (outliers) in a given dataset, and many kinds of measures have been proposed to detect outliers. In this paper, we regard a datum as an outlier if the leave-one-out density is lower than a given threshold for a set of regions around the datum. The leave-one-out density is the ratio of the number of data elements inside a region to the volume of the region, where the focused datum is removed from the dataset before counting. In general, a leave-one-out method is time consuming because a learning procedure is repeated N times, where N denotes the cardinality of the dataset. However, we propose a method that evaluates the leave-one-out density efficiently without repeating a learning procedure N times.

We extend the work of Kutsuna (2010) to outlier detection by introducing the notion of leave-one-out density and developing an efficient algorithm to evaluate it. We employ the initial region method proposed by Kutsuna, in which a dataset is encoded into a Boolean formula and represented as a binary decision diagram. Kutsuna proposed a method that overapproximates the training dataset, that is, estimates a region that encloses the training dataset, by manipulating the binary decision diagram, and thereby learns a one-class classifier. Although one-class classification and outlier detection are very similar tasks, the method proposed by Kutsuna is not applicable to outlier detection, because the classifier is estimated as an overapproximation of the training dataset and therefore never classifies a datum in the training dataset as an outlier.

The proposed method is compared with other well-known outlier detection methods, namely the one-class support vector machine (Schölkopf et al. 2001), the local outlier factor (Breunig et al. 2000), the Orca algorithm (Bay and Schwabacher 2003), and the isolation forest (Liu et al. 2008), on benchmark datasets. The experimental results show that both the computation time and the outlier detection accuracy of the proposed method are better than those of the other methods for datasets that consist of both continuous and categorical attributes and have a large number of samples.

The rest of the paper is organized as follows: The outlier detection problem addressed in this paper is formally defined in Sect. 2. We review the initial region method in Sect. 3. In Sect. 4, the outline of the proposed method is stated, and then its efficient implementation based on binary decision diagrams is proposed. Section 5 shows experimental results. In Sect. 6, we conclude this paper by discussing the limitations of this study and our future work.

This paper extends the results reported in a conference paper (Kutsuna and Yamamoto 2014a) in the following ways: We describe how to deal with categorical attributes in the proposed method, which was not detailed in the conference version. We first modify the problem setting in Sect. 2 to include categorical attributes, and then explain the proposed method with both continuous and categorical attributes throughout the paper, including the proofs in Sect. 4.2.2. In Sect. 4.1, we introduce a new parameter \(\lambda \) to improve the performance of the proposed method. The impact of this parameter is experimentally examined in Sect. 5.2.2. In Sects. 5.2.3 and 5.2.4, the experiments comparing the proposed method with other methods are re-conducted with additional existing methods and datasets. We have also enriched Sects. 1.1, 3, 4.2.1, 5.2.1, and 6 with extra explanations and illustrations.

1.1 Related work

Kutsuna (2010) proposed a one-class classifier that overapproximates a training dataset. The approximation is done quite efficiently by manipulating a binary decision diagram that is obtained by encoding the training dataset. The situation considered in Kutsuna (2010) is that both a training dataset and a test dataset are given: A classifier is first learned from the training dataset, and then the test dataset is classified by the classifier. It may seem that we can detect outliers in a given dataset by using the dataset as both the training dataset and the test dataset simultaneously. However, no datum is detected as an outlier in such a setting, because the classifier is estimated as an overapproximation of the training dataset. Therefore, the method in Kutsuna (2010) cannot be applied to outlier detection directly.

For outlier detection, the Mahalanobis distance (Mahalanobis 1936) between the center of a dataset and each datum is often used as a score of the datum being an outlier. The Mahalanobis distance is a classical measure of distance between data points in which the correlation of the dataset is taken into account. The Mahalanobis distance works well if the dataset fits a normal distribution well, but this condition seldom holds in practice.

Schölkopf et al. (2001) extended the support vector machine (SVM), which was originally invented for binary classification, to outlier detection. Their method estimates a hyperplane that separates the origin and the dataset with maximum margin, and the hyperplane can become nonlinear by introducing kernel functions. The data that are classified to the origin side are detected as outliers. The SVM has the advantage that various nonlinear decision boundaries can be estimated by changing kernel parameters. A cross-validation method is often used to tune parameters in the SVM. Unfortunately, such a tuning method is not available in the case of outlier detection, because there is no labeled training dataset. Therefore, parameter tuning is a critical issue in applying the SVM to outlier detection. Some heuristics have been proposed to tune kernel parameters, such as Karatzoglou et al. (2004).

Breunig et al. (2000) proposed the local outlier factor (LOF), which is calculated on the basis of the distances to the k-nearest neighbors of each datum and has the advantage that it can detect local outliers, that is, data that are outlying relative to their local neighborhoods. The LOF has been shown to perform very well in realistic problems (Lazarevic et al. 2003). An efficient calculation of the k-nearest neighbors is essential in the LOF. Some techniques have been proposed to accelerate the k-nearest neighbor calculation, such as Beckmann et al. (1990).

Bay and Schwabacher (2003) proposed Orca, which finds outliers based on the distance to the nearest neighbors of each datum. A straightforward algorithm for the distance computations requires quadratic time with respect to the number of data points. Orca computes scores only for the top t outliers and reduces the calculation time by pruning the search space based on the cutoff, which is the lowest score of the top t outliers obtained thus far. Bay and Schwabacher (2003) experimentally showed that Orca works in nearly linear time on real datasets. Ghoting et al. (2008) proposed RBRP, which introduces a pre-processing phase before the nearest neighbor calculation. In the pre-processing phase, the dataset is analyzed using a clustering algorithm and principal component analysis to efficiently find data points that are likely to be near a focused data point. Using the information obtained in the pre-processing phase, RBRP facilitates the pruning of the search space based on the cutoff. Ghoting et al. (2008) report experimental results in which RBRP outperforms Orca in terms of computational efficiency.

Liu et al. (2008) proposed the isolation forest (iForest), which consists of a set of isolation trees. An isolation tree is constructed by recursively and randomly partitioning a subset sampled from the original dataset. A datum is regarded as an outlier if it is isolated closer to the root of the isolation trees on average. iForest has two parameters: the number of isolation trees t and the sampling size \(\psi \). Liu et al. (2008) experimentally observed that iForest works well with relatively small parameter values, e.g., \(t=100\) and \(\psi =256\). Recently, Aryal et al. (2014) proposed a modification of iForest to detect local outliers, in which the score of being an outlier is calculated on the basis of the relative mass, which takes the local data distribution into account.

2 Problem setting

Let (x, y) be a datum where \(x=(x_1,\ldots ,x_u)\) is a vector of continuous attributes and \(y=(y_1,\ldots ,y_v)\) is a vector of categorical attributes. Let D be a dataset that includes N data points, in which the ith datum is denoted by \((x^{(i)}, y^{(i)})\) \((i=1,\ldots ,N)\). We assume that there are no missing values in D and that all the data in D are unlabeled. In this paper, we regard a datum as an outlier if the leave-one-out density is lower than a given threshold for a set of regions around the datum. The leave-one-out density \(\rho _\mathrm{{LOO}}\) of the ith datum is defined as

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, \mathcal {D}\right) = \frac{\left| \left\{ \left( x^{(j)}, y^{(j)}\right) \ \bigl | \ x^{(j)} \in \mathcal {D},\ y^{(j)} = y^{(i)}, \ j \in \mathcal {I} {\setminus } i\right\} \right| }{vol\left( \mathcal {D}\right) }, \end{aligned}$$

where \(\mathcal {D}\) denotes a u-dimensional region such that \(x^{(i)} \in \mathcal {D}\), \(|\cdot |\) represents the cardinality of a set, \(\mathcal {I} = \{1,\ldots ,N\}\), and \(vol\left( \cdot \right) \) indicates the volume of the region. The outlier score S of the ith datum is defined as

$$\begin{aligned} S\left( i\right) = \max _{\mathcal {D} \in \tilde{\mathcal {D}}\left( i\right) } \rho _\mathrm{{LOO}}\left( i, \mathcal {D}\right) , \end{aligned}$$

where \(\tilde{\mathcal {D}}\left( i\right) \) represents a set of regions around \(x^{(i)}\) defined as

$$\begin{aligned} \tilde{\mathcal {D}}\left( i\right) = \left\{ u\text {-dimensional region } \mathcal {D} \ \bigl | \ x^{(i)} \in \mathcal {D}\right\} . \end{aligned}$$

A datum is detected as an outlier if its outlier score is less than the given threshold. Note that, in practice, we do not take \(\tilde{\mathcal {D}}\left( i\right) \) to be the set of all possible regions containing \(x^{(i)}\), but restrict it to a fixed family of regions, which is defined in Sect. 4.1.

The idea behind the proposed method is that, if the ith datum is not an outlier, there must be a region around the datum in which the other data exist with high density under the condition that categorical attributes are identical to the focused datum. On the other hand, if the ith datum is an outlier, it is expected that there is no region around the datum that contains the other data densely.
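For concreteness, the definitions above can be evaluated by brute force once a concrete family of regions is fixed. The following Python sketch does so for axis-aligned boxes; the helper `regions_around` is a user-supplied placeholder standing in for the family \(\tilde{\mathcal {D}}(i)\) of Sect. 4.1, and all names are illustrative rather than part of the proposed method.

```python
import numpy as np

def rho_loo(i, lo, hi, X, Y):
    """Leave-one-out density of datum i for the axis-aligned box [lo, hi)."""
    in_box = np.all((X >= lo) & (X < hi), axis=1)     # continuous attributes inside the box
    same_cat = np.all(Y == Y[i], axis=1)              # categorical attributes identical to datum i
    count = np.count_nonzero(in_box & same_cat) - 1   # leave the focused datum out
    return count / np.prod(hi - lo)

def outlier_score(i, X, Y, regions_around):
    """S(i): the maximum leave-one-out density over the regions around x^(i)."""
    return max(rho_loo(i, lo, hi, X, Y) for lo, hi in regions_around(X[i]))
```

Here X is an (N, u) array of continuous attributes and Y an (N, v) array of categorical codes (v may be 0).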

In contrast to some existing methods, such as LOF and Orca, the proposed method does not require a distance function that defines the distance between a pair of samples in the dataset. Such a feature is preferable because it is generally difficult to design a distance function properly for a dataset that consists of both continuous and categorical attributes.

The choice of the set of regions \(\tilde{\mathcal {D}}(i)\) determines the performance of the proposed method. In this paper, we assume \(\tilde{\mathcal {D}}(i)\) to be a set of u-dimensional hypercubes around the ith datum, which will be described in Sect. 4.1. Experimental results in Sect. 5 show that the proposed method can find outliers efficiently and effectively for a dataset that consists of both continuous and categorical attributes and has a large number of samples.

3 Preliminaries

In this section, we briefly review the initial region method (Kutsuna 2010) and define notation. Let \(\mathcal {H}\) be a u-dimensional hypercube defined as \(\mathcal {H} = [0, 2^m)^{u}\), where m is an arbitrary positive integer. Further, let \(\xi \) be an example normalizer such that \(\xi (x^{(i)}) \in \mathcal {H}\) for every \(x^{(i)}\) in D. An example of \(\xi \) is given in Kutsuna (2010) as the following simple scaling function:

$$\begin{aligned} \xi (x) = \left( \frac{scale(x_1)-x_{min}}{x_{max}-x_{min}} \left( 2^m - \epsilon \right) , \ldots , \frac{scale(x_u)-x_{min}}{x_{max}-x_{min}} \left( 2^m - \epsilon \right) \right) , \end{aligned}$$
(1)

where \(\epsilon \) denotes an arbitrary small positive number and

$$\begin{aligned} scale(x_j)&= \frac{x_j - \mu _j}{\sigma _j} \quad (j=1, \ldots , u), \\ x_{max}&= \max _{i=1, \ldots , N;\ j=1, \ldots , u} \left( scale\left( x^{(i)}_j\right) \right) , \\ x_{min}&= \min _{i=1, \ldots , N;\ j=1, \ldots , u} \left( scale\left( x^{(i)}_j\right) \right) , \end{aligned}$$

where \(x_j\) represents the jth element of x, and \(\mu _j\) and \(\sigma _j\) denote the mean and standard deviation of \(x_j\), respectively, which are calculated from D. We denote the projected datum \(\xi (x)\) as z.

The neighborhood function \(\nu \), which is defined as \(\nu (z) := \left[ \lfloor z_1 \rfloor , \lfloor z_1 \rfloor + 1 \right) \times \cdots \times \left[ \lfloor z_u \rfloor , \lfloor z_u \rfloor + 1 \right) \), where \(\lfloor \cdot \rfloor \) is the floor function, returns a u-dimensional unit hypercube that subsumes \(z =\xi (x)\).
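The scaling normalizer (1) and the neighborhood function \(\nu \) translate directly into code. The following sketch assumes an (N, u) data matrix X and is meant only to mirror the formulas above, not the authors' implementation.

```python
import numpy as np

def make_normalizer(X, m, eps=1e-9):
    """Build the scaling normalizer xi of Eq. (1) from the dataset X (an N x u array)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    scaled = (X - mu) / sigma                      # scale(), applied column-wise
    x_min, x_max = scaled.min(), scaled.max()      # global extrema over all i and j

    def xi(x):
        s = (x - mu) / sigma
        return (s - x_min) / (x_max - x_min) * (2 ** m - eps)   # lands in [0, 2^m)
    return xi

def nu(z):
    """Neighborhood nu(z): the unit hypercube [floor(z_j), floor(z_j) + 1) per dimension."""
    lo = np.floor(z)
    return lo, lo + 1.0
```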

The number of elements included in the domain of \(y_j\), which is the jth element of y, is denoted by \(M_j\ (j=1,\ldots ,v)\). Let \(\phi \) be a perfect hash function that receives y as a key and returns a hash value. Without loss of generality, the range of \(\phi \) is assumed to be \(\left\{ 1, \ldots , M\right\} \), where \(M=\prod _{j=1}^v M_j\), because y can take at most M distinct values by definition. If a dataset does not contain categorical attributes, it is assumed that \(M=1\) and \(\phi (\cdot )=1\).

3.1 Initial region vector

The initial region vector G is a set of u-dimensional regions inside \(\mathcal {H}\) and the jth region \(G_j\) subsumes all the projected data \(z^{(i)}=\xi \left( x^{(i)}\right) \) such that \(\phi \left( y^{(i)}\right) =j\) for \(i=1,\ldots ,N\), defined as

$$\begin{aligned} G = \left\{ G_1, \ldots , G_M\right\} , \quad G_j = \bigcup _{i=1, \ldots , N; \phi \left( y^{(i)}\right) =j} \nu \left( z^{(i)}\right) . \nonumber \end{aligned}$$

For example, consider a dataset that has \(u=2\) continuous attributes and no categorical attribute. We set \(m=3\); then the dataset is projected into \(\mathcal {H}=[0, 2^3)^2\) by \(\xi \). The projected data z are denoted by “\(\times \)”-marks and the initial region G is shown as the gray region in Fig. 1. Note that we refer to the initial region vector G simply as the initial region when the dataset has no categorical attribute, because \(G=G_1\) in this case.

Fig. 1

“\(\times \)”-marks denote a normalized dataset and the gray region is the initial region G
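Before turning to the Boolean encoding, note that the initial region vector can be represented naively as one set of occupied unit cells per hash value. The following sketch, in which the perfect hash is built ad hoc by enumerating the distinct categorical tuples, is meant only to illustrate the definition, not the BDD-based representation used by the method.

```python
import numpy as np
from collections import defaultdict

def initial_region_vector(Z, Y):
    """G[j]: set of occupied unit cells (lower corners) for hash value j = phi(y)."""
    keys = sorted({tuple(y) for y in Y}) or [()]      # distinct categorical tuples
    phi = {key: j + 1 for j, key in enumerate(keys)}  # ad hoc perfect hash into {1, ..., M}
    G = defaultdict(set)
    for z, y in zip(Z, Y):
        cell = tuple(np.floor(z).astype(int))         # nu(z), identified by its lower corner
        G[phi[tuple(y)]].add(cell)
    return G, phi
```

For the dataset of Fig. 1, which has no categorical attributes, G contains the single entry \(G_1\), whose cells are the gray unit squares; in general, \(vol(G_j)\) equals the number of cells stored for j.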

3.2 Initial Boolean function

In the initial region vector method, the initial region vector G is expressed as a Boolean function. Special coding functions CodeZ and CodeY are introduced to convert every datum in D into a logical formula.

Definition 1

CodeZ is a function that codes a datum \(z = (z_1, \ldots , z_u) \in \mathcal {H}\) into a logical formula. Each \(z_j\) is first truncated to an integer, and then coded in the manner of an unsigned-integer-type coding by using m Boolean variables \(\{b_{j1}, \ldots , b_{jm}\}\), where \(b_{j1}\) is the most significant bit and \(b_{jm}\) is the least significant bit.

For example, the Boolean variables \(b_{11}, \ldots , b_{23}\) that are used to code \(z_1\) and \(z_2\) are shown in Fig. 1. The datum \(z^{(i)}\) in the region \([5, 6) \times [2, 3)\) in Fig. 1 is first truncated to (5, 2), and then coded by CodeZ as follows:

$$\begin{aligned} \mathtt{CodeZ}(z^{(i)}) = \left( b_{11} \wedge \bar{b}_{12} \wedge b_{13} \right) \wedge \left( \bar{b}_{21} \wedge b_{22} \wedge \bar{b}_{23} \right) , \end{aligned}$$
(2)

where \(\bar{b}\) denotes the negation of b.

Definition 2

CodeY is a function that codes categorical data y into a logical formula, which is defined on L Boolean variables \(\{d_1, \ldots , d_L\}\). For convenience of explanation, we define \(\mathtt{CodeY}(y)=1\) and \(L=0\) if the given dataset has no categorical attribute.

For example, suppose that \(y=(y_1, y_2)\) and that the domains of \(y_1\) and \(y_2\) are \(\{\text {blue}, \text {red}, \text {yellow}\}\) and \(\{\text {ON}, \text {OFF}\}\), respectively. By setting \(L=3\), the elements of the domains of \(y_1\) and \(y_2\) can be coded as \(\{d_1 \wedge d_2,\ d_1 \wedge \overline{d_2},\ \overline{d_1} \wedge d_2\}\) and \(\{d_3, \overline{d_3}\}\), respectively. Then, \(\mathtt{CodeY}(y) = d_1 \wedge \overline{d_2} \wedge d_3\) for \(y=(\text {red}, \text {ON})\). The sets of Boolean variables used to code z and y are denoted by \(\mathbb {B}\) and \(\mathbb {D}\), respectively:

$$\begin{aligned} \mathbb {B}=\left\{ b_{jk}\ | j=1, \ldots , u;\ k=1, \ldots , m\right\} , \quad \mathbb {D}=\left\{ d_{j}\ | j=1, \ldots , L\right\} . \end{aligned}$$

We assume \(\mathbb {D}=\emptyset \) if the dataset has no categorical attribute.
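Definitions 1 and 2 amount to fixed-width binary codes. The following sketch, whose variable names and codebook layout are illustrative only, returns each code as a list of (variable, value) pairs, i.e., as a conjunction of literals.

```python
def code_z(z, m):
    """CodeZ: truncate each z_j and code it with m bits, b_{j1} being the most significant."""
    literals = []
    for j, zj in enumerate(z, start=1):
        bits = format(int(zj), f"0{m}b")                        # unsigned integer coding
        literals += [(f"b{j}{k}", bit == "1") for k, bit in enumerate(bits, start=1)]
    return literals

def code_y(y, codebooks):
    """CodeY: concatenate a fixed code (list of (d variable, value) pairs) per attribute."""
    return [lit for j, yj in enumerate(y) for lit in codebooks[j][yj]]

# The datum of Eq. (2): z in [5, 6) x [2, 3) with m = 3.
print(code_z((5.4, 2.7), 3))
# [('b11', True), ('b12', False), ('b13', True), ('b21', False), ('b22', True), ('b23', False)]

# The example above: y = (red, ON) with L = 3.
codebooks = [{"blue": [("d1", True), ("d2", True)],
              "red": [("d1", True), ("d2", False)],
              "yellow": [("d1", False), ("d2", True)]},
             {"ON": [("d3", True)], "OFF": [("d3", False)]}]
print(code_y(("red", "ON"), codebooks))    # [('d1', True), ('d2', False), ('d3', True)]
```

The first call reproduces the literals of Eq. (2); the second reproduces \(\mathtt{CodeY}(y)\) for \(y=(\text {red}, \text {ON})\).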

The initial Boolean function F is given as a disjunction of logical formulas that are calculated from the dataset D with CodeZ and CodeY as follows:

$$\begin{aligned} F = \bigvee _{i=1, \ldots , N} \left( \mathtt{CodeY}(y^{(i)}) \wedge \mathtt{CodeZ}(z^{(i)})\right) . \end{aligned}$$
(3)

The initial Boolean function F is informationally equivalent to the initial region vector G (Kutsuna and Yamamoto 2014b). We introduce the decoding function R defined as follows.

Definition 3

The decoding function R is a function that decodes a Boolean formula defined on \(\mathbb {B}\) into the corresponding u-dimensional region in \(\mathcal {H}\).

In particular, \(R(1) = \mathcal {H}\) and \(R\left( F \wedge \mathtt{CodeY}(y^{(i)})\right) = G_{\phi (y^{(i)})}\). Note that \(F \wedge \mathtt{CodeY}(y^{(i)})\) can be viewed as a Boolean formula defined on \(\mathbb {B}\), because variables in \(\mathbb {D}\) are fixed to \(\mathtt{CodeY}(y^{(i)})\), although F is defined on \(\left\{ \mathbb {B}, \mathbb {D}\right\} \).

The initial Boolean function can be explained from an alternative point of view as follows: In Eq. (3), \(\mathtt{CodeY}(y^{(i)}) \wedge \mathtt{CodeZ}(z^{(i)})\) can be viewed as a bit string that encodes the ith data point. The initial Boolean function F itself is then just a list of the bit strings describing the points in the dataset. Furthermore, \(vol(R(F \wedge \mathtt{CodeY}(y^{(i)})))\) can be interpreted as the number of bit strings in F whose categorical attributes are encoded by \(\mathtt{CodeY}(y^{(i)})\).
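Under this view, the following minimal sketch (using Python bit strings rather than a BDD, and assuming that distinct data points receive distinct codes) makes both interpretations explicit.

```python
def initial_boolean_function(coded_data):
    """F viewed as the set of bit strings CodeY(y) followed by CodeZ(z), one per datum."""
    return {y_bits + z_bits for y_bits, z_bits in coded_data}

def vol_for_category(F, y_bits):
    """vol(R(F & CodeY(y))): number of stored bit strings whose categorical prefix is y_bits."""
    return sum(bits.startswith(y_bits) for bits in F)
```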

3.3 BDD representation of the initial Boolean function

A binary decision diagram (BDD) is a compressed representation of a Boolean function and has the feature that most logical operations between BDDs can be implemented by polynomial-time algorithms (Bryant 1986). In the initial region vector method, BDDs are employed to efficiently construct and represent the initial Boolean function F. When constructing F as a BDD, the order of the Boolean variables is set so as to satisfy the following constraint:

$$\begin{aligned} \left[ d_1, \ldots , d_L\right] \prec \left[ b_{11},\ldots ,b_{u1}\right] \prec \ldots \prec \left[ b_{1m},\ldots ,b_{um}\right] , \end{aligned}$$

where the variables inside the square brackets can be in an arbitrary order.

Fig. 2

A BDD that represents the initial Boolean function F

For example, Fig. 2 shows a BDD that represents the initial Boolean formula F obtained from the dataset in Fig. 1. In Fig. 2, square nodes, ellipsoidal nodes, and double-squared nodes are referred to as terminal nodes, variable nodes, and function nodes, respectively. Variable nodes are labeled with "a", "b", ..., "s", and their corresponding Boolean variables are shown on the left side. Solid lines, dashed lines, and dotted lines indicate true edges, false edges, and complement edges, respectively. A path from a function node to the terminal 1 corresponds to a conjunction of literals. If the path contains an even number of complement edges, the conjunction is included in the function. For example, the path that goes from the function node F to the terminal 1 through nodes {a, b, d, h, p} corresponds to the following conjunction:

$$\begin{aligned} b_{11} \wedge b_{21} \wedge \bar{b}_{12} \wedge \bar{b}_{22} \wedge \bar{b}_{13}. \end{aligned}$$
(4)

Because the path contains two complement edges, the conjunction in (4) is included in F. Moreover, the conjunction in (4) corresponds to the region

$$\begin{aligned} R\left( b_{11} \wedge b_{21} \wedge \bar{b}_{12} \wedge \bar{b}_{22} \wedge \bar{b}_{13}\right) = [4,5) \times [4, 6), \end{aligned}$$

which is included in the initial region G as shown in Fig. 1.

A subgraph of a BDD, which is rooted from a node in the BDD, also represents a Boolean function. Let \(\alpha \) be a node in the BDD. We denote the Boolean function represented by the subgraph rooted from \(\alpha \) as \(\mathcal {N}_{\alpha }^+\). Further, we denote the complement of \(\mathcal {N}_{\alpha }^+\) as \(\mathcal {N}_{\alpha }^-\). For example, the BDD representations of \(\mathcal {N}_\text {e}^-\) and \(\mathcal {N}_\text {g}^+\) for the BDD nodes "e" and "g" in Fig. 2 are shown in Fig. 3. From Fig. 3, we can see that F is also expressed as \(\mathcal {N}_\text {a}^-\). Both \(\mathcal {N}_\mathrm{{e}}^-\) and \(\mathcal {N}_\mathrm{{g}}^+\) represent regions inside \(\mathcal {H}\), which are expressed as \(R\left( \mathcal {N}_\mathrm{{e}}^-\right) \) and \(R\left( \mathcal {N}_\mathrm{{g}}^+\right) \), respectively. Figure 4 shows \(R\left( \mathcal {N}_\mathrm{{e}}^-\right) \) and \(R\left( \mathcal {N}_\mathrm{{g}}^+\right) \). An important observation is that \(R\left( \mathcal {N}_\mathrm{{e}}^-\right) \) is identical to the initial region G within the region \([4,8) \times [0, 4)\). Furthermore, node "e" is reached from the function node F through the path {F, a, b, e} in Fig. 2, which indicates \(b_{11} \wedge \bar{b}_{21}\) and corresponds to the region \([4,8) \times [0, 4)\). Likewise, \(R\left( \mathcal {N}_\mathrm{{g}}^+\right) \) is identical to the initial region G within \([0,4) \times [0, 4)\).

Fig. 3

BDD representation of \(\mathcal {N}_\text {e}^-\) and \(\mathcal {N}_\text {g}^+\)

Fig. 4

Regions represented by \(\mathcal {N}_\text {e}^-\) and \(\mathcal {N}_\text {g}^+\). a \(R\left( \mathcal {N}_\text {e}^-\right) \), b \(R\left( \mathcal {N}_\text {g}^+\right) \)

4 Proposed method

4.1 Outline

In the proposed method, we calculate the leave-one-out density with the normalized data \(z^{(i)} = \xi \left( x^{(i)}\right) \) as

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, \mathcal {C}\right) = \frac{\left| \left\{ \left( z^{(j)}, y^{(j)}\right) \ \bigl | \ z^{(j)} \in \mathcal {C},\ y^{(j)} = y^{(i)},\ j \in \mathcal {I} {\setminus } i\right\} \right| }{vol\left( \mathcal {C}\right) }, \end{aligned}$$
(5)

where \(\mathcal {C}\) represents a region such that \(z^{(i)} \in \mathcal {C}\). The outlier score is given as

$$\begin{aligned} S\left( i\right) = \max _{\mathcal {C} \in \tilde{\mathcal {C}}\left( i\right) } \rho _\mathrm{{LOO}}\left( i, \mathcal {C}\right) , \end{aligned}$$
(6)

where \(\tilde{\mathcal {C}}(i)\) is a set of hypercubes defined as

$$\begin{aligned} \begin{array}{l} \displaystyle {\tilde{\mathcal {C}}\left( i\right) = \left\{ R\left( \mathcal {L}_l(i)\right) \ |\ l=0, \ldots , \lambda \right\} }, \\ \displaystyle {\mathcal {L}_0(i) = 1,\quad \mathcal {L}_l(i) = \bigwedge _{j=1, \ldots , u; k=1, \ldots , l} \tilde{b}_{jk}^{(i)} \quad \left( l=1, \ldots , \lambda \right) } \end{array}, \end{aligned}$$
(7)

where \(\lambda \) is an integer parameter such that \(1\le \lambda \le m\), which controls the number of hypercubes in \(\tilde{\mathcal {C}}(i)\), and \(\tilde{b}_{jk}^{(i)}\) denote the literals in the Boolean encoding of \(z^{(i)}\) obtained by CodeZ. For example,

$$\begin{aligned} \left( \tilde{b}_{11}^{(i)}, \tilde{b}_{21}^{(i)}, \tilde{b}_{12}^{(i)}, \tilde{b}_{22}^{(i)}, \tilde{b}_{13}^{(i)}, \tilde{b}_{23}^{(i)}\right) = \left( b_{11}, \bar{b}_{21}, \bar{b}_{12}, b_{22}, b_{13}, \bar{b}_{23}\right) \end{aligned}$$
(8)

for \(z^{(i)}\) in the region \([5, 6) \times [2, 3)\) shown in Fig. 1. From (7), the set \(\tilde{\mathcal {C}}(i)\) for this datum is derived as

$$\begin{aligned} \tilde{\mathcal {C}}(i) = \left\{ [0, 8) \times [0, 8), \ [4, 8) \times [0, 4), \ [4, 6) \times [2, 4), \ [5, 6) \times [2, 3)\right\} \end{aligned}$$

for \(\lambda =3\), which are shown as bold squares in Fig. 6. If we set \(\lambda =1\), then we have

$$\begin{aligned} \tilde{\mathcal {C}}(i) = \left\{ [0, 8) \times [0, 8), \ [4, 8) \times [0, 4)\right\} . \end{aligned}$$

Note that smaller hypercubes are excluded from \(\tilde{\mathcal {C}}(i)\) as we set \(\lambda \) smaller.
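In other words, \(\mathcal {L}_l(i)\) fixes the l most significant bits of every coordinate, so \(R(\mathcal {L}_l(i))\) is the axis-aligned hypercube of side length \(2^{m-l}\) that contains \(z^{(i)}\). The following sketch enumerates these hypercubes directly from the integer coordinates; it is an illustration only, with illustrative names.

```python
import numpy as np

def candidate_hypercubes(z, m, lam):
    """R(L_0(i)), ..., R(L_lambda(i)) as (lower, upper) corner pairs for a normalized datum z."""
    cell = np.floor(np.asarray(z, dtype=float)).astype(int)   # integer cell of z
    boxes = []
    for l in range(lam + 1):
        side = 2 ** (m - l)                # fixing the l leading bits of each coordinate
        lo = (cell // side) * side
        boxes.append((lo, lo + side))
    return boxes

# The datum in [5, 6) x [2, 3) of Fig. 1, with m = 3 and lambda = 3 (cf. Fig. 6):
for lo, hi in candidate_hypercubes((5.0, 2.0), 3, 3):
    print(lo, hi)   # [0 0] [8 8], then [4 0] [8 4], [4 2] [6 4], [5 2] [6 3]
```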

We assume that there are no duplicates in D; that is, at least one of the attributes takes a different value for \(\left( x^{(i)}, y^{(i)}\right) \) and \(\left( x^{(j)}, y^{(j)}\right) \) whenever \(i \ne j\). Then, we can construct the initial region vector G so that each datum in D is allocated to a distinct unit hypercube by setting m to a sufficiently large value and using an appropriate example normalizer. In this case, the volume of \(G_j \in G\) is equal to the number of data elements in D such that \(\phi \left( y^{(i)}\right) =j\) for \(j=1, \ldots , M\), because each datum in D is encoded as a unit hypercube in \(G_j\). The leave-one-out density defined in (5) can then be calculated on the basis of the initial region vector G as follows for \(\mathcal {C} \in \tilde{\mathcal {C}}(i)\):

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, \mathcal {C}\right) = \frac{vol\left( G_\mathrm{{LOO}}\left( i\right) \cap \mathcal {C} \right) }{vol\left( \mathcal {C}\right) }, \end{aligned}$$
(9)

where \(G_\mathrm{{LOO}}\) denotes the leave-one-out region defined as

$$\begin{aligned} G_\mathrm{{LOO}}\left( i\right) = G_{\phi \left( y^{(i)}\right) } {\setminus } \nu \left( z^{(i)}\right) . \end{aligned}$$
(10)

For example, the left side of Fig. 5 illustrates the initial region and the right side illustrates the leave-one-out region for the datum denoted with the “\(\times \)”-mark. Further, Fig. 6 illustrates the calculation of the leave-one-out density for this datum.

Fig. 5

The initial region G (left) and the leave-one-out region \(G_\mathrm{{LOO}}(i)\) for the datum denoted with the “\(\times \)”-mark (right)

Fig. 6

Bold squares show \(R\left( \mathcal {L}_l(i)\right) \) for \(l=0, \ldots , 3\), in which the “\(\times \)”-mark denotes the datum i. The leave-one-out density of the datum is \(\frac{18}{64}\), \(\frac{6}{16}\), \(\frac{1}{4}\), and \(\frac{0}{1}\) for \(l=0,1,2\), and 3, respectively. The outlier score of the datum is \(\max \left( \frac{18}{64}, \frac{6}{16}, \frac{1}{4}, \frac{0}{1}\right) \approx 0.38\) when \(\lambda =3\)

4.2 BDD-based implementation

4.2.1 Calculating the number of minterms

A minterm is a conjunction of literals in which each Boolean variable in the domain appears once. Let \(\#_{A}\) be a function that returns the number of minterms of a Boolean function on the assumption that the domain of the function is A. For example, \(\#_{\left\{ a\right\} }(1)=2\) and \(\#_{\left\{ a, b\right\} }(1)=4\) where a and b are Boolean variables. In the initial region method, the following lemma holds.

Lemma 1

For a Boolean function F that is defined on \(\mathbb {B}\), it holds that

$$\begin{aligned} vol\left( R\left( F\right) \right) = \#_{\mathbb {B}} \left( F\right) . \end{aligned}$$

Proof

Since a minterm represents a unit hypercube in the initial region method, the number of minterms is equal to the volume of the region that F represents. \(\square \)

Assume that a BDD that represents the initial Boolean function F is given. It is possible to efficiently calculate the number of minterms of \(\mathcal {N}_{\alpha }^+\) and \(\mathcal {N}_{\alpha }^-\) for each node \(\alpha \) in the BDD by using the following equations in a depth-first manner (Somenzi 1999):

$$\begin{aligned} \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\alpha }^+\right)&= {\left\{ \begin{array}{@{}l@{\quad }l@{}} 2^{mu+L} &{} \text {if}\,\alpha \,\text {is the terminal~1,} \\ \left\{ \begin{array}{@{}l@{}} \left( \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\beta }^+\right) + \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\gamma }^-\right) \right) /2 \\ \left( \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\beta }^+\right) + \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\gamma }^+\right) \right) /2 \end{array}\right. &{} \begin{array}{@{}l@{}} \text {if } e=1, \\ \text {if } e=0, \end{array} \end{array}\right. } \\ \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\alpha }^-\right)&= {\left\{ \begin{array}{@{}l@{\quad }l@{}} 0 &{} \text {if}\,\alpha \,\text {is the terminal~1,} \\ \left\{ \begin{array}{@{}l@{}} \left( \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\beta }^-\right) + \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\gamma }^+\right) \right) /2 \\ \left( \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\beta }^-\right) + \#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\gamma }^-\right) \right) /2 \end{array}\right. &{} \begin{array}{@{}l@{}} \text {if } e=1, \\ \text {if } e=0, \end{array} \end{array}\right. } \end{aligned}$$

where \(\beta \) and \(\gamma \) are the true and false children of \(\alpha \), respectively, and \(e=1\) if the edge between \(\alpha \) and \(\gamma \) is complemented and \(e=0\) otherwise. For example, Table 1 shows the number of minterms of \(\mathcal {N}_{\alpha }^+\) and \(\mathcal {N}_{\alpha }^-\) for each node in Fig. 2.

Table 1 The number of minterms of \(\mathcal {N}_{\alpha }^+\) and \(\mathcal {N}_{\alpha }^-\) for each node \(\alpha \) in Fig. 2
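As a sanity check of the recursion above, the following sketch computes the pair \((\#(\mathcal {N}_{\alpha }^+), \#(\mathcal {N}_{\alpha }^-))\) on a toy BDD written as nested tuples ('name', then_child, else_child, else_edge_complemented), with the single terminal 'ONE' and n Boolean variables in the domain. The tuple encoding is an illustrative stand-in; a package such as CUDD performs the same computation with memoization over shared nodes, which is what makes it linear in the number of BDD nodes.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def minterm_counts(node, n):
    """Return (#minterms of N_alpha^+, #minterms of N_alpha^-) over n Boolean variables."""
    if node == 'ONE':                                 # the only terminal; 0 is its complement
        return 2 ** n, 0
    _, then_child, else_child, e = node
    tp, tn = minterm_counts(then_child, n)            # true child  (beta)
    ep, en = minterm_counts(else_child, n)            # false child (gamma)
    if e:                                             # complemented else edge (e = 1)
        return (tp + en) // 2, (tn + ep) // 2
    return (tp + ep) // 2, (tn + en) // 2             # regular else edge (e = 0)

# Toy check over the single variable a: the node below represents f = a.
f = ('a', 'ONE', 'ONE', True)
print(minterm_counts(f, 1))    # (1, 1): a has one minterm, and so does its complement
```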

4.2.2 Calculating the leave-one-out density

Because \(\nu \left( z^{(i)}\right) \subseteq G_{\phi \left( y^{(i)}\right) }\) and \(\nu \left( z^{(i)}\right) \subseteq \mathcal {C}\) for every \(\mathcal {C} \in \tilde{\mathcal {C}}(i)\), it follows from (9) and (10) that the following equation holds for \(\mathcal {C} \in \tilde{\mathcal {C}}(i)\):

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, \mathcal {C}\right) = \frac{vol\left( G_{\phi \left( y^{(i)}\right) } \cap \mathcal {C} \right) - 1}{vol\left( \mathcal {C}\right) } \end{aligned}$$
(11)

By replacing \(\mathcal {C}\) in (11) with \(R\left( \mathcal {L}_l\left( i\right) \right) \), we derive the following equation:

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, R\left( \mathcal {L}_l\left( i\right) \right) \right)&= \frac{vol\left( G_{\phi \left( y^{(i)}\right) } \cap R\left( \mathcal {L}_l\left( i\right) \right) \right) - 1}{vol\left( R\left( \mathcal {L}_l\left( i\right) \right) \right) } \nonumber \\&= \frac{vol\left( R\left( F \wedge \mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \right) \right) - 1}{vol\left( R\left( \mathcal {L}_l\left( i\right) \right) \right) } \end{aligned}$$
(12)

From Lemma 1, (12) is transformed as follows:

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, R\left( \mathcal {L}_l\left( i\right) \right) \right)&= \frac{\#_{\mathbb {B}}\left( F \wedge \mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \right) - 1}{\#_{\mathbb {B}}\left( \mathcal {L}_l\left( i\right) \right) } \nonumber \\&= \frac{\#_{\mathbb {B}}\left( F \wedge \mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \right) - 1}{2^{(m-l)u}} \end{aligned}$$
(13)

Lemma 2

In the BDD that represents F, let \(\alpha _{i, l}\) be the node that can be reached from the function node F through the path defined by \(\mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \). Let c be the number of complement edges on the path. Then, it holds that

$$\begin{aligned} \#_{\mathbb {B}}\left( F \wedge \mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \right) = {\left\{ \begin{array}{ll} \displaystyle {\frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\alpha _{i, l}}^+\right) }{2^{L+lu}}} &{} \mathrm{{if}}\,c\, \mathrm{{is\,even}}, \\ \displaystyle {\frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\alpha _{i, l}}^-\right) }{2^{L+lu}}} &{} \mathrm{{if}}\,c\, \mathrm{{is\,odd}}. \end{array}\right. } \end{aligned}$$

Proof

\(F \wedge \mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \) means that the \(L+lu\) Boolean variables that appear in \(\mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \) are fixed to specific values in F. On the other hand, \(\mathcal {N}_{\alpha _{i, l}}^+\) (\(\mathcal {N}_{\alpha _{i, l}}^-\)) means that the \(L + lu\) Boolean variables in \(\mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \) are smoothed (see Footnote 1) in F if c is even (odd). Therefore, the number of minterms of \(\mathcal {N}_{\alpha _{i, l}}^+\) (\(\mathcal {N}_{\alpha _{i, l}}^-\)) is \(2^{L+lu}\) times larger than that of \(F \wedge \mathtt{CodeY}(y^{(i)}) \wedge \mathcal {L}_l\left( i\right) \). \(\square \)

Theorem 1

Leave-one-out density \(\rho _\mathrm{{LOO}}\) is evaluated on the basis of the BDD that represents the initial Boolean function F as follows:

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, R\left( \mathcal {L}_l\left( i\right) \right) \right) = {\left\{ \begin{array}{ll} \displaystyle {\frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\alpha _{i, l}}^+\right) - 2^{L+lu}}{2^{L+mu}}} &{} \mathrm{{if}}\,c\,\mathrm{{is\,even}}, \\ \displaystyle {\frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{\alpha _{i, l}}^-\right) - 2^{L+lu}}{2^{L+mu}}} &{} \mathrm{{if}}\,c\,\mathrm{{is\,odd}}. \end{array}\right. } \end{aligned}$$

Proof

It follows from (13) and Lemma 2 immediately. \(\square \)

It is worth mentioning that, by using Theorem 1, we can evaluate the leave-one-out density directly from the initial Boolean function F without any leave-one-out operation. For example, we consider the dataset shown in Fig. 1 and the datum in \([5, 6) \times [2, 3)\). The datum is coded as (8), and the corresponding path in Fig. 2 is \(F\bullet \text {a}\circ \text {b}\circ \text {e}\circ \text {k}\circ \text {q}\circ \text {s}\bullet 1\), where \(\circ \) and \(\bullet \) denote non-complement and complement edges, respectively. The leave-one-out density of the datum is calculated from Theorem 1 and Table 1 as follows:

$$\begin{aligned} \rho _\mathrm{{LOO}}\left( i, R\left( \mathcal {L}_0\left( i\right) \right) \right)&= \frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_\mathrm{{a}}^-\right) - 2^{0}}{2^{6}} = \frac{19 - 2^0}{2^6} = \frac{18}{64}, \\ \rho _\mathrm{{LOO}}\left( i, R\left( \mathcal {L}_1\left( i\right) \right) \right)&= \frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_\mathrm{{e}}^-\right) - 2^{2}}{2^{6}} = \frac{28 - 2^2}{2^6} = \frac{6}{16}, \\ \rho _\mathrm{{LOO}}\left( i, R\left( \mathcal {L}_2\left( i\right) \right) \right)&= \frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_\mathrm{{q}}^-\right) - 2^{4}}{2^{6}} = \frac{32 - 2^4}{2^6} = \frac{1}{4}, \\ \rho _\mathrm{{LOO}}\left( i, R\left( \mathcal {L}_3\left( i\right) \right) \right)&= \frac{\#_{\left\{ \mathbb {B}, \mathbb {D}\right\} }\left( \mathcal {N}_{1}^+\right) - 2^{6}}{2^{6}} = \frac{64 - 2^6}{2^6} = 0. \end{aligned}$$

We can see that these values are equal to those shown in Fig. 6.

4.3 The proposed algorithm and computational complexity

We propose Algorithm 1, which calculates the outlier score of each datum in D.

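Algorithm 1 essentially (i) constructs the initial Boolean function F as a BDD, (ii) computes the minterm counts of Sect. 4.2.1, and (iii) obtains each outlier score by following the corresponding path as in Theorem 1. As an illustration only, the Python sketch below reproduces the same scores by evaluating Eqs. (9)–(11) naively on the occupied unit cells instead of on a BDD; it mirrors the logic but not the efficiency of the algorithm, assumes that duplicates, if any, collapse into the same unit cell, and uses illustrative names throughout.

```python
import numpy as np
from collections import defaultdict

def outlier_scores(X, Y, m=12, lam=8):
    """Scores of Eq. (6): S(i) = max_l rho_LOO(i, R(L_l(i))), via Eqs. (9)-(11)."""
    # 1. Normalize into [0, 2^m)^u (the scaling normalizer of Eq. (1)) and take unit cells.
    scaled = (X - X.mean(axis=0)) / X.std(axis=0)
    Z = (scaled - scaled.min()) / (scaled.max() - scaled.min()) * (2 ** m - 1e-9)
    cells = np.floor(Z).astype(np.int64)

    # 2. Initial region vector: occupied cells grouped by the categorical tuple.
    G = defaultdict(set)
    for c, y in zip(cells, Y):
        G[tuple(y)].add(tuple(c))

    # 3. Score each datum by counting cells of G_{phi(y)} inside each prefix hypercube.
    u = X.shape[1]
    scores = np.empty(len(X))
    for i, (c, y) in enumerate(zip(cells, Y)):
        cells_y = G[tuple(y)]
        best = 0.0
        for l in range(lam + 1):
            side = 2 ** (m - l)
            lo = (c // side) * side
            inside = sum(all(lo[j] <= g[j] < lo[j] + side for j in range(u)) for g in cells_y)
            best = max(best, (inside - 1) / float(side ** u))   # leave one out, Eq. (11)
        scores[i] = best
    return scores
```

On large datasets this sketch is quadratic in N, whereas the BDD-based Algorithm 1 avoids this, as discussed below.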

In Algorithm 1, the time complexity of constructing the initial Boolean function F is approximately \(O(\Phi N)\), where \(\Phi \) denotes the number of nodes of the created BDD, because logical operations between BDDs are, in practice, almost linear in the size of the BDDs (Brace et al. 1990). The size of the created BDD depends on m, u, N, and the characteristics of the dataset; however, it is difficult to specify this relationship analytically. The size of the BDD can be exponentially large in the worst case, but it is compact for the realistic datasets used in our experiments. The time complexity of calculating the number of minterms is \(O(\Phi )\), as mentioned in Sect. 4.2.1. The time complexity of calculating the outlier scores is \(O((L+mu)N)\) because the depth of the BDD is \(L+mu\). Consequently, the time complexity of Algorithm 1 is \(O((\Phi +L+mu)N)\). Therefore, the proposed method can deal with a large dataset efficiently unless the number of Boolean variables and the created BDD become intractably large.

5 Experimental results

We first examine the basic behavior of the proposed method by using synthetic datasets whose data are distributed nonlinearly in a two-dimensional space. Then we investigate the impact of the parameter settings of the proposed method and compare the proposed method with existing methods by using realistic datasets. We implemented the proposed method as a C program with the help of the CUDD package (Somenzi 2012). We refer to the proposed method as ODBDD in this section. In ODBDD, we used \(\xi \) in (1) as the example normalizer. The accuracy of detecting outliers is evaluated in terms of the area under the ROC curve (AUC) (Fawcett 2006). The AUC value ranges from 0 to 1, and a higher AUC value implies better accuracy. The experiment was performed on an Intel Core i7 CPU (3.20 GHz) machine with 64 GB RAM, running Microsoft Windows 7.

5.1 Evaluation with synthetic datasets

We generated a synthetic dataset called the Ten dataset, which consists of two continuous attributes and no categorical attribute. 95% of the data in the Ten dataset are distributed randomly inside the shape of “10”. The remaining 5% are distributed randomly outside the shape and are regarded as outliers. The number of data elements is set to \(10^3\), \(10^4\), \(10^5\), or \(10^6\), and we generated ten datasets randomly for each setting. The computation time and the AUC values are averaged over these datasets in the evaluation. An example of the Ten-\(10^3\) dataset is shown on the left side of Fig. 7, in which the “+”-marks denote the generated outliers.

Fig. 7

An example of the Ten-\(10^3\) dataset in which true outliers are shown as “+”-marks (left). The “\(\times \)”-marks denote the outliers detected by ODBDD (right)

We calculated the outlier score of each datum in the Ten dataset by ODBDD, in which we set \(m=16\) and \(\lambda =16\). Table 2 shows the mean and standard deviation of the computation time and AUC values. We can see that the computation time of ODBDD increases almost linearly with respect to the size of the dataset. Further, ODBDD achieves good AUC values for these datasets. The right side of Fig. 7 shows the outliers detected by ODBDD, in which the 5% of the data with the lowest outlier scores are detected as outliers and plotted as “\(\times \)”-marks.

Table 2 The mean and standard deviation of the computation time (in seconds) and the AUC values for the Ten datasets

5.2 Evaluation with realistic datasets

5.2.1 Datasets

We used 9 datasets from the UCI machine learning repository (Bache and Lichman 2013): Ionosphere, Yeast, Abalone, Magic, Adult, Shuttle, Bank (Moro et al. 2014), Covertype (Blackard and Dean 1999), and Kddcup99. All of these datasets were originally arranged for classification tasks. In order to utilize them for the evaluation of outlier detection algorithms, we carried out the following procedure.

For each dataset except the Kddcup99 dataset, we apply the following steps (a code sketch of the procedure is given after the list):

1. Select the class of the maximum size as the target class. The size of the target class is denoted by \(N_T\).

2. Pick out all the data that belong to the target class from the dataset.

3. Pick out \(N_o = \lceil 0.01N_T \rceil \) data randomly from the classes other than the target class.

4. Combine the data obtained in Step 2 and Step 3 to generate a new dataset, in which the data from Step 3 are considered to be outliers.

5. Repeat Step 3 and Step 4 50 times to generate 50 different datasets.

6. Evaluate the computation time and the AUC value with each generated dataset. The averages over the 50 datasets are used for evaluation.
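A sketch of Steps 1–4 for a single draw, assuming a numeric feature matrix X and a label vector (the function and variable names are illustrative only), might look as follows.

```python
import numpy as np

def make_outlier_dataset(X, labels, rng):
    """One evaluation dataset: all data of the largest class plus ~1% sampled 'outliers'."""
    classes, counts = np.unique(labels, return_counts=True)
    target = classes[np.argmax(counts)]                    # Step 1: class of maximum size
    inliers = X[labels == target]                          # Step 2
    others = X[labels != target]
    n_o = int(np.ceil(0.01 * len(inliers)))                # Step 3: N_o = ceil(0.01 * N_T)
    outliers = others[rng.choice(len(others), size=n_o, replace=False)]
    data = np.vstack([inliers, outliers])                  # Step 4
    truth = np.r_[np.zeros(len(inliers)), np.ones(n_o)]    # 1 marks a generated outlier
    return data, truth

# Steps 5 and 6: repeat the sampling 50 times and average the evaluation metrics.
# rng = np.random.default_rng(0)
# datasets = [make_outlier_dataset(X, labels, rng) for _ in range(50)]
```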

Table 3 shows the target class, \(N_T\), and \(N_o\) for each dataset. The numbers of categorical attributes and continuous attributes included in each dataset are also shown in Table 3.

Table 3 An overview of the UCI datasets used in the experiment

The Kddcup99 dataset consists of network access data, which include both normal accesses and several types of attacks. We set the normal class as the target class in Step 1 of the procedure for the Kddcup99 dataset. The task can then be regarded as a practical problem of discovering attacks in network access log data, in which most of the logs come from normal accesses, whereas some attacks are also included as outliers. Following Yamanishi et al. (2004), we transformed two continuous attributes, src_bytes and dst_bytes, by \(y=\log (x+0.01)\), because they were concentrated around 0. Moreover, we regarded continuous attributes that have fewer than 10 unique values as categorical attributes and removed the attribute num_outbound_cmds since it has only one unique value. In addition to the setting in which all attributes are used, we consider the setting in which only the four attributes {service, duration, src_bytes, dst_bytes} are used, because they are thought to be the most essential attributes (Yamanishi et al. 2004). In Table 3, Kddcup99_4a indicates the setting with these four attributes.

5.2.2 Investigating the impact of the parameter setting

The proposed method has parameters m, which defines the bit length used to code each continuous attribute, and \(\lambda \ (\le m)\), which determines the hypercubes included in \(\tilde{\mathcal {C}}(i)\). As mentioned in Sect. 4.1, the set \(\tilde{\mathcal {C}}(i)\) includes smaller hypercubes as \(\lambda \) is set larger. To examine the impact of these parameters, we evaluated the AUC values and the computation time of ODBDD with various parameter settings.

Figure 8 shows the AUC values obtained with ODBDD, in which we set \(m=8, 12, 16\) and \(\lambda =1,\ldots ,m\) for each m. The first observation from Fig. 8 is that, for most datasets, the AUC value gets larger and converges to a specific value as \(\lambda \) increases. However, the AUC values get worse as \(\lambda \) increases for the Kddcup99 dataset. Such a case can happen when the outliers in a dataset lie close to each other and form small, locally dense clusters. The leave-one-out density of such outliers grows as the hypercube used to evaluate it gets smaller. The outliers generated from the Kddcup99 dataset indeed have very small variance in the original 40 dimensions, since the attacks included in the Kddcup99 dataset are mostly from either the smurf or the neptune class. The second observation from Fig. 8 is that the AUC values mainly depend on \(\lambda \). The setting of m has little impact on the AUC values for these datasets, except for the Shuttle dataset and the Kddcup99_4a dataset, for which the AUC values become relatively lower with \(m=8\) than with \(m=12, 16\). These results suggest that \(m \ge 12\) is a reasonable choice for the proposed method in terms of the AUC values.

Fig. 8

AUC values obtained by ODBDD with \(m=8, 12, 16\) and \(\lambda =1,\ldots ,m\) for each m

Figure 9 shows the computation time of ODBDD with \(m=8, 12, 16\) and \(\lambda =1,\ldots ,m\) for each m. From Fig. 9, we see that the choice of \(\lambda \) has little impact on the computation time of ODBDD. Accordingly, the choice of \(\lambda \) mainly affects the AUC values and not the computation time. On the other hand, from Fig. 9, we find that the computation time of ODBDD increases as m is set larger for most datasets, except for the Ionosphere, Yeast, and Abalone datasets, whose sizes are relatively small. The reason why the computation time increases as m gets larger is that the size of the BDD representing the initial Boolean function F tends to grow as the number of Boolean variables on which F is defined increases. To further investigate the impact of m on the computation time, we conducted experiments with \(m=1,\ldots ,16\), in which \(\lambda \) is fixed to 1. The results are shown in Fig. 10, where the computation time is scaled by the result at \(m=1\) for each dataset. From Fig. 10, we see that the rate of increase of the computation time is moderate with respect to m. The computation time for the Covertype dataset increases rapidly compared to that for the Kddcup99 dataset, even though both datasets include hundreds of thousands of data points with dozens of attributes. A reason for this can be seen in Fig. 11, which illustrates the size of the BDD with respect to the number of data points processed in the construction of the initial Boolean formula F. The size of the BDD for the Covertype dataset grows more rapidly than that for the Kddcup99 dataset. As the BDD becomes large, processing each data point for constructing F gets slower. The growth rate of the BDD depends on the characteristics of the dataset being processed.

Fig. 9

Computation time of ODBDD with \(m=8, 12, 16\) and \(\lambda =1,\ldots ,m\) for each m

Fig. 10

Computation time of ODBDD with \(m=1,\ldots ,16\), in which \(\lambda \) is fixed to 1. Time is scaled by the result at \(m=1\) on each dataset

Fig. 11

The size of the BDD versus the number of processed data points in the construction of the initial Boolean formula F

From the viewpoint of computational efficiency, it is better to set m as small as possible. However, the AUC value gets worse if m is set too small as mentioned above. From Fig. 8 and Fig. 9, it seems that \(m = 12\) is a reasonable choice for the proposed method.

5.2.3 Settings of methods to compare

We compared the proposed method with existing outlier detection methods: the one-class support vector machine (OCSVM), the local outlier factor (LOF), the Orca algorithm, and the isolation forest (iForest). The experiment was conducted in the R environment (R Core Team 2014). The ksvm function in the kernlab package (Karatzoglou et al. 2004) was used for OCSVM, the lofactor function in the DMwR package (Torgo 2010) was used for LOF, the binary implementation provided by Bay (2013) was used for Orca, and the IsolationTrees function in the IsolationForest package (Liu 2009) was used for iForest. In OCSVM, the Gaussian kernel was used and the kernel parameter \(\gamma \) was set to one of the 10%, 50%, and 90% quantiles of the distance between samples (Karatzoglou et al. 2004), which are referred to as \(\gamma _{0.1}\), \(\gamma _{0.5}\), and \(\gamma _{0.9}\), respectively. The parameter \(\nu \) was fixed to \(\nu =0.1\) in OCSVM. In LOF, the number of neighbors was set to either \(k=10\) or \(k=50\). In the execution of Orca, we used the default parameter settings of the binary implementation provided by Bay (2013), including the nearest neighbor parameter \(k=5\), except for the parameter n, which we set to 5% of the size of the dataset to be analyzed. The iForest was executed with the default settings of the IsolationTrees function. In OCSVM, LOF, and iForest, continuous attributes were scaled and categorical attributes were coded by using dummy variables. Moreover, the time required to load each dataset into the R environment using the read.delim function was added to the computation time of OCSVM, LOF, and iForest, because these methods are implemented as R functions. In ODBDD, we set \(m=12\) and \(\lambda =8\).

5.2.4 Comparing the proposed method with other methods

Table 4 shows the computation time for the UCI datasets in Table 3. Note that the numbers in parentheses are results evaluated with only one of the 50 datasets generated by the procedure in Sect. 5.2.1, since the computation time was relatively long in these settings. TO means that the execution on a dataset was not completed within the timeout limit of 48 h. From Table 4, we see that the computation time of the methods other than ODBDD and iForest increases rapidly with respect to the size of the dataset. There is not much difference between ODBDD and iForest in terms of computation time; however, ODBDD seems to work faster than iForest when the dataset is large and has only a few attributes, as with the Kddcup99_4a dataset.

Table 4 Computation time for UCI datasets (in seconds)

Table 5 shows the AUC values for UCI datasets. Parentheses and TO mean the same as in Table 4. In Table 5, bold values indicate the top two highest scores for each dataset.

Table 5 AUC values for UCI datasets

From Table 5, we cannot find a single method that works best for all datasets. LOF achieves better results than the other methods for datasets that are not so large. LOF might work better for large datasets if the parameter k were tuned carefully; however, it is generally very difficult to choose a good parameter value in advance. Note that this is also true for the other methods used in the experiment. ODBDD performs better than the others for datasets that are relatively large. On the other hand, the performance of ODBDD is not as good as that of the others for small datasets. This result suggests that the leave-one-out density used in ODBDD works well as a measure for outlier detection if the dataset is sufficiently large.

As mentioned in Sect. 5.2.1, the experiments with the Kddcup99 dataset correspond to a practical problem of analyzing access logs to find attacks. Table 4 and Table 5 show that we can efficiently and accurately find attacks in the dataset with the proposed method. A valuable finding is that we can achieve better AUC values by using only the 4 attributes than by using the original 40 attributes. Moreover, we can reduce the computation time by using the 4 attributes. It follows from these results that it is very important to carefully select the attributes to be used for outlier detection, in terms of both accuracy and efficiency.

The computation time of Orca in Table 4 might seem much worse than that reported in the literature (Bay and Schwabacher 2003; Ghoting et al. 2008). This is mainly caused by the setting of the parameter n of Orca, which is often set to \(n=30\) in the literature. We instead set n to 5% of the size of the dataset in the experiment, because the AUC values of Orca would be unreasonably low if n were set too small. For reference, we executed Orca with \(n=30\) for the datasets in Table 3. The computation time of Orca was then, for example, 176.2, 96.4, and 492.6 s for the Covertype, Kddcup99_4a, and Kddcup99 datasets, respectively. This shows that Orca is comparable to ODBDD or iForest in terms of computational efficiency when n is small enough, although the AUC values of Orca were then almost 0.5 for all these datasets. This is inevitable because in this case only 30 data points have high outlier scores and the others are assumed to have the same low score.

6 Conclusions and future work

In this work we proposed a novel approach for outlier detection in which the score of being an outlier is defined on the basis of the leave-one-out density. We proposed an algorithm to evaluate the leave-one-out density very efficiently by processing a BDD that represents the given dataset as a logical formula. The proposed method can deal with a large dataset efficiently, because the time complexity is nearly linear unless the created BDD gets intractably huge.

This work can be extended in several ways. The accuracy of the outlier detection may be improved by modifying or enriching the region set \(\tilde{\mathcal {D}}(i)\). It is possible to modify \(\tilde{\mathcal {D}}(i)\) by changing the example normalizer used in the proposed method. If we employ a nonlinear normalizer, a hypercube in a projected space corresponds to a nonlinear region in the original space, which may lead to more precise outlier detection. Further, we can enrich \(\tilde{\mathcal {D}}(i)\) by using various normalizers in parallel and integrating the results. A simple way to generate various normalizers is to incorporate a random rotation into the normalizer used in this work.

The proposed method may suffer from the curse of dimensionality when a dataset has many continuous attributes and the number of data elements is not sufficient, because the leave-one-out density is then zero in almost every subregion of the whole hypercube. A simple solution is to embed some dimension reduction method into the normalizer. Another possible solution is to repeat outlier detection with small subsets of attributes and to combine the results, as proposed in Lazarevic and Kumar (2005).

The leave-one-out density is proposed in this paper as a measure for detecting outliers, in which the data samples around the focused datum are enumerated under the condition that their categorical attributes are identical to those of the focused datum. However, such a condition can be too strong to detect outliers effectively for a dataset that has many categorical attributes, because most of the data points have a very low leave-one-out density if few data points have exactly the same categorical attributes. Another problem arises when we deal with a dataset that is completely categorical: the leave-one-out density is then essentially equivalent to merely examining the cell frequencies in the contingency table. To overcome these limitations, we need to consider how to relax the condition that the categorical attributes be identical in the definition of the leave-one-out density.

The proposed method outputs an outlier score for each datum in the dataset, but it does not provide a systematic way to determine the threshold to which the score is compared when judging whether the datum is an outlier or not. As far as we know, there are few outlier detection methods that provide a systematic way to determine such a threshold. For example, the one-class SVM generally assumes that the threshold is zero when learning a hyperplane. We need to develop a method to determine the threshold if we want to use the proposed method in a completely automatic manner.