1 Introduction

In data mining, anomaly detection is the identification of samples that do not follow the pattern or behavior exhibited by most of the elements in a dataset. Anomaly detection techniques have become important tools in several real-world applications such as malicious activity detection [1], intrusion detection [2,3,4], fraud detection [5, 6], surveillance [7, 8], and others. For instance, in businesses where customers execute financial transactions between accounts, such as banks or telecommunication companies, an anomaly detection system is very useful because it alerts on unusual account behavior over a period of time [5, 9].

In anomaly detection, one of the most successful methods is the autoencoder, from Deep Learning. Autoencoder networks learn a compressed representation of the input data, providing an efficient reconstructed output by reducing the input dimensionality [10]. Moreover, for anomaly detection the relevant quantity is not the output itself but the difference between the input and the output, known as the reconstruction error. Given the characteristics of autoencoders, a high reconstruction error (or score) indicates the occurrence of an anomaly [10].

Samples with high scores are flagged as anomalies. To do so, a score threshold to compare against must be selected [4, 8, 10]. Usually this value is either set by a human expert or estimated statistically by assuming a theoretical distribution. For this reason, this paper introduces an iterative training method that estimates the anomaly score threshold.

This paper is organized as follows: in Sect. 2, the related works are discussed. In Sect. 3, the proposed autoencoder pipeline method is introduced. In Sect. 4, the experimental results are shown and discussed. Finally, in Sect. 5, the conclusions and future work directions are presented.

2 Related Works

The rule-based approach is one of the mainstream techniques in the field of anomaly detection, including malicious activity, intrusion, or fraud detection [1]. Nowadays, a common scenario requires processing a large volume of unlabeled data, where rule-based approaches fail and only unsupervised algorithms are able to support anomaly detection systems. The expansion of computational power boosts the application of Deep Learning, whose unsupervised algorithms can be used to process large volumes of unlabeled data. These algorithms include autoencoder networks [6, 11].

2.1 Anomaly Detection with Autoencoders

Autoencoders are a good option in the absence of ground truth. Since the introduction of the replicator neural network as an outlier detection tool [12], autoencoder networks have been used to solve anomaly detection problems [6, 10, 11]. This kind of network consists of two parts: the encoder, shaped like a funnel, and the decoder, which expands back out to the full input dimensionality at the output layer [10].

The autoencoder structure allows the network to learn a compressed, lower-dimensional representation of the input data. The output of an autoencoder is a reconstruction of the input data computed in the most efficient way [10]. One of the most interesting characteristics of autoencoders, as a variant of feed-forward neural networks, is the presence of an extra bias that allows the network to recognize normal regions in the feature space and to compute the reconstruction error [10, 11]. As a consequence, a high reconstruction error indicates an anomaly.
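To make the notion concrete, the reconstruction error of a sample can be computed as the mean squared difference between the input vector and the autoencoder's output. The following minimal Java helper is an illustrative sketch, not code from the cited works:

// Hypothetical helper (not from the cited works): reconstruction error of one
// sample as the mean squared difference between input and reconstruction.
final class ReconstructionError {
    static double of(double[] input, double[] reconstruction) {
        double sum = 0.0;
        for (int i = 0; i < input.length; i++) {
            double diff = input[i] - reconstruction[i];
            sum += diff * diff;
        }
        return sum / input.length; // a high value indicates a likely anomaly
    }
}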

There is a probabilistic version of the autoencoder, known as the Variational Autoencoder (VAE) [13]. The main advantage of a VAE over a plain autoencoder network is its probabilistic output: a reconstruction probability is used as the anomaly score instead of a reconstruction error. As stated in the literature, probabilities do not require model-specific thresholds for flagging an evaluated sample as an anomaly, since they provide a more principled and objective measure [13]. However, a threshold is still required to identify the boundaries and to judge properly what "high" means.

Searching for the best anomaly score threshold for automatic anomaly recognition is not a trivial task. Common approaches include setting the anomaly score threshold by a human expert or estimating it from a heuristic (e.g. the three-sigma rule [14]) under the assumption that the dataset fits a theoretical distribution [10].
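For reference, the three-sigma rule sets the threshold at the mean of the scores plus three standard deviations, under the assumption that the scores follow an approximately normal distribution. A minimal sketch of this heuristic (illustrative only) is:

// Sketch of the three-sigma heuristic: threshold = mean + 3 * standard deviation,
// assuming the anomaly scores roughly follow a normal distribution.
final class ThreeSigma {
    static double threshold(double[] scores) {
        double mean = 0.0;
        for (double s : scores) mean += s;
        mean /= scores.length;

        double variance = 0.0;
        for (double s : scores) variance += (s - mean) * (s - mean);
        variance /= scores.length;

        return mean + 3.0 * Math.sqrt(variance);
    }
}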

In the literature, a method for network intrusion detection that attempts to compute the anomaly score threshold was reported [4]. Its training process uses normal samples only, and each autoencoder in the ensemble computes its own anomaly score threshold by selecting the maximum score observed over the training samples.

The reviewed applications of autoencoders are focused on detecting anomalies in specific datasets: for example, the classical detection of outlier digits in the MNIST database [10], anomaly detection in accounting data [6], in continuous video streams [7], or in network intrusion detection [15]. In all of these applications, the anomaly score threshold is a parameter, and estimating its value is an expert task.

At this point, we conclude that there are no reported solutions (nor even an exploration) for automatically obtaining the anomaly score threshold from the autoencoders themselves. This paper introduces the Autoencoders Pipeline as a valid method to estimate the normality limits.

3 Proposed Method

The goal of the method is to compute the anomaly score threshold. The idea is to arrange and train the autoencoders in sequence, resulting in an iterative training method from which the anomaly score threshold can be obtained. This approach is called “Autoencoders Pipeline” (AEP).

3.1 Autoencoders Pipeline

AEP starts as a regular training: the dataset is split into a training set and an evaluation set. At each iteration, a new autoencoder network is trained while the normal samples remain in the evaluation set. A normal sample is defined as an evaluated sample whose score is below the expected anomaly score threshold for the iteration. All anomaly candidates are reintegrated for reprocessing in upcoming iterations.

In the first iteration, the scoreThreshold is initialized as follows:

$$\begin{aligned} scoreThreshold = min(score_0) \end{aligned}$$
(1)

and the scoreIncrement computed as follows:

$$\begin{aligned} scoreIncrement = \frac{max(score_0) - min(score_0)}{100} \end{aligned}$$
(2)

where \(score_0\) is the vector of scores of the evaluated samples in the first iteration.

On each iteration, the score of every evaluated sample is compared with \(scoreThreshold + scoreIncrement\). Every sample with a score greater than this value is considered an anomaly candidate. If anomaly candidates are collected at the end of the iteration, the scoreThreshold is updated as follows:

$$\begin{aligned} scoreThreshold = scoreThreshold + scoreIncrement. \end{aligned}$$
(3)

When all evaluated samples in the iteration are anomaly candidates, the stop condition of the algorithm is reached and the final scoreThreshold is computed as follows:

$$\begin{aligned} scoreThreshold = \frac{scoreThreshold + min(score_l)}{2} \end{aligned}$$
(4)

where \(l\) indicates the last iteration.

If the stop condition is not reached, all anomaly candidates are merged back into the training set, which is split again at random to train a new autoencoder and start a new iteration. The output of the method is the best-trained autoencoder network from the first iteration and the anomaly score threshold. Figure 1 depicts an overview of how AEP works.

Fig. 1. Overview of the AEP algorithm.

Notice that as the anomaly candidates are merged back with the previous training set, a new set with a higher proportion of anomalies (less normality) is obtained. This smoothly degrades the learning capacity of the autoencoders trained on the new datasets, until the last autoencoder considers all evaluated samples to be anomalies. When the algorithm reaches this condition, the anomaly score threshold is close to the normality limits of the dataset.
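For clarity, the complete loop can be summarized in code. The following Java sketch follows Eqs. (1)-(4) and the description above; the Autoencoder interface and the exact re-splitting of the merged set are assumptions introduced for illustration and do not reproduce the original implementation.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Placeholder interface standing in for any trained autoencoder; it is not a
// Deeplearning4j class, only an assumption made for this sketch.
interface Autoencoder {
    void fit(List<double[]> trainingSet);   // train on the current training set
    double score(double[] sample);          // reconstruction error of a sample
}

final class AepSketch {
    // Returns the estimated anomaly score threshold following Eqs. (1)-(4).
    static double estimateThreshold(List<double[]> dataset, Supplier<Autoencoder> newAutoencoder) {
        List<double[]> pool = new ArrayList<>(dataset);
        Collections.shuffle(pool);
        int half = pool.size() / 2;
        List<double[]> training = new ArrayList<>(pool.subList(0, half));
        List<double[]> evaluation = new ArrayList<>(pool.subList(half, pool.size()));

        double scoreThreshold = Double.NaN;
        double scoreIncrement = Double.NaN;

        while (true) {
            Autoencoder ae = newAutoencoder.get();
            ae.fit(training);

            double[] scores = new double[evaluation.size()];
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < evaluation.size(); i++) {
                scores[i] = ae.score(evaluation.get(i));
                min = Math.min(min, scores[i]);
                max = Math.max(max, scores[i]);
            }

            if (Double.isNaN(scoreThreshold)) {           // first iteration: Eqs. (1) and (2)
                scoreThreshold = min;
                scoreIncrement = (max - min) / 100.0;
            }

            // Samples scoring above scoreThreshold + scoreIncrement are anomaly candidates.
            List<double[]> candidates = new ArrayList<>();
            List<double[]> normals = new ArrayList<>();
            for (int i = 0; i < evaluation.size(); i++) {
                if (scores[i] > scoreThreshold + scoreIncrement) candidates.add(evaluation.get(i));
                else normals.add(evaluation.get(i));
            }

            if (normals.isEmpty()) {                      // stop condition: Eq. (4)
                return (scoreThreshold + min) / 2.0;
            }

            scoreThreshold += scoreIncrement;             // Eq. (3)

            // One reading of the re-splitting step: candidates are merged back into
            // the training pool and re-split at random, while the samples judged
            // normal remain in the evaluation set.
            List<double[]> merged = new ArrayList<>(training);
            merged.addAll(candidates);
            Collections.shuffle(merged);
            int cut = merged.size() / 2;
            training = new ArrayList<>(merged.subList(0, cut));
            evaluation = new ArrayList<>(merged.subList(cut, merged.size()));
            evaluation.addAll(normals);
        }
    }
}

In this sketch the score increment is computed only once, from the scores of the first iteration, as prescribed by Eq. (2).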

4 Experiments

In this section, the experimental results obtained over two outlier datasets, following the training method explained in Sect. 3, are shown. All experiments were carried out on a personal computer with an Intel(R) Core(TM) i3-2100 CPU @ 3.10 GHz and 16 GB of RAM. The algorithm was implemented in Java, powered by Deeplearning4j, and executed on the Microsoft Windows 10 Professional operating system.

The datasets used in the experiments come from the Outlier Detection DataSets (ODDS) library. Each dataset was split into equal parts (50% for training and 50% for evaluation) as the experiment design approach. Each experiment comprises fifteen executions of the method, looking for a tendency or similarity in the estimated anomaly score thresholds that allows us to validate its effectiveness.

For the scope of the experiments, similar network configurations were used for the autoencoders. The main hyperparameters are Stochastic Gradient Descent (SGD) as the optimization algorithm, Xavier as the weight initializer, Rectified Linear Units (ReLU) as the activation function, RMSProp as the updater for each layer, and Mean Squared Error (MSE) as the network loss function. The input and output layer sizes depend on the dataset dimensionality. The encoder reduces the dimensionality to 75% of the previous layer at each hidden layer, reaching a bottleneck with a size close to 33% of the input size. The decoder increases the dimensionality back using the same values as the encoder in reverse order.
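As an illustration of this sizing rule, the following sketch derives the layer sizes from the input dimensionality; the rounding choices are assumptions and may differ from the configuration actually used in the experiments.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative layer-size calculation (not the authors' code): every hidden layer
// keeps roughly 75% of the previous layer's units until the bottleneck, fixed at
// about 33% of the input size, is reached; the decoder mirrors the encoder.
final class LayerSizing {
    static List<Integer> sizes(int inputSize) {
        int bottleneck = (int) Math.ceil(inputSize * 0.33);
        List<Integer> encoder = new ArrayList<>();
        encoder.add(inputSize);
        int size = inputSize;
        while (size > bottleneck) {
            size = Math.max(bottleneck, (int) Math.round(size * 0.75));
            encoder.add(size);
        }
        List<Integer> decoder = new ArrayList<>(encoder.subList(0, encoder.size() - 1));
        Collections.reverse(decoder);
        List<Integer> all = new ArrayList<>(encoder);
        all.addAll(decoder);
        return all;   // e.g. for a 30-dimensional input: [30, 23, 17, 13, 10, 13, 17, 23, 30]
    }
}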

The experiment results are shown in a table that includes the following columns: Anomaly Score Threshold (AST), Detected Anomalies (DA), True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN), Accuracy (AC), Precision (P), Recall (R), and F-measure (F1). Each row of the table represents an isolated execution of the method.

The anomaly score thresholds output by the individual executions are overall good estimations, with a few exceptions. From all these results, a better anomaly score threshold can be computed and, according to this value, an associated network can be selected. In the experimental results, the selection of a trained network by its proximity to the mean of the densest cluster is included. This cluster is an output of the single-linkage clustering algorithm [16], together with a heuristic that scores each cluster created in the process.
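One possible reading of this selection step is sketched below. Since the thresholds are one-dimensional, single-linkage clustering reduces to sorting the values and cutting at large gaps; the cut level and the cluster-scoring heuristic used here are assumptions, as the paper does not specify them.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: choose one of the fifteen estimated thresholds by
// clustering them, scoring clusters by size, and returning the member of the
// densest cluster that is closest to that cluster's mean.
final class ThresholdSelection {
    static double select(double[] thresholds) {
        double[] sorted = thresholds.clone();
        Arrays.sort(sorted);

        // Mean gap between consecutive values acts as the single-linkage cut level.
        double meanGap = (sorted[sorted.length - 1] - sorted[0]) / (sorted.length - 1);

        List<List<Double>> clusters = new ArrayList<>();
        List<Double> current = new ArrayList<>();
        current.add(sorted[0]);
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] - sorted[i - 1] > meanGap) {   // large gap: start a new cluster
                clusters.add(current);
                current = new ArrayList<>();
            }
            current.add(sorted[i]);
        }
        clusters.add(current);

        // Pick the densest (largest) cluster and return its member closest to its mean.
        List<Double> best = clusters.get(0);
        for (List<Double> c : clusters) if (c.size() > best.size()) best = c;

        double mean = 0.0;
        for (double t : best) mean += t;
        mean /= best.size();

        double chosen = best.get(0);
        for (double t : best) if (Math.abs(t - mean) < Math.abs(chosen - mean)) chosen = t;
        return chosen;
    }
}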

4.1 Arrhythmia Dataset Experiment

“The arrhythmia dataset is a multi-class classification dataset with dimensionality 279. There are five categorical attributes which are discarded here, totalling 274 attributes. The smallest classes, i.e., 3, 4, 5, 7, 8, 9, 14, 15 are combined to form the outliers class and the rest of the classes are combined to form the inliers class”.

In Table 1a, the composition of the arrhythmia dataset used in this experiment is presented. Table 1b enumerates the input and output layer sizes of the network. Furthermore, Table 1c summarizes the results of the executions. The best anomaly score threshold is 430.6269 but the algorithm selects . This value is the closest to the mean of the cluster from to . This determines the selection of the trained network associated with the selected execution.

Table 1. Arrhythmia dataset experiment

This is the result of a very poorly tuned network; notice that the F1 values are under 0.5. As mentioned before, the algorithm is not looking for the best network configuration for the dataset, but for the best possible anomaly score threshold. The average number of iterations to convergence was 62 (min: 37, max: 154).

4.2 Wisconsin-Breast Cancer Dataset Experiment

“The Wisconsin-Breast Cancer (Diagnostics) dataset (WBC) is a classification dataset with dimensionality 30, which records the measurements for breast cancer cases. There are two classes, benign and malignant. The malignant class of this dataset is downsampled to 21 points, which are considered as outliers, while points in the benign class are considered inliers”.

In Table 2a, the composition of the WBC dataset used in this experiment is presented. Table 2b enumerates the input and output layer sizes of the network. Furthermore, the results of the executions are listed in Table 2c. The best anomaly score threshold is 0.05666 but the unsupervised selection is due to its proximity to the mean of the cluster from to . This selection also includes the associated network.

Table 2. WBC dataset experiment

This is the result of a better-trained network than the one presented in Sect. 4.1. The configuration and training are not the best possible, but they are sufficient for the purpose of this paper. F1 values greater than 0.7 were achieved in some executions. The average number of iterations to convergence was 49 (min: 28, max: 101).

5 Conclusions

In this paper, the Autoencoders Pipeline was introduced as an iterative training method to find the anomaly score threshold of a dataset according to the network configuration, training, and tuning. The method was evaluated over two well-known datasets.

The reliability of the method has been demonstrated through a pair of experiments, and the results are encouraging. Based on the experiments, we conclude that it is possible to automatically compute the anomaly score threshold using the autoencoders themselves. In essence, an arrangement of autoencoders in a pipeline is required, along with a smooth degradation of the normality of the training set.

As future work, the unsupervised selection of the best network from several executions could be improved. Ideally, all networks from the cluster of best executions should work together, consolidating the anomaly criterion.