
Knowledge-Based Systems

Volume 190, 29 February 2020, 105191

Anomaly detection method using center offset measurement based on leverage principle

https://doi.org/10.1016/j.knosys.2019.105191

Abstract

Anomaly detection is an important branch of data mining and has been studied across diverse research areas and application domains. Many existing unsupervised anomaly detection methods have high computational complexity and many adjustable parameters. In addition, in most methods the proportion of anomalies must be estimated from long-term experience in order to set the threshold that separates normal instances from anomalies, which makes these algorithms subjective. This paper presents an anomaly detection algorithm based on the leverage principle. When detecting a testing instance, we replicate it many times and add the replicated data to the training dataset. The anomaly degree of the testing instance can then be assessed by measuring the offset of the dataset center. Meanwhile, an adaptive threshold setting method using the golden ratio is proposed to remove the subjectivity in distinguishing normal instances from anomalies. In the experiments, we compare the proposed anomaly detection algorithm with eight other detection methods and report the results in terms of AUC, the F1 score of the anomaly class and the running time. The results show that our algorithm achieves high detection performance with high efficiency, and that the proposed threshold setting method is highly practical for unsupervised anomaly detection.

Introduction

Anomaly detection refers to the problem of finding patterns in a dataset that do not conform to expected behavior. Since the 19th century, anomaly detection has been widely used in various fields. For instance, it calls attention to network intrusions, ensuring the security of users' information [1], [2], [3], and it assists medical diagnosis by detecting abnormal physiological signals [4], [5]. Anomaly detection can also analyze bank transaction data to prevent money-laundering crime [6], [7]. However, owing to differences in application fields, data scale and other factors, the performance of different methods varies. Therefore, anomaly detection remains a hot topic for many scholars [8].

Because data in different fields have different characteristics, the methods applied in various fields differ slightly. Most existing methods derive from anomaly detection methods such as the One-Class Support Vector Machine (SVM) [9] and the Local Outlier Factor (LOF) [10]. According to whether the training data carry exact normal or anomalous labels, scholars classify these methods into three categories: supervised methods that require labels for all instances, semi-supervised methods with partially labeled instances, and unsupervised methods without labeled instances [11]. Among them, unsupervised methods are the most widely used, since labeling data not only consumes considerable human and material resources but also introduces subjective error.

There are still some problems with conventional unsupervised methods such as k-Nearest Neighbor (KNN) [12], Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [13] and the Connectivity-based Outlier Factor (COF) [14]. (1) Many existing unsupervised anomaly detection methods have many adjustable parameters, and it is difficult for users to select an appropriate combination of parameter values; for instance, the parameter nu and the kernel parameter gamma in SVM [9], or the number of neighbors k in KNN [12]. (2) The subjectivity of setting the threshold introduces errors into the algorithms. For example, in LOF a certain percentile of the dispersion scores computed on the training dataset is taken as the threshold, where the percentile is the anomaly proportion set by the user; if the score of a testing instance exceeds the threshold, it is marked as abnormal. However, obtaining labels in practice is usually too costly, so the anomaly proportion of the data cannot be estimated accurately. Moreover, without labeled data to verify the accuracy and reasonableness of that estimate, a subjectively set anomaly proportion may cause too many false alarms. (3) Some methods have high time complexity, and some perform poorly on high-dimensional data [15], which makes them unsuitable for such data. (4) Some methods are not applicable when numerical features and binary features coexist, and discarding the binary features may cause information loss.

To address the weaknesses above, this paper presents a new unsupervised anomaly detection method, called multiplication lever (mLever). Exploiting the fact that anomalies are few and different, we replicate the testing instance and add the replicated data to the training dataset. The anomaly degree of the instance is then evaluated by measuring the deviation of the dataset center. The method has few parameters to set manually, linear time complexity and high accuracy. Moreover, it detects anomalies well in both low- and high-dimensional datasets and generalizes strongly. In addition, this paper presents a solution to the problem of setting the threshold. The obtained anomaly scores are fitted to a function, and an appropriate threshold is then selected according to its gradient. Instead of requiring users to estimate the anomaly ratio from experience, we analyze the distribution of anomaly scores in the training dataset and use the golden ratio to set the threshold more objectively.
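To make the center-offset idea concrete, the following is a minimal sketch, assuming the dataset center is the feature-wise mean, using the Euclidean norm, and choosing an illustrative replication factor k; none of these choices is fixed by the description above.

```python
import numpy as np

def center_offset_score(X_train, x_test, k=1000):
    """Score a test instance by how far adding k copies of it would shift
    the center (here: the mean) of the training data. The value of k, the
    mean as 'center' and the Euclidean norm are illustrative assumptions,
    not the paper's exact definitions."""
    X_train = np.asarray(X_train, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    n = X_train.shape[0]
    center = X_train.mean(axis=0)
    # Closed form for the center after appending k replicas of x_test,
    # so each score costs O(d) instead of rebuilding the dataset.
    new_center = (n * center + k * x_test) / (n + k)
    return np.linalg.norm(new_center - center)  # larger offset = more anomalous

# Usage: points near the data bulk shift the center little; outliers shift it a lot.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 3))
print(center_offset_score(X, np.zeros(3)))       # small offset for a typical point
print(center_offset_score(X, 8.0 * np.ones(3)))  # large offset for an outlier
```

Because the shifted center has a closed form, each score can be computed in time linear in the dimension, which illustrates how a linear-time implementation is plausible.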

The main contributions of this article are as follows:

  • (1)

An anomaly detection method based on the leverage principle, which measures the offset of the dataset center, is proposed. It has the advantages of few parameters to set, low time complexity, high accuracy and wide adaptability. In the experiments, it performs well on high-dimensional data, on datasets in which binary and numeric features coexist, and on datasets containing abnormal classes.

  • (2)

An adaptive threshold setting method using the golden ratio is proposed, which makes the distinction between normal instances and anomalies more objective and avoids the poor performance caused by errors in estimating the anomaly ratio (a minimal sketch of the idea follows this list).
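As a rough, hypothetical illustration of how a golden-ratio-based cut-off could be applied to a set of training anomaly scores (the paper's actual procedure fits the scores to a function and uses its gradient, as described in Section 3; the quantile-style rule below is only a stand-in):

```python
import numpy as np

GOLDEN_RATIO = (np.sqrt(5) - 1) / 2  # ~0.618

def golden_ratio_threshold(train_scores):
    """Hypothetical rule: place the threshold at the golden-ratio point of
    the sorted training-score distribution, instead of asking the user for
    an anomaly proportion. This is an illustrative simplification of the
    paper's fitting-based procedure."""
    s = np.sort(np.asarray(train_scores, dtype=float))
    return s[int(GOLDEN_RATIO * (len(s) - 1))]

def label_instances(test_scores, threshold):
    # 1 = anomaly, 0 = normal
    return (np.asarray(test_scores) > threshold).astype(int)
```

Any monotone anomaly score (such as the center-offset score sketched above) could feed such a rule; the key point is that no user-supplied anomaly proportion is required.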

The rest of this paper is organized as follows. Section 2 reviews related work on common and recent anomaly detection methods. Section 3 describes in detail the anomaly detection method and the threshold setting method proposed in this paper. Section 4 presents an experimental study in which the proposed method is compared with other anomaly detection methods and the results are analyzed. Section 5 concludes the paper.

Section snippets

Related work

Anomaly detection means searching a dataset for instances that do not conform to expected behavior. Researchers have proposed a number of solutions to the unsupervised anomaly detection problem. According to the assumptions and principles of each method, unsupervised anomaly detection can be divided into classification-based methods, clustering-based methods, nearest-neighbor methods and statistical methods, among others [8], [16], [17].

The classification-based detection methods,

Anomaly detection method by measuring dataset center offset

The ancient Greek scientist Archimedes proposed the leverage principle: when a lever is balanced, the distance from each end to the fulcrum is inversely proportional to the weight at that end. This section introduces the method of detecting anomalies by measuring the offset of the dataset center by means of the leverage principle. Table 1 lists the symbols used in this paper.
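As a hedged formalization of the lever analogy (assuming, for illustration only, that the center is the arithmetic mean c of the n training instances, that the testing instance x is replicated k times, and that the offset is measured with the Euclidean norm; the exact definitions follow the symbols of Table 1):

```latex
% Lever balance: masses m_1 and m_2 at distances d_1 and d_2 from the fulcrum.
\begin{equation}
  m_1 d_1 = m_2 d_2
\end{equation}

% Assumed center-offset formalization: adding k replicas of the testing
% instance x to a dataset of n instances with mean center c gives
\begin{equation}
  c' = \frac{n\,c + k\,x}{n + k},
  \qquad
  \lVert c' - c \rVert = \frac{k}{n + k}\,\lVert x - c \rVert .
\end{equation}
```

Under this assumption the offset grows linearly with the distance of x from the current center, mirroring how a heavy weight far from the fulcrum forces the balance point to move.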

According to the principle of leverage, the position of the fulcrum can be adjusted to maintain the overall balance. If the mass

Experiment study

This section presents detailed results for 12 sets of experiments designed to evaluate mLever. In the experiments, mLever is compared with other anomaly detection algorithms, including LOF, OneClassSVM (SVM), EllipticEnvelope (EE), iForest [19], CANF [23], SLDOF [28] and EDADS-1 (EDADS) [32]. Meanwhile, we modify iForest into giForest to demonstrate the effectiveness of the adaptive threshold setting method. LOF is a well-known density-based algorithm. OneClassSVM is an algorithm that
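For reference, a minimal sketch of the evaluation metrics named in the abstract (AUC and the F1 score of the anomaly class; running time is measured around the scoring step), assuming scikit-learn's metric functions and that label 1 marks anomalies; the datasets and settings of this section are not reproduced here.

```python
import time
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

def evaluate(y_true, scores, labels):
    """y_true: ground-truth labels (1 = anomaly); scores: continuous anomaly
    scores; labels: 0/1 predictions obtained after thresholding the scores."""
    return {
        "AUC": roc_auc_score(y_true, scores),
        "F1 (anomaly class)": f1_score(y_true, labels, pos_label=1),
    }

# Running time, e.g.:
# start = time.perf_counter()
# scores = detector.score(X_test)   # hypothetical detector interface
# elapsed = time.perf_counter() - start
```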

Conclusions

In this paper, an anomaly detection method based on the leverage principle is proposed that offers high efficiency and requires little parameter adjustment. Given that anomalies are far less numerous than normal instances, the center of the entire dataset can be calculated, and a new center can be obtained after adding the testing instance. The anomaly degree is measured by comparing the offset between the two centers. In addition, a threshold-setting method using the golden ratio is proposed to reduce the

Acknowledgments

The authors sincerely appreciate the spiritual encouragement of Chun-xu Chen's parents (Gui-yun Liu and You Chen), which keeps the authors going forward. The authors would also like to thank their colleagues in the machine learning group for discussions on this paper, and they appreciate the support of Lin Zhai and Hui-yuan Qi with language translation.

References (42)

  • S. Ray et al., Using statistical anomaly detection models to find clinical decision support malfunctions, J. Am. Med. Inform. Assoc. (2018)
  • P. Seebock et al., Unsupervised identification of disease marker candidates in retinal OCT imaging data, IEEE Trans. Med. Imaging (2019)
  • M. Carminati et al., Security evaluation of a banking fraud analysis system, ACM Trans. Priv. Secur. (2018)
  • J.J. Valero et al., Improving the security and QoE in mobile devices through an intelligent and adaptive continuous authentication system, Sensors (2018)
  • V. Chandola et al., Anomaly detection, ACM Comput. Surv. (2009)
  • V.N. Vapnik, The Nature of Statistical Learning Theory (2000)
  • M.M. Breunig et al., LOF, vol. 29 (2000)
  • M. Goldstein et al., A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data, PLOS ONE (2016)
  • S. Ramaswamy et al., Efficient Algorithms for Mining Outliers from Large Data Sets, vol. 29 (2000)
  • M. Ester, H.P. Kriegel, X. Xu, A density-based algorithm for discovering clusters...
  • J. Tang et al., Enhancing effectiveness of outlier detections for low density patterns

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.105191.