Anomaly detection method using center offset measurement based on leverage principle☆
Introduction
Anomaly detection refers to the problem of finding patterns in a dataset that do not conform to expected behavior. Since the 19th century, anomaly detection has been widely applied in various fields. For instance, it calls attention to network intrusions, helping to secure users' information [1], [2], [3]; it assists medical diagnosis by detecting abnormal physiological signals [4], [5]; and it can analyze bank transaction data to help prevent money-laundering crimes [6], [7]. However, owing to differences in application field, data scale and other factors, different methods perform differently. Therefore, anomaly detection remains a hot topic for many scholars [8].
Because data characteristics differ across fields, the methods applied in each field differ slightly. Most existing approaches derive from anomaly detection methods such as the One-Class Support Vector Machine (SVM) [9] and the Local Outlier Factor (LOF) [10]. According to whether the training data carry exact normal or anomalous labels, scholars classify these methods into three categories: supervised methods, which require labels for all instances; semi-supervised methods, with partially tagged instances; and unsupervised methods, with no tagged instances [11]. Among them, unsupervised methods are the most widely used, as labeling data not only consumes considerable human and material resources but also introduces subjective error.
Several problems remain in conventional unsupervised methods such as k-Nearest Neighbor (KNN) [12], Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [13] and the Connectivity-based Outlier Factor (COF) [14]: (1) Many existing unsupervised anomaly detection methods have many adjustable parameters, and it is difficult for users to select an appropriate combination of parameter values, for instance the regularization and kernel parameters in SVM [9], or the number of neighbors in KNN [12]. (2) The subjectivity of setting the threshold introduces error into the algorithms. For example, in LOF a certain percentile of the dispersion scores computed on the training dataset is defined as the threshold, where the percentile is the anomaly proportion set by the user; a testing instance whose score exceeds the threshold is marked as abnormal. In practice, however, obtaining labels is usually too costly, so the anomaly proportion cannot be accurately estimated. Moreover, with no tagged data to verify the accuracy and reasonableness of that estimate, a subjectively set anomaly proportion may cause too many false alarms. (3) Some methods have high time complexity, and some perform poorly on high-dimensional data [15], making them unsuitable for such data. (4) Some methods are not applicable when numerical and binary features coexist; moreover, discarding the binary features may cause information loss.
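To make problem (2) concrete, the following is a minimal Python sketch (not from the paper; the score values and the `contamination` ratio are illustrative) of percentile-based thresholding, in which the user-supplied anomaly proportion directly fixes the cutoff, so a wrong estimate directly mislabels instances:

```python
# Sketch of percentile-based thresholding as used by LOF-style detectors.
# The user-chosen `contamination` ratio (assumed here, not learned from data)
# sets which quantile of the training scores becomes the cutoff.
def percentile_threshold(train_scores, contamination):
    """Return the score cutoff: the (1 - contamination) quantile of train_scores."""
    ranked = sorted(train_scores)
    idx = min(int((1.0 - contamination) * len(ranked)), len(ranked) - 1)
    return ranked[idx]

# Hypothetical dispersion scores for ten training instances.
scores = [1.0, 1.1, 1.0, 1.2, 1.1, 5.0, 1.0, 1.3, 1.1, 6.2]
thr = percentile_threshold(scores, contamination=0.2)
flags = [s > thr for s in scores]  # instances above the cutoff are flagged
```

If the user had set `contamination=0.4` instead, four inliers with scores near 1.1 would be flagged as anomalies, which is exactly the false-alarm risk the text describes.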
To address the weaknesses above, this paper presents a new unsupervised anomaly detection method, called the multiplication lever (mLever). Exploiting the fact that anomalies are few and different, we replicate the testing instance and add the copies to the training dataset; the anomaly degree of the instance is then evaluated by measuring the resulting deviation of the dataset center. The method has few parameters to set manually, linear time complexity and high accuracy. Moreover, it detects well in both low- and high-dimensional datasets and generalizes strongly. In addition, this paper presents a solution to the threshold-setting problem: the obtained anomaly scores are fitted to a function, and an appropriate threshold is selected according to its gradient. Instead of estimating the anomaly ratio from the user's experience, we analyze the distribution of anomaly scores in the training dataset and use the golden ratio to set the threshold more objectively.
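A minimal sketch of the center-offset idea just described (the replication count `m` and the use of Euclidean distance are illustrative assumptions, not the paper's exact formulation):

```python
# Replicas of the testing instance pull the dataset center toward the
# instance; an outlier pulls the center further than an inlier does.
import math

def center(points):
    """Componentwise mean of a list of equal-length vectors."""
    n, dim = len(points), len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]

def center_offset_score(data, x, m=5):
    """Anomaly score: how far the center moves after adding m copies of x."""
    c0 = center(data)
    c1 = center(data + [x] * m)
    return math.dist(c0, c1)

data = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
inlier_score = center_offset_score(data, [0.5, 0.5])   # at the center: no shift
outlier_score = center_offset_score(data, [8.0, 8.0])  # far away: large shift
```

An instance sitting at the center moves it not at all, while a distant instance drags the center noticeably, which is the deviation mLever scores.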
The main contributions of this article are as follows:
- (1)
An anomaly detection method that measures the dataset center offset based on the leverage principle is proposed. It has the advantages of few parameters to set, low time complexity, high accuracy and wide adaptability. In the experiments it performs well on high-dimensional data, on datasets mixing binary and numeric features, and on datasets with abnormal classes.
- (2)
An adaptive threshold-setting method using the golden ratio is proposed, which distinguishes normal instances from anomalies more objectively and avoids the poor performance caused by errors in estimating the anomaly ratio.
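As a rough illustration of contribution (2): the paper fits the distribution of training anomaly scores and places the cutoff using the golden ratio, and the exact fitting procedure is given in Section 3. The sketch below is only one plausible simplification, taking the golden-section point of the sorted score range:

```python
# Illustrative simplification (assumed, not the paper's exact procedure):
# place the threshold at the golden-ratio point of the observed score range,
# so no user-supplied anomaly proportion is needed.
PHI = (5 ** 0.5 - 1) / 2  # ~0.618, the golden ratio conjugate

def golden_ratio_threshold(scores):
    ranked = sorted(scores)
    lo, hi = ranked[0], ranked[-1]
    return lo + PHI * (hi - lo)
```

The point of the design is that the cutoff is derived from the scores themselves rather than from a user's guess at the anomaly proportion.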
The rest of this paper is organized as follows: Section 2 reviews related work on common and recent anomaly detection methods; Section 3 describes in detail the proposed anomaly detection and threshold-setting methods; Section 4 presents an experimental study comparing the proposed method with other anomaly detection methods and analyzes the results; Section 5 concludes.
Related work
Anomaly detection searches for instances in a dataset that do not meet expected behavior. Researchers have proposed a number of solutions to the unsupervised anomaly detection problem. According to the assumptions and principles of each method, unsupervised anomaly detection can be divided into classification-based, clustering-based, nearest-neighbor and statistics-based methods, among others [8], [16], [17].
The classification-based detection methods,
Anomaly detection method by measuring dataset center offset
The ancient Greek scientist Archimedes proposed the principle of leverage: when a lever is balanced, the distance from each end to the fulcrum is inversely proportional to the weight at that end. This subsection introduces the method of detecting anomalies by measuring the dataset center offset by means of the leverage principle. Table 1 shows the symbols used in this paper.
According to the principle of leverage, the position of the fulcrum can be adjusted to maintain the overall balance. If the mass
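The balance relation the section builds on can be sketched briefly (a generic statics example, not the paper's notation): for a balanced lever, weight times arm length is equal on both sides, so the balancing fulcrum of a set of weighted points is their weighted mean, i.e. the center of mass.

```python
# Fulcrum position that balances weights placed along a 1-D lever:
# the weighted mean of the positions (center of mass).
def fulcrum(positions, weights):
    total = sum(weights)
    return sum(p * w for p, w in zip(positions, weights)) / total

# 2 kg at x=0 and 1 kg at x=3 balance at x=1, since 2*1 == 1*2
x = fulcrum([0.0, 3.0], [2.0, 1.0])
```

Adding mass at one point shifts this fulcrum toward it, which is the intuition behind measuring how far replicated copies of a testing instance shift the dataset center.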
Experiment study
This section presents detailed results for 12 sets of experiments designed to evaluate mLever. mLever is compared with other anomaly detection algorithms, including LOF, OneClassSVM (SVM), EllipticEnvelope (EE), iForest [19], CANF [23], SLDOF [28] and EDADS-1 (EDADS) [32]. Meanwhile, we improved iForest into giForest to demonstrate the effectiveness of the adaptive threshold-setting method. LOF is a well-known density-based algorithm. OneClassSVM is an algorithm that
Conclusions
In this paper, an anomaly detection method based on the leverage principle is proposed, with high efficiency and little parameter adjustment. On the premise that anomalies are far fewer than normal instances, the center of the entire dataset is calculated, and a new center is obtained after adding the testing instance; the anomaly degree is measured by comparing the offset between the two centers. In addition, a threshold-setting method using the golden ratio is proposed to reduce the
Acknowledgments
The authors sincerely appreciate the spiritual encouragement of Chun-xu Chen's parents (Gui-yun Liu and You Chen), which keeps the authors going forward. The authors would also like to thank their colleagues from the machine learning group for discussions on this paper, and they appreciate Lin Zhai and Hui-yuan Qi for their support with language translation.
References (42)
- et al., Anomaly detection in wide area network meshes using two machine learning algorithms, Future Gener. Comput. Syst. (2019)
- et al., Recent progress of anomaly detection, Complexity (2019)
- et al., An intelligent algorithm with feature selection and decision rules applied to anomaly intrusion detection, Appl. Soft Comput. (2012)
- et al., Canf: clustering and anomaly detection method using nearest and farthest neighbor, Future Gener. Comput. Syst. (2018)
- et al., Outlier detection using neighborhood rank difference, Pattern Recognit. Lett. (2015)
- et al., Shared nearest neighbors based outlier detection for biological sequences, Int. J. Digit. Content Technol. Appl. (2012)
- et al., Dynamic ensemble selection for multi-class imbalanced datasets, Inform. Sci. (2018)
- et al., Intelligent network security monitoring based on optimum-path forest clustering, IEEE Netw. (2019)
- et al., A parallel algorithm for network traffic anomaly detection based on isolation forest, Int. J. Distrib. Sens. Netw. (2018)
- Using statistical anomaly detection models to find clinical decision support malfunctions, J. Am. Med. Inform. Assoc.
- Unsupervised identification of disease marker candidates in retinal OCT imaging data, IEEE Trans. Med. Imaging
- Security evaluation of a banking fraud analysis system, ACM Trans. Priv. Secur.
- Improving the security and QoE in mobile devices through an intelligent and adaptive continuous authentication system, Sensors
- Anomaly detection, ACM Comput. Surv.
- The Nature of Statistical Learning Theory
- LOF, vol. 29
- A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data, PLOS ONE
- Efficient Algorithms for Mining Outliers from Large Data Sets, vol. 29
- Enhancing effectiveness of outlier detections for low density patterns
☆ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have an impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.105191.