Robust outlier detection based on the changing rate of directed density ratio

https://doi.org/10.1016/j.eswa.2022.117988

Highlights

  • Propose an outlierness measure – directed density ratio – based on nearest neighbors.

  • Propose an outlier detector named DCROD based on the directed density ratio.

  • DCROD is robust to changes in the parameter k and in the data distribution.

Abstract

The task of outlier detection aims at mining abnormal objects that deviate from the normal distribution. Traditional unsupervised outlier detection methods can detect most global outliers, but they perform well only on relatively simple data distributions. Although methods based on k-nearest neighbors can fit more complex data distributions, they either struggle to detect local outliers or are easily influenced by data manifolds. Moreover, the detection performance of most k-nearest-neighbor-based methods is strongly affected by the parameter k. We propose a robust outlier detection method based on the changing rate of the directed density ratio. The local density of each sample is calculated by combining kernel density estimation with an extended neighbor set that contains both the k-nearest neighbors and the reverse k-nearest neighbors. We then define the directed density ratio of a sample based on the density ratio and the vector between the sample and each of its neighbors; this quantity estimates local information more reliably under different local densities and data manifolds. Finally, as the neighborhood size increases, the change of the directed density ratio is calculated and accumulated into the outlier score. Experiments are carried out on 12 synthetic datasets that simulate different data distributions and on 22 public datasets. The results show that, compared with several state-of-the-art methods, the proposed method achieves better outlier detection performance under different data distributions and is more robust when the parameter k changes.

Introduction

Outliers refer to objects that deviate so significantly from the normal patterns as to arouse suspicion that they were generated by a different mechanism, according to the definition by Atkinson and Hawkins (1981). On the one hand, outliers are more important and interesting than normal samples precisely because they may be generated by different mechanisms. On the other hand, the presence of outliers may distort the results of statistical analyses of the data, such as clustering. The detection of outliers is therefore a basic and important task in data mining. Outlier detection has a wide range of real-world applications, such as detecting fraud in bank account transactions (Aggarwal, 2017, Domingues et al., 2018), assisting doctors in medical diagnosis (Aggarwal, 2017, Domingues et al., 2018), and fault detection in Wireless Sensor Networks and the Internet of Things (Bhatti et al., 2020, Safaei et al., 2020).

Outlier detection methods can be divided into three categories according to whether the data are labeled: supervised, semi-supervised, and unsupervised methods (Boukerche et al., 2020). Supervised methods learn classification models from labeled data and use them to detect outliers; they require labeled training samples, which are often difficult or time-consuming to obtain in real-world tasks. Semi-supervised methods build models from partially labeled samples; their challenge is how to build effective outlier detection models from such limited information. Unsupervised methods require no labels and instead detect outliers that deviate from the normal distribution by learning the underlying distribution of the data. Compared with supervised and semi-supervised methods, research on unsupervised outlier detection is more challenging; because unlabeled detection tasks are ubiquitous in the real world, it is also more meaningful, and most recent outlier detection research focuses on the unsupervised setting.

The earliest unsupervised outlier detection methods are mostly statistical, such as the 3σ criterion, which assumes that the samples follow a Gaussian distribution in each dimension and treats objects deviating from the mean by more than three standard deviations as outliers. HBOS (Histogram-Based Outlier Score) (Goldstein & Dengel, 2012) builds a histogram for each dimension of the data for density estimation and sums the per-dimension scores to detect outliers. COPOD (COPula-based Outlier Detection) (Li et al., 2020) detects outliers by estimating the tail probability of the distribution through a copula function. Methods based on the idea of isolation, such as IForest (Isolation Forest) (Liu et al., 2008), detect outliers by measuring how easily a sample can be isolated. Methods based on the idea of reconstruction, such as PCA (Principal Component Analysis) (Shyu et al., 2003), first project and reconstruct the data and then estimate the outlier degree from each sample's reconstruction error. Clustering-based methods, such as CBLOF (Cluster-Based Local Outlier Factor) (He et al., 2003), detect outliers through the clustering process: objects that do not belong to any cluster are flagged as outliers. Most of the above methods require few or no hyperparameters and perform well under a single normal pattern. However, their detection performance is limited when normal samples form multiple distribution patterns or when local outliers are present.
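As a concrete illustration of the statistical family, the following is a minimal Python sketch of the 3σ criterion described above, assuming Gaussian marginals in each dimension; the function name and toy data are ours, not drawn from the paper.

    import numpy as np

    def three_sigma_outliers(X: np.ndarray) -> np.ndarray:
        """Flag a sample if any feature deviates from its column mean
        by more than three standard deviations (Gaussian assumption)."""
        z = np.abs(X - X.mean(axis=0)) / X.std(axis=0)  # per-dimension z-scores
        return (z > 3).any(axis=1)                      # True = suspected outlier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))
    X[0] = [8.0, 8.0]                                   # inject one global outlier
    print(np.flatnonzero(three_sigma_outliers(X)))      # typically prints [0]

A method like this finds the injected global outlier easily but, as noted above, it cannot flag a local outlier whose features look unremarkable in every single dimension.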

Another kind of unsupervised outlier detection method is based on the idea of k-nearest neighbors: it obtains the local information of each sample by finding its nearest neighbors in the feature space and measures the outlier degree of the sample on that basis. Because these methods measure the local information around each sample, they are less affected by changes in the overall sample distribution and can adapt to more complex data distributions. Most of them share a standard hyperparameter, the neighborhood size k, such as the classical KNN (K-Nearest Neighbors) (Ramaswamy et al., 2000) and LOF (Local Outlier Factor) (Breunig et al., 2000) methods. The parameter k is usually specified manually, and it is often difficult to choose an appropriate value for different tasks and datasets. Unfortunately, most methods with this parameter are sensitive to its choice, and their performance varies greatly across k values and datasets. Recently, a gravitation-based method (Xie et al., 2020) was proposed to reduce the influence of k on detection performance through vectors and changing rates. However, such methods still lose detection accuracy when the data form different manifolds and different density distributions, which will be described in detail in Section 2.
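For reference, a hedged sketch of the two classical detectors named above, using scikit-learn; k is the hyperparameter whose sensitivity is discussed in the text.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor

    def knn_outlier_scores(X: np.ndarray, k: int = 10) -> np.ndarray:
        """KNN score (Ramaswamy et al., 2000): distance to the k-th neighbor."""
        dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
        return dists[:, -1]          # column 0 is the sample itself at distance 0

    def lof_outlier_scores(X: np.ndarray, k: int = 10) -> np.ndarray:
        """LOF score (Breunig et al., 2000); larger values indicate outliers."""
        lof = LocalOutlierFactor(n_neighbors=k)
        lof.fit(X)
        return -lof.negative_outlier_factor_

Rerunning either function with several values of k on the same dataset makes the sensitivity problem visible: the ranking of borderline samples can change substantially from one k to the next.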

To solve the problems mentioned above, this paper proposes a robust outlier detection method based on the changing rate of the directed density ratio. The robustness is reflected in two aspects. First, the method adapts to diverse and complex data distributions; that is, it effectively detects outliers when the data form different patterns and manifolds or contain local outliers. Second, it is insensitive to the parameter setting; that is, it achieves satisfactory results regardless of the value to which k is set in practice. The main contributions of the proposed method are as follows:

  • (1)

    The definition of the directed density ratio is proposed. First, the local kernel density of a sample is calculated based on kernel density estimation over an extended neighbor set that contains the k-nearest neighbors and the reverse k-nearest neighbors. The density ratio is then defined as the ratio of the density of each neighbor of a sample to the density of the sample itself. On this basis, we multiply the density ratio by the vector from the sample to the corresponding neighbor to obtain the directed density ratio. By combining density, distance and direction information, the directed density ratio can effectively reflect both the local and the global outlierness of samples under different distributions.

  • (2)

    An outlier detector based on the changing rate of the directed density ratio is proposed. As the neighborhood size grows, the sum of the directed density ratios is calculated for each size, and the change of this sum between consecutive sizes is computed. These changes are accumulated until the neighborhood size reaches the preset parameter k, and the accumulated value is used as the final outlier score of the sample. Estimating the outlier degree from the change, rather than from the raw value at a single k, alleviates the influence of the parameter on detection performance. A minimal code sketch of both contributions is given after this list.
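The sketch below is our reading of the two contributions, not the paper's reference implementation. The Gaussian bandwidth h, the unit normalization of the direction vectors, and the use of absolute differences between consecutive neighborhood sizes as the "change" are assumptions on our part; the exact definitions are given in Section 3.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def dcrod_scores(X: np.ndarray, k_max: int = 20, h: float = 1.0) -> np.ndarray:
        n = len(X)
        _, idx = NearestNeighbors(n_neighbors=k_max + 1).fit(X).kneighbors(X)
        idx = idx[:, 1:]                      # drop the self-neighbor in column 0

        # Extended neighbor set: k_max-nearest neighbors plus reverse neighbors.
        ext = [set(row) for row in idx]
        for i in range(n):
            for j in idx[i]:
                ext[j].add(i)                 # i is a reverse neighbor of j

        # Local kernel density over the extended neighbor set (Gaussian kernel).
        den = np.empty(n)
        for i in range(n):
            neigh = np.fromiter(ext[i], dtype=int)
            d = np.linalg.norm(X[neigh] - X[i], axis=1)
            den[i] = np.mean(np.exp(-0.5 * (d / h) ** 2))

        # Sum of directed density ratios per neighborhood size; accumulate change.
        scores = np.zeros(n)
        prev = np.zeros(n)
        for k in range(1, k_max + 1):
            cur = np.empty(n)
            for i in range(n):
                neigh = idx[i, :k]
                v = X[neigh] - X[i]                         # vectors to neighbors
                norms = np.linalg.norm(v, axis=1, keepdims=True)
                v = v / np.where(norms == 0.0, 1.0, norms)  # unit directions (assumed)
                ratios = den[neigh] / den[i]                # density ratios
                cur[i] = np.linalg.norm((ratios[:, None] * v).sum(axis=0))
            if k > 1:
                scores += np.abs(cur - prev)                # changing rate, accumulated
            prev = cur
        return scores                                       # larger = more outlying

Intuitively, for a sample deep inside a cluster the direction vectors point every which way and largely cancel, so the summed directed density ratio stays small at every k and its accumulated change is small; for an outlier, the neighbors all lie on one side, the vectors align, and both the sum and its change are large.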

The remainder of this paper is organized as follows. Section 2 introduces previous works related to this paper. Section 3 describes the proposed method in detail. Section 4 presents the comparative experiments and result analysis. Finally, Section 5 concludes the article.

Section snippets

Related works

Based on the characteristics of the method proposed in this paper, this section introduces k-nearest-neighbor-based methods for unsupervised outlier detection. These methods can be roughly divided into distance-based methods and density-based methods, whose advantages and disadvantages are introduced in turn.

The proposed method

The proposed method, DCROD (Directed density ratio Changing Rate-based Outlier Detection), consists of three parts: (1) calculating the local kernel density of samples; (2) defining the directed density ratio; (3) building the outlier detector based on the changing rate of the directed density ratio.
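As a hypothetical usage of the dcrod_scores sketch given in the Introduction (again, our sketch rather than the paper's implementation), the snippet below scores a toy two-cluster dataset; the injected points occupy indices 200–204 and typically receive the highest scores.

    import numpy as np

    rng = np.random.default_rng(42)
    dense = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))   # tight cluster
    sparse = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(100, 2))  # loose cluster
    injected = rng.uniform(low=-3.0, high=8.0, size=(5, 2))        # scattered points
    X = np.vstack([dense, sparse, injected])

    scores = dcrod_scores(X, k_max=15)   # from the sketch in the Introduction
    print(np.argsort(scores)[-5:])       # indices of the five highest-scoring samples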

Datasets

First of all, in order to better analyze the outlier detection performance of the proposed method under different data distributions, we used twelve 2-D synthetic datasets reflecting different distributions. The number of samples in each dataset is shown in Table 2, and the specific distributions are shown in Fig. 3.
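The generators for the twelve synthetic datasets are not reproduced in these snippets. As an assumption about the experimental setup, the sketch below builds one dataset of the same flavor (two Gaussian clusters of different density plus scattered uniform outliers) and evaluates a stand-in detector with ROC AUC, a standard metric for ranking-based outlier scores.

    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    dense = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(150, 2))
    sparse = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(150, 2))
    noise = rng.uniform(low=-2.0, high=6.0, size=(15, 2))
    X = np.vstack([dense, sparse, noise])
    y = np.r_[np.zeros(300), np.ones(15)]     # 1 marks an injected outlier

    lof = LocalOutlierFactor(n_neighbors=10)
    lof.fit(X)
    print(roc_auc_score(y, -lof.negative_outlier_factor_))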

In addition, we also used 22 public real-world outlier detection datasets for experiments, as shown in Table 3. The sample size of the datasets ranges from 80 to 60,632; the percentage …

Conclusion

This paper presents a robust outlier detection method based on the changing rate of the directed density ratio. The local density of each sample is calculated by kernel density estimation over an extended neighbor set, and the density ratio between the sample and each of its neighbors is multiplied by the corresponding vector to obtain the directed density ratio, which effectively measures the outlier degree of a sample under different local densities and distribution …

CRediT authorship contribution statement

Kangsheng Li: Conceptualization, Methodology, Formal analysis, Software, Validation, Writing – original draft, Writing – review & editing. Xin Gao: Conceptualization, Methodology, Resources, Supervision, Funding acquisition, Writing – original draft, Writing – review & editing. Shiyuan Fu: Software, Validation. Xinping Diao: Software, Validation. Ping Ye: Writing – review & editing. Bing Xue: Writing – review & editing. Jiahao Yu: Software, Validation. Zijian Huang: Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (25)

  • Breunig, M. M., et al. (2000). LOF: identifying density-based local outliers. ACM SIGMOD Record.

  • Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd…