
Knowledge-Based Systems

Volume 92, 15 January 2016, Pages 71-77

A non-parameter outlier detection algorithm based on Natural Neighbor

https://doi.org/10.1016/j.knosys.2015.10.014

Abstract

Outlier detection is an important task in data mining with numerous applications, including credit card fraud detection and video surveillance. Although many outlier detection algorithms have been proposed, most of them face a serious problem: it is very difficult to select an appropriate parameter when they are run on a dataset. In this paper we use the method of Natural Neighbor to adaptively obtain this parameter, named the Natural Value. We also propose a novel notion, the Natural Outlier Factor (NOF), to measure outliers, and provide an algorithm based on Natural Neighbor (NaN) that does not require any parameters to compute the NOF of the objects in the database. The formal analysis and experiments show that this method can achieve good performance in outlier detection.

Introduction

Outlier detection is an important data mining activity with numerous applications, including credit card fraud detection, discovery of criminal activities in electronic commerce, video surveillance, weather prediction, and pharmaceutical research [1], [2], [3], [4], [5], [6], [7], [8], [9].

An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism [8]. At present, research on outlier detection is very active, and many outlier detection algorithms have been proposed. These algorithms can be roughly divided into distribution-based, depth-based, distance-based, clustering-based and density-based methods.

In distribution-based methods, the observations that deviate from a standard distribution are considered outliers [7]. However, distribution-based methods are not applicable to datasets that are multidimensional or whose distribution is unknown. The depth-based methods [10], [11] alleviate this problem. Depth-based methods rely on the computation of different layers of k-d convex hulls; outliers are objects in the outer layers of these hulls. However, the efficiency of depth-based algorithms is low on datasets with four or more dimensions. In clustering-based methods, outliers are by-products of clustering, as in DBSCAN [12], CLARANS [13], CHAMELEON [14], BIRCH [15], and CURE [16]. But the goal of clustering-based methods is to find clusters, not to detect outliers, so their efficiency at detecting outliers is also low.

Distance-based algorithms are widely used for their effectiveness and simplicity. In paper [4], a distance-based outlier is described as an object such that at least pct% of the objects in the database lie at a distance greater than dmin from it. However, since distance-based algorithms do not take into account changes of local density, they can only detect global outliers and fail to detect local outliers.
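The distance-based notion above can be sketched in a few lines of NumPy. This is an illustrative reading of the DB(pct, dmin) definition from [4], not the original authors' code; the function and parameter names are ours.

```python
import numpy as np

def distance_based_outliers(X, pct, dmin):
    """Flag DB(pct, dmin)-outliers: an object is an outlier if at
    least a fraction pct of the remaining objects lie more than
    dmin away from it (names are illustrative)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # pairwise Euclidean distances
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    far = (d > dmin).sum(axis=1)  # self-distance is 0, never counted
    return (far >= pct * (n - 1)).tolist()
```

With a fixed dmin the test is global: a point in a tight cluster next to a sparse cluster is never flagged, which is exactly the local-outlier weakness noted above.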

Local outliers have received much attention recently, and density-based methods address this problem well; many density-based outlier detection algorithms have been proposed. In paper [17], the authors define the local outlier factor (LOF), a measure of the degree to which an object's density differs from that of its neighborhood objects. Paper [18] improved on LOF and proposed an outlier detection algorithm whose outlier degree, the influenced outlierness (INFLO), is computed by considering both neighbors and reverse neighbors. This results in meaningful outlier detection.
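A minimal LOF computation, assuming the standard definition from [17] (k-distance, reachability distance, local reachability density), can be sketched with plain NumPy; this is our illustration, not the paper's implementation, and it ignores tie handling in the k-nearest-neighbor sets.

```python
import numpy as np

def lof_scores(X, k):
    """LOF sketch: LOF(p) = mean lrd of p's k-NN / lrd(p),
    where lrd(p) = 1 / mean reachability distance to p's k-NN."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    knn = np.argsort(d, axis=1)[:, 1:k + 1]      # k nearest neighbors, self excluded
    kdist = d[np.arange(n), knn[:, -1]]          # k-distance of each point
    # reachability distance: reach(p, o) = max(kdist(o), d(p, o))
    reach = np.maximum(kdist[knn], d[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)               # local reachability density
    return lrd[knn].mean(axis=1) / lrd           # LOF per object
```

An inlier deep inside a cluster scores close to 1, while a point far from its neighborhood scores much higher; the quality of the result, however, hinges entirely on choosing k well, which is the problem this paper targets.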

Given this motivation, the above analysis shows that although density-based methods handle local outliers well, they face the same parameter-selection problem as the other four categories of methods. Almost none of these algorithms can effectively detect outliers without an appropriate parameter. In other words, most of these algorithms depend heavily on the parameter: once the parameter changes, the detection result can differ markedly. The selection of the parameter is therefore very important for an outlier detection algorithm. In practice, however, determining the parameter depends on the researcher's experience and a great deal of experimentation. For example, it is difficult to select an appropriate value for the parameter k, the number of neighbors, when using LOF or INFLO to detect outliers in a database.

A more detailed analysis of the problems with existing approaches can be found in paper [19]. Paper [19] also proposes a new outlier detection algorithm (INS) using the instability factor. INS is insensitive to the parameter k when the value of k is large, as shown in Fig. 7(c). However, the cost is that the accuracy is low once it stabilizes. Moreover, INS can hardly find a single parameter value that detects local and global outliers simultaneously: when the value of k is good for detecting global outliers, performance on local outliers is poor, and vice versa.

In this paper, to solve the above problem, we first introduce a novel concept of neighbor, named the Natural Neighbor (NaN), and its search algorithm (NaN-Searching). We then obtain the number of neighbors, i.e. the value of the parameter k, using the NaN-Searching algorithm. We also define the new concepts of Natural Influence Space (NIS) and Natural Neighbor Graph (NNG), and compute the Natural Outlier Factor (NOF). The larger the value of NOF, the greater the probability that the object is an outlier.

The paper is organized as follows. Section 2 presents existing definitions and our motivation. Section 3 introduces the properties of the Natural Neighbor. Section 4 proposes an outlier detection algorithm based on the Natural Neighbor. Section 5 presents a performance evaluation and analyzes the results. Section 6 concludes the paper.


Related work

In this section, we briefly introduce the concepts of LOF and INS. LOF is a well-known density-based outlier detection algorithm, and INS is a novel outlier detection algorithm proposed in 2014. Interested readers are referred to papers [17] and [19].

Let D be a database, p, q, and o be some objects in D, and k be a positive integer. We use d(p,q) to denote the Euclidean distance between objects p and q.

Definition 1

k-distance and k-nearest neighborhood of p

The k-distance of p, denoted kdist(p), is the distance d(p,o) between p and an object o in D such that: (i) for at least k objects o′ ∈ D∖{p}, d(p,o′) ≤ d(p,o); and (ii) for at most k−1 objects o′ ∈ D∖{p}, d(p,o′) < d(p,o). The k-nearest neighborhood of p, NNk(p), contains every object of D∖{p} whose distance from p is not greater than kdist(p).
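Ignoring ties at exactly the k-th distance, the definition can be sketched directly (an illustrative NumPy reading, with our own function names):

```python
import numpy as np

def k_distance_and_knn(X, k):
    """Compute kdist(p) and the k-nearest neighborhood of every
    object p, following the definition above (ties ignored)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    order = np.argsort(d, axis=1)[:, 1:]      # neighbors by distance, self excluded
    kdist = d[np.arange(n), order[:, k - 1]]  # distance to the k-th neighbor
    knn = [order[i, :k].tolist() for i in range(n)]
    return kdist, knn
```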

NaN definition and algorithm

The Natural Neighbor is a new concept of neighbor that originates in knowledge of objective reality. The number of a person's real friends should be the number of people whom he or she takes as friends and who take him or her as a friend at the same time. Analogously, for data objects, object y is a Natural Neighbor of object x if x considers y to be a neighbor and y considers x to be a neighbor at the same time. In particular, data points lying in a sparse region should have fewer Natural Neighbors than those lying in a dense region.
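The mutual-neighbor relation underlying this idea can be sketched for a fixed neighborhood size r. Note this fixes r purely for illustration: the paper's NaN-Searching algorithm instead grows r adaptively until the relation stabilizes, yielding the Natural Value, and this sketch is not that algorithm.

```python
import numpy as np

def mutual_neighbors(X, r):
    """y is a mutual (Natural-style) neighbor of x iff each point
    lies in the other's r-nearest-neighbor list."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    rnn = np.argsort(d, axis=1)[:, 1:r + 1]    # r nearest neighbors, self excluded
    member = np.zeros((n, n), dtype=bool)
    member[np.arange(n)[:, None], rnn] = True  # member[x, y]: y is in rNN(x)
    mutual = member & member.T                 # both directions must hold
    return [np.flatnonzero(mutual[i]).tolist() for i in range(n)]
```

Even in this simplified form, an isolated point typically ends up with an empty mutual-neighbor list, which is the intuition the NOF builds on.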

Metrics for measurement

For performance evaluation of the algorithms, we use two metrics, Accuracy and RankPower [23], to evaluate the detection results. Let N be the number of true outliers contained in dataset D, and let M be the number of true outliers detected by an algorithm. In the experiments, we report the N most suspicious instances. The Accuracy (Acc) is then given by: Acc = M/N.

If, under a given detection method, the true outliers occupy the top positions relative to the non-outliers among the N reported instances, RankPower attains its maximum value.
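Both metrics can be computed from the ranked output. The RankPower formula below, RP = m(m+1) / (2·ΣRi) over the ranks Ri of the m true outliers found, is the commonly used form of the metric in [23]; it is an assumption here, since this section only states Acc explicitly.

```python
def accuracy_and_rank_power(ranked_ids, true_outliers):
    """Accuracy and RankPower for an outlier ranking.
    ranked_ids lists instances most-suspicious first; only the
    top N = len(true_outliers) positions are scored."""
    n = len(true_outliers)
    top = ranked_ids[:n]
    ranks = [i + 1 for i, x in enumerate(top) if x in true_outliers]
    m = len(ranks)
    acc = m / n
    rank_power = m * (m + 1) / (2 * sum(ranks)) if ranks else 0.0
    return acc, rank_power
```

RankPower equals 1 exactly when the detected true outliers sit at the very top of the ranking, and decreases as they slide toward lower positions.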

Conclusions and further study

In this study, we propose a new density-based algorithm for outlier detection. The proposed method combines the concept of the Natural Neighbor with previous density-based methods. As in most previous outlier detection methods, an object with a high outlierness score is a promising candidate outlier. But unlike most previous approaches, our method is non-parametric: we use Algorithm 1 to adaptively obtain the value of k, named the Natural Value.

Acknowledgment

This research was supported by the National Natural Science Foundation of China (Nos. 61272194 and 61073058).

References (24)

  • I. Ruts et al., Computing depth contours of bivariate point clouds, Comput. Stat. Data Anal. (1996)
  • J. Ha et al., Robust outlier detection using the instability factor, Knowledge-Based Syst. (2014)
  • X. Luo, Boosting the K-nearest-neighborhood based incremental collaborative filtering, Knowledge-Based Syst. (2013)
  • W. Jin et al., Mining top-n local outliers in large databases, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2001)
  • J. Han et al., Data Mining: Concepts and Techniques (2011)
  • T. Pang-Ning et al., Introduction to Data Mining (2006)
  • E.M. Knox et al., Algorithms for mining distance-based outliers in large datasets, Proceedings of the International Conference on Very Large Data Bases (1998)
  • E.M. Knorr et al., A unified notion of outliers: properties and computation, Proceedings of the International Conference on Knowledge Discovery and Data Mining, KDD (1997)
  • E.M. Knorr et al., Distance-based outliers: algorithms and applications, VLDB J. – Int. J. Very Large Data Bases (2000)
  • V. Barnett et al., Outliers in Statistical Data (1994)
  • D.M. Hawkins, Identification of Outliers (1980)
  • S. Shekhar et al., A Tour of Spatial Databases (2002)