
Knowledge-Based Systems

Volume 92, 15 January 2016, Pages 71-77

A non-parameter outlier detection algorithm based on Natural Neighbor

https://doi.org/10.1016/j.knosys.2015.10.014

Abstract

Outlier detection is an important task in data mining with numerous applications, including credit card fraud detection and video surveillance. Although many outlier detection algorithms have been proposed, most of them face a serious problem: it is very difficult to select an appropriate parameter when they are run on a dataset. In this paper we use the method of Natural Neighbor to adaptively obtain this parameter, named the Natural Value. We also propose a novel notion, the Natural Outlier Factor (NOF), to measure outliers, and provide an algorithm based on Natural Neighbor (NaN) that does not require any parameters to compute the NOF of the objects in the database. The formal analysis and experiments show that this method can achieve good performance in outlier detection.

Introduction

Outlier detection is an important data mining activity with numerous applications, including credit card fraud detection, discovery of criminal activities in electronic commerce, video surveillance, weather prediction, and pharmaceutical research [1], [2], [3], [4], [5], [6], [7], [8], [9].

An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism [8]. At present, research on outlier detection is very active, and many outlier detection algorithms have been proposed. These algorithms can be roughly divided into distribution-based, depth-based, distance-based, clustering-based and density-based methods.

In distribution-based methods, the observations that deviate from a standard distribution are considered outliers [7]. However, distribution-based methods are not applicable to datasets that are multidimensional or whose distribution is unknown. The depth-based methods [10], [11] alleviate this problem. Depth-based methods rely on the computation of different layers of k-d convex hulls; outliers are objects in the outer layers of these hulls. However, the efficiency of depth-based algorithms is low on datasets with four or more dimensions. In clustering-based methods, outliers are by-products of clustering, as in DBSCAN [12], CLARANS [13], CHAMELEON [14], BIRCH [15], and CURE [16]. But the goal of clustering-based methods is to find clusters, not to detect outliers, so their efficiency at detecting outliers is also low.

Distance-based algorithms are widely used for their effectiveness and simplicity. In paper [4], a distance-based outlier is described as an object such that at least pct% of the objects in the database lie at a distance greater than dmin from it. However, since distance-based algorithms do not take into account changes of local density, they can only detect global outliers and fail to detect local outliers.
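The distance-based notion above can be sketched in a few lines of NumPy. This is an illustrative reading of the DB(pct, dmin) definition from [4], not the original authors' code; the function and parameter names are ours.

```python
import numpy as np

def distance_based_outliers(X, pct, dmin):
    """Flag DB(pct, dmin)-outliers: an object is an outlier if at
    least a fraction pct of the remaining objects lie more than
    dmin away from it (names are illustrative)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # pairwise Euclidean distances
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    far = (d > dmin).sum(axis=1)  # self-distance is 0, never counted
    return (far >= pct * (n - 1)).tolist()
```

With a fixed dmin the test is global: a point in a tight cluster next to a sparse cluster is never flagged, which is exactly the local-outlier weakness noted above.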

Local outliers have received much attention recently, and density-based methods address this problem well; many density-based outlier detection algorithms have been proposed. In paper [17], the authors define the local outlier factor (LOF), a measure of the degree to which an object's density differs from that of its neighborhood objects. Paper [18] improved on LOF and proposed an outlier detection algorithm whose outlier degree, the influenced outlierness (INFLO), is computed by considering both neighbors and reverse neighbors. This results in meaningful outlier detection.
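A minimal LOF computation, assuming the standard definition from [17] (k-distance, reachability distance, local reachability density), can be sketched with plain NumPy; this is our illustration, not the paper's implementation, and it ignores tie handling in the k-nearest-neighbor sets.

```python
import numpy as np

def lof_scores(X, k):
    """LOF sketch: LOF(p) = mean lrd of p's k-NN / lrd(p),
    where lrd(p) = 1 / mean reachability distance to p's k-NN."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    knn = np.argsort(d, axis=1)[:, 1:k + 1]      # k nearest neighbors, self excluded
    kdist = d[np.arange(n), knn[:, -1]]          # k-distance of each point
    # reachability distance: reach(p, o) = max(kdist(o), d(p, o))
    reach = np.maximum(kdist[knn], d[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)               # local reachability density
    return lrd[knn].mean(axis=1) / lrd           # LOF per object
```

An inlier deep inside a cluster scores close to 1, while a point far from its neighborhood scores much higher; the quality of the result, however, hinges entirely on choosing k well, which is the problem this paper targets.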

Given this motivation, the above analysis shows that although density-based methods handle local outliers well, they face the same parameter-selection problem as the other four categories of methods. Almost none of these algorithms can effectively detect outliers without an appropriate parameter. In other words, most of these algorithms depend heavily on the parameter: once the parameter changes, the detection result can differ markedly. The selection of the parameter is therefore very important for an outlier detection algorithm. In practice, however, determining the parameter depends on the researcher's experience and a great deal of experimentation. For example, it is difficult to select an appropriate value for the parameter k, the number of neighbors, when using LOF or INFLO to detect outliers in a database.

A more detailed analysis of the problems with existing approaches can be found in paper [19]. Paper [19] also proposes a new outlier detection algorithm (INS) using the instability factor. INS is insensitive to the parameter k when the value of k is large, as shown in Fig. 7(c). However, the cost is that the accuracy is low once it stabilizes. Moreover, INS can hardly find a single parameter value that detects local and global outliers simultaneously: when the value of k is good for detecting global outliers, performance on local outliers is poor, and vice versa.

In this paper, to solve the above problem, we first introduce a novel concept of neighbor, named the Natural Neighbor (NaN), and its search algorithm (NaN-Searching). We then obtain the number of neighbors, i.e. the value of the parameter k, using the NaN-Searching algorithm. We also define the new concepts of Natural Influence Space (NIS) and Natural Neighbor Graph (NNG), and compute the Natural Outlier Factor (NOF). The larger the value of NOF, the greater the probability that the object is an outlier.

The paper is organized as follows. Section 2 presents existing definitions and our motivation. Section 3 introduces the properties of the Natural Neighbor. Section 4 proposes an outlier detection algorithm based on the Natural Neighbor. Section 5 presents a performance evaluation and analyzes the results. Section 6 concludes the paper.


Related work

In this section, we briefly introduce the concepts of LOF and INS. LOF is a well-known density-based outlier detection algorithm, and INS is a novel outlier detection algorithm proposed in 2014. Interested readers are referred to papers [17] and [19].

Let D be a database, p, q, and o be some objects in D, and k be a positive integer. We use d(p,q) to denote the Euclidean distance between objects p and q.

Definition 1

k-distance and k-nearest neighborhood of p

The k-distance of p, denoted kdist(p), is the distance d(p,o) between p and an object o in D such that: (i) for at least k objects o′ ∈ D∖{p}, d(p,o′) ≤ d(p,o); and (ii) for at most k−1 objects o′ ∈ D∖{p}, d(p,o′) < d(p,o). The k-nearest neighborhood of p, NNk(p), contains every object of D∖{p} whose distance from p is not greater than kdist(p).
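Ignoring ties at exactly the k-th distance, the definition can be sketched directly (an illustrative NumPy reading, with our own function names):

```python
import numpy as np

def k_distance_and_knn(X, k):
    """Compute kdist(p) and the k-nearest neighborhood of every
    object p, following the definition above (ties ignored)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    order = np.argsort(d, axis=1)[:, 1:]      # neighbors by distance, self excluded
    kdist = d[np.arange(n), order[:, k - 1]]  # distance to the k-th neighbor
    knn = [order[i, :k].tolist() for i in range(n)]
    return kdist, knn
```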

NaN definition and algorithm

The Natural Neighbor is a new concept of neighbor that originates in knowledge of objective reality. The number of a person's real friends should be the number of people whom he or she takes as friends and who take him or her as a friend at the same time. Analogously, for data objects, object y is a Natural Neighbor of object x if x considers y to be a neighbor and y considers x to be a neighbor at the same time. In particular, data points lying in a sparse region should have fewer Natural Neighbors than those lying in a dense region.
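The mutual-neighbor relation underlying this idea can be sketched for a fixed neighborhood size r. Note this fixes r purely for illustration: the paper's NaN-Searching algorithm instead grows r adaptively until the relation stabilizes, yielding the Natural Value, and this sketch is not that algorithm.

```python
import numpy as np

def mutual_neighbors(X, r):
    """y is a mutual (Natural-style) neighbor of x iff each point
    lies in the other's r-nearest-neighbor list."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    rnn = np.argsort(d, axis=1)[:, 1:r + 1]    # r nearest neighbors, self excluded
    member = np.zeros((n, n), dtype=bool)
    member[np.arange(n)[:, None], rnn] = True  # member[x, y]: y is in rNN(x)
    mutual = member & member.T                 # both directions must hold
    return [np.flatnonzero(mutual[i]).tolist() for i in range(n)]
```

Even in this simplified form, an isolated point typically ends up with an empty mutual-neighbor list, which is the intuition the NOF builds on.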

Metrics for measurement

For performance evaluation of the algorithms, we use two metrics, Accuracy and RankPower [23], to evaluate the detection results. Let N be the number of true outliers contained in dataset D, and let M be the number of true outliers detected by an algorithm. In the experiments, we report the N most suspicious instances. The Accuracy (Acc) is then given by: Acc = M/N.

If, under a given detection method, the true outliers occupy the top positions relative to the non-outliers among the N reported instances, RankPower attains its maximum value.
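Both metrics can be computed from the ranked output. The RankPower formula below, RP = m(m+1) / (2·ΣRi) over the ranks Ri of the m true outliers found, is the commonly used form of the metric in [23]; it is an assumption here, since this section only states Acc explicitly.

```python
def accuracy_and_rank_power(ranked_ids, true_outliers):
    """Accuracy and RankPower for an outlier ranking.
    ranked_ids lists instances most-suspicious first; only the
    top N = len(true_outliers) positions are scored."""
    n = len(true_outliers)
    top = ranked_ids[:n]
    ranks = [i + 1 for i, x in enumerate(top) if x in true_outliers]
    m = len(ranks)
    acc = m / n
    rank_power = m * (m + 1) / (2 * sum(ranks)) if ranks else 0.0
    return acc, rank_power
```

RankPower equals 1 exactly when the detected true outliers sit at the very top of the ranking, and decreases as they slide toward lower positions.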

Conclusions and further study

In this study, we propose a new density-based algorithm for outlier detection. The proposed method combines the concept of the Natural Neighbor with previous density-based methods. As in most previous outlier detection methods, an object with a high outlierness score is a promising candidate outlier. But unlike most previous approaches, our method is non-parametric: we use Algorithm 1 to adaptively obtain the value of k, named the Natural Value.

Acknowledgment

This research was supported by the National Natural Science Foundation of China (Nos. 61272194 and 61073058).

References (24)

  • I. Ruts et al., Computing depth contours of bivariate point clouds, Comput. Stat. Data Anal. (1996)
  • J. Ha et al., Robust outlier detection using the instability factor, Knowledge-Based Syst. (2014)
  • X. Luo, Boosting the K-nearest-neighborhood based incremental collaborative filtering, Knowledge-Based Syst. (2013)
  • W. Jin et al., Mining top-n local outliers in large databases, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2001)
  • J. Han et al., Data Mining: Concepts and Techniques (2011)
  • T. Pang-Ning et al., Introduction to Data Mining (2006)
  • E.M. Knox et al., Algorithms for mining distance-based outliers in large datasets, Proceedings of the International Conference on Very Large Data Bases (1998)
  • E.M. Knorr et al., A unified notion of outliers: properties and computation, Proceedings of the International Conference on Knowledge Discovery and Data Mining, KDD (1997)
  • E.M. Knorr et al., Distance-based outliers: algorithms and applications, VLDB J. – Int. J. Very Large Data Bases (2000)
  • V. Barnett et al., Outliers in Statistical Data (1994)
  • D.M. Hawkins, Identification of Outliers (1980)
  • S. Shekhar et al., A Tour of Spatial Databases (2002)