Hybrid data-driven outlier detection based on neighborhood information entropy and its developmental measures
Introduction
Knowledge discovery in databases (KDD), or data mining, is an important topic in the development of knowledge-based and data-based systems. Knowledge discovery tasks are usually classified into four general categories: (a) dependency detection, (b) class identification, (c) class description, and (d) outlier/exception detection (Knorr & Ng, 1998). Most KDD tasks model the majority of the data (e.g., traditional pattern recognition aims to construct a general pattern that fits most of the data); in contrast, outlier detection aims to find the rare data whose behavior is highly exceptional compared with the large remainder of the data. An outlier (also known as an anomaly) is a data point that deviates significantly from the other objects in a data set (Hawkins, 1980); it usually arises from a new perspective or a specific mechanism, which makes it more interesting than normal instances in knowledge discovery. Because of this distinctive mechanism and the valuable information it carries, the outlier plays an important role in expert and intelligent systems, and outlier detection has been applied extensively in fields including intrusion detection, image processing, medical treatment, and public security (Han, Kamber, Pei, 2011, Knorr, Ng, 1998). Outlier detection methods and their development thus have both theoretical significance and applied value in data mining, and this paper aims to establish a novel detection approach for the hybrid data that commonly exist in practical systems.
Outlier detection involves three traditional kinds of methods, each with its own features and advantages: the statistical method (Rousseeuw & Leroy, 1987), the proximity-based approach (Breunig, Kriegel, Ng, Sander, 2000, Knorr, Ng, 1997, Knorr, Ng, Tucakov, 2000), and the clustering-based method (Jain, 1999). The statistical method assumes that normal data objects are generated by a statistical model, so points that do not obey the model are treated as outliers; this approach applies to data sets with a known distribution and a single attribute. The proximity-based approach was introduced to improve on the statistical method, and it usually adopts one of two basic strategies: distance-based detection (Knorr et al., 2000) or density-based detection (Breunig et al., 2000). The effectiveness of the clustering-based method depends mainly on the clustering technique used.
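As a concrete illustration of the distance-based strategy, the following minimal Python sketch scores each object by the distance to its k-th nearest neighbor. This is one common distance-based score and is not taken from the cited works; the function name `knn_outlier_scores`, the sample points, and the choice k = 2 are assumptions made only for illustration.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor.

    Illustrative helper (hypothetical name): larger scores suggest outliers.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, sorted in ascending order.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Four points on a unit square plus one distant point (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
scores = knn_outlier_scores(points, k=2)
most_outlying = max(range(len(points)), key=scores.__getitem__)
```

Here the distant point receives the largest score. The score depends entirely on a numeric distance, which is why this kind of method does not transfer directly to categorical attributes.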
Most of the above traditional detection methods require some additional information. Detection methods based on rough sets have therefore been studied in depth recently, because they are data-driven and require no additional knowledge. The traditional distance-based method relies on distances between objects, so it suits numeric data better than categorical data, for which no comparable distance relationship exists. To address this issue, rough sets have been introduced into outlier detection to handle categorical data (Berna-Martinez, Ortega, 2015, Chen, Miao, Wang, 2008, Jiang, Chen, 2015, Jiang, Sui, Cao, 2008, Jiang, Sui, Cao, 2009, Jiang, Sui, Cao, 2010, Jiang, Sui, Cao, 2011, Shaari, Bakar, Hamdan, 2009); rough sets originate from the study of intelligent systems characterized by insufficient and incomplete information (Pawlak, 1982, Pawlak, 1991), and they have been successfully applied in machine learning, data mining, pattern recognition, etc. However, classical rough set-based detection methods rely only on the basic equivalence relation, so they apply directly to categorical (or nominal) data but not to numeric data; numeric data can be discretized to fit the rough set approach, but this preprocessing usually increases running time and loses information. Besides purely numeric and purely categorical data, mixed (or hybrid, or heterogeneous) data that combine both are ubiquitous in the real world; outlier detection for such data is clearly needed and challenging, yet few relevant studies have been reported.
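The equivalence-based view and the information loss caused by discretization can be sketched in a few lines of Python. The example below is an illustration only: the helper names `equivalence_classes` and `equal_width_bin`, the toy table, and the equal-width binning scheme are assumptions and do not come from the cited methods.

```python
from collections import defaultdict


def equivalence_classes(table, attributes):
    """Partition object indices by their values on the chosen attributes."""
    classes = defaultdict(list)
    for index, row in enumerate(table):
        key = tuple(row[a] for a in attributes)
        classes[key].append(index)
    return list(classes.values())


def equal_width_bin(value, low, high, bins):
    """Map a numeric value to one of `bins` equal-width intervals on [low, high]."""
    width = (high - low) / bins
    return min(int((value - low) / width), bins - 1)


# Toy hybrid table (made-up values): one categorical and one numeric attribute.
table = [
    {"color": "red", "temp": 36.6},
    {"color": "red", "temp": 36.9},
    {"color": "blue", "temp": 39.9},
]

by_color = equivalence_classes(table, ["color"])

# Discretize the numeric attribute so that the equivalence relation can use it.
for row in table:
    row["temp_bin"] = equal_width_bin(row["temp"], low=35.0, high=41.0, bins=3)

by_color_and_bin = equivalence_classes(table, ["color", "temp_bin"])
```

After binning, the first two objects fall into the same interval and remain indiscernible even though their temperatures differ, which is the kind of information loss that discretization introduces.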
To improve on classical rough sets, neighborhood rough sets adopt a robust neighborhood that accommodates numeric and hybrid data, thus providing a more powerful platform. In early research, neighborhood spaces were regarded as more general topological spaces (Lin, 1988, Lin, 2008), neighborhood approximation properties were revealed (Wu, Zhang, 2002, Yao, 1998), and neighborhood rough sets were used for heterogeneous data reduction (Hu, Liu, Yu, 2008, Hu, Yu, Liu, Wu, 2008, Hu, Yu, Xie, 2006). Neighborhood rough sets have since been applied effectively and in depth to attribute reduction, feature selection, classification, uncertainty reasoning, etc. (Chen, Li, Cai, Luo, Fujita, 2016, Chen, Zhang, Zheng, Ying, Yu, 2017, Kumar, Inbarani, 2016, Liu, Yang, Chen, Tan, Wang, Yan, 2017, Wang, Shao, He, Qian, Qi, 2016). For outlier detection, however, approaches based on neighborhood rough sets have not received enough attention, especially regarding mixed data processing; the relevant neighborhood-based detection works are mainly restricted to numeric data (Chen, Miao, Zhang, 2010, Li, Rao, 2012).
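To make the idea of a neighborhood over hybrid data concrete, the following Python sketch collects the objects that lie within a radius of a given object under a mixed distance. The metric used here (absolute difference for numeric attributes scaled to [0, 1], a 0/1 mismatch for categorical attributes, combined by the maximum) and the fixed radius are assumptions chosen for simplicity; they are not the specific distance or radius used in this paper, and the names `hybrid_distance` and `neighborhood` are hypothetical.

```python
def hybrid_distance(x, y, numeric, categorical):
    """Assumed mixed metric: the maximum per-attribute difference.

    Numeric attributes (assumed pre-scaled to [0, 1]) contribute |x - y|;
    categorical attributes contribute 0 if the values match and 1 otherwise.
    """
    diffs = [abs(x[a] - y[a]) for a in numeric]
    diffs += [0.0 if x[a] == y[a] else 1.0 for a in categorical]
    return max(diffs)


def neighborhood(table, i, radius, numeric, categorical):
    """Indices of all objects within `radius` of object i (including i itself)."""
    return [
        j
        for j, y in enumerate(table)
        if hybrid_distance(table[i], y, numeric, categorical) <= radius
    ]


# Toy hybrid table (made-up values).
table = [
    {"size": 0.10, "shape": "round"},
    {"size": 0.15, "shape": "round"},
    {"size": 0.12, "shape": "square"},
    {"size": 0.90, "shape": "round"},
]

n0 = neighborhood(table, 0, radius=0.1, numeric=["size"], categorical=["shape"])
n3 = neighborhood(table, 3, radius=0.1, numeric=["size"], categorical=["shape"])
```

Under these assumptions, object 0 has object 1 as a neighbor, while object 3 is alone in its neighborhood. Unlike an equivalence class, the neighborhood tolerates small numeric differences without any discretization.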
Against this background, hybrid data-driven outlier detection based on neighborhood rough sets is a valuable and novel topic, and this paper makes a preliminary study of it by constructing information measures. Information entropy, proposed by Shannon (1948), establishes a fundamental mechanism of uncertainty measurement; it has been introduced into classical rough sets to represent uncertainty via multiple entropy forms or information measures (Chen, Zhang, Zheng, Ying, Yu, 2017, Düntsch, Gediga, 1998, Liang, Shi, Li, Wierman, 2006, Liang, Wang, Qian, 2009, Wang, Ma, Yu, 2015, Zhang, Mei, Chen, Li, 2016, Zhang, Miao, 2017), and the neighborhood entropy is discussed in particular by Chen, Wu, Chen, Tang, and Zhu (2014), Chen, Xue, Ma, and Xu (2017), and Li and Rao (2012). In this paper, hybrid data-driven outlier detection is investigated by means of the neighborhood information entropy and its developmental measures, and a corresponding detection algorithm, the neighborhood information entropy-based outlier detection (NIEOD) algorithm, is designed and applied. Concretely, the neighborhood information system is first determined by the heterogeneous distance and self-adapting radius; the neighborhood information entropy and its three in-depth measures are then developed to describe data objects by uncertainty measurement; finally, the neighborhood entropy-based outlier factor (NEOF) is established by integrating these measures, which yields the outlier detection method and the NIEOD algorithm. In experiments on relevant UCI data, the NIEOD algorithm is compared with six main detection algorithms (the NED, IE, SEQ, FindCBLOF, DIS, and KNN algorithms), and the results show that the new method generally has better effectiveness and adaptability.
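For readers unfamiliar with Shannon's measure, the short Python sketch below computes the entropy H = -Σ p log2 p of the empirical distribution of a list of values. It shows only the classical Shannon entropy; it is not the neighborhood information entropy or the NEOF defined in this paper, and the function name `shannon_entropy` is chosen here for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A balanced two-valued attribute is maximally uncertain for two outcomes;
# a constant attribute carries no uncertainty.
balanced = shannon_entropy(["a", "a", "b", "b"])
constant = shannon_entropy(["a", "a", "a", "a"])
```

The balanced list yields one bit of entropy and the constant list yields zero, which matches the reading of entropy as a measure of uncertainty.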
Regarding contributions, the new method extends both the traditional distance-based and rough set-based methods to enrich outlier detection, and it therefore applies to categorical, numeric, and heterogeneous data.
The remainder of this paper is organized as follows. Section 2 reviews the neighborhood information system; Section 3 constructs the core outlier detection method by developing information measures, with three subsections presenting the theoretical method, the specific algorithm, and an illustrative example; Section 4 presents data experiments and analyses on three typical UCI data sets; finally, Section 5 concludes the paper.
Section snippets
Neighborhood information system
Note that neighborhood rough sets have a basic formal background: the neighborhood information system. This fundamental system is reviewed in this section via several references (Hu, Liu, Yu, 2008, Hu, Yu, Liu, Wu, 2008, Hu, Yu, Xie, 2008).
Usually, an information system is the basis of data mining and can be written as a quadruple $(U, A, V, f)$. Herein, the universe $U$ is a nonempty finite set of objects; $A$ is a nonempty finite set of attributes; $V = \bigcup_{a \in A} V_a$ is the union of the attribute domains $V_a$
Hybrid data-driven outlier detection based on neighborhood information entropy and its developmental measures
Based on the neighborhood information (or decision) system, this section develops the neighborhood information entropy and its subsequent measures to implement outlier detection step by step; its three main parts present the theoretical method, the specific algorithm, and an illustrative example.
UCI data experiments and analyses
This section carries out UCI data experiments and analyses to verify the effectiveness of the proposed outlier detection method, especially the NIEOD algorithm (Algorithm 2).
In the experiments, three UCI data sets (Bay, 1999) are chosen, i.e., the Annealing data set (with hybrid attributes), the Lymphography data set (with mostly categorical attributes), and the Wisconsin Breast Cancer data set (with numeric attributes), and the NIEOD algorithm is compared to six main ways of outlier
Conclusion
Outlier detection has extensive applications in expert and intelligent systems. However, the traditional distance-based detection method does not apply effectively to categorical data, while the classical rough set-based method cannot effectively handle numeric data, let alone mixed data. Using the neighborhood information system, which is driven by hybrid data, this paper studies outlier detection based on the neighborhood information entropy and its developmental measures. The traditional
Acknowledgments
The authors thank all of the editors and reviewers for their valuable suggestions, which have substantially improved this paper.
This work was supported by the National Natural Science Foundation of China (61673285 and 61203285), the Sichuan Youth Science & Technology Foundation of China (2017JQ0046), and the Scientific Research Project of the Sichuan Provincial Education Department of China (15ZB0029).
References (51)
- Parallel attribute reduction in dominance-based neighborhood rough set. Information Sciences (2016)
- Neighborhood outlier detection. Expert Systems with Applications (2010)
- An entropy-based uncertainty measurement approach in neighborhood systems. Information Sciences (2014)
- Measures of uncertainty for neighborhood rough sets. Knowledge-Based Systems (2017)
- Gene selection for tumor classification using neighborhood rough sets and entropy measures. Journal of Biomedical Informatics (2017)
- Uncertainty measures of rough set prediction. Artificial Intelligence (1998)
- Discovering cluster-based local outliers. Pattern Recognition Letters (2003)
- Mixed feature selection based on granulation and approximation. Knowledge-Based Systems (2008)
- Neighborhood rough set based heterogeneous feature subset selection. Information Sciences (2008)
- Information-preserving hybrid data reduction based on fuzzy-rough techniques. Pattern Recognition Letters (2006)
- Neighborhood classifiers. Expert Systems with Applications
- Some issues about outlier detection in rough set theory. Expert Systems with Applications
- An information entropy-based approach to outlier detection in rough sets. Expert Systems with Applications
- A hybrid approach to outlier detection based on boundary region. Pattern Recognition Letters
- A new measure of uncertainty based on knowledge granulation for rough sets. Information Sciences
- Stability analysis of hyperspectral band selection algorithms based on neighborhood rough set theory for classification. Chemometrics & Intelligent Laboratory Systems
- Feature subset selection based on fuzzy neighborhood rough sets. Knowledge-Based Systems
- Monotonic uncertainty measures for attribute reduction in probabilistic rough set model. International Journal of Approximate Reasoning
- Neighborhood operator systems and approximations. Information Sciences
- Relational interpretations of neighborhood operators and rough set approximation operators. Information Sciences
- Feature selection in mixed data: A method using a novel fuzzy rough set-based information entropy. Pattern Recognition
- Three-layer granular structures and three-way informational measures of a decision table. Information Sciences
- Outlier detection for high dimensional data. ACM Sigmod Record
- Algorithm for the detection of outliers based on the theory of rough sets. Decision Support Systems