Elsevier

Expert Systems with Applications

Volume 112, 1 December 2018, Pages 243-257

Hybrid data-driven outlier detection based on neighborhood information entropy and its developmental measures

https://doi.org/10.1016/j.eswa.2018.06.013

Highlights

  • Neighborhood information entropy and its deep measures are built to detect outliers.

  • The new outlier detection method applies to categorical, numeric, and mixed data.

  • The NIEOD algorithm shows better adaptability and effectiveness than six existing detection algorithms.

  • This study deepens current outlier detection from a new hybrid data-driven perspective.

Abstract

Outliers carry distinctive mechanisms and valuable information, so they play an important role in expert and intelligent systems, and outlier detection has been extensively applied in fields such as fraud detection, medical diagnosis, and public security. Rough set-based outlier detection methods have recently received in-depth study because they are data-driven and require no additional knowledge. However, classical rough set-based methods handle only categorical data; neighborhood rough sets accommodate numeric and heterogeneous data, but their use in outlier detection is so far mainly restricted to numeric data. Taking a hybrid data-driven view, this paper investigates outlier detection through the neighborhood information entropy and its developmental measures, and the applicable data sets cover categorical, numeric, and mixed data; as a result, the new method extends both the traditional distance-based and rough set-based methods. Concretely, the neighborhood information system is first determined by a heterogeneous distance and a self-adapting radius; the neighborhood information entropy is then defined to measure overall uncertainty; three successive information measures are further constructed to describe each single object; and finally the neighborhood entropy-based outlier factor (NEOF) is established by integrating these measures to detect outliers. The NEOF-based outlier detection algorithm (called the NIEOD algorithm) is then designed and applied. In experiments on UCI data, the NIEOD algorithm is compared with six existing detection algorithms (NED, IE, SEQ, FindCBLOF, DIS, and KNN), and the results generally show better effectiveness and adaptability for the new method.

Introduction

Knowledge discovery in databases (KDD), or data mining, is an important issue in the development of knowledge-based and data-based systems. Knowledge discovery tasks are usually classified into four general categories: (a) dependency detection, (b) class identification, (c) class description, and (d) outlier/exception detection (Knorr & Ng, 1998). In contrast to most KDD tasks (e.g., traditional pattern recognition aims to construct a general pattern that fits the majority of the data), outlier detection aims to find the rare data whose behavior is exceptional compared with the large remainder of the data. An outlier (also known as an anomaly) is a data point that deviates significantly from the other objects in a data set (Hawkins, 1980); it usually reflects a new perspective or a specific mechanism, which makes it more interesting than normal instances in knowledge discovery. Outliers therefore play an important role in expert and intelligent systems, and outlier detection has been extensively applied in fields including intrusion detection, image processing, medical treatment, and public security (Han, Kamber, Pei, 2011, Knorr, Ng, 1998). Outlier detection methods and their development thus have both theoretical significance and applied value in data mining, and this paper aims to establish a novel detection approach for the hybrid data that commonly exist in practical systems.

Outlier detection has three traditional families of methods, each with its own features and advantages: the statistical method (Rousseeuw & Leroy, 1987), the proximity-based approach (Breunig, Kriegel, Ng, Sander, 2000, Knorr, Ng, 1997, Knorr, Ng, Tucakov, 2000), and the clustering-based method (Jain, 1999). The statistical method assumes that normal data objects are generated by a statistical model, so points that do not obey the model are treated as outliers; this approach applies to data sets with a known distribution and a single attribute. The proximity-based approach emerged to improve on the statistical one, and it usually adopts one of two basic strategies: distance-based detection (Knorr et al., 2000) and density-based detection (Breunig et al., 2000). The effectiveness of the clustering-based method depends mainly on the clustering technique used.
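As a minimal sketch of the distance-based strategy mentioned above, the following Python snippet scores each object by the distance to its k-th nearest neighbor, so that larger scores suggest more isolated objects. The function name `knn_outlier_scores`, the toy points, and the choice k = 2 are illustrative assumptions; this is a generic k-nearest-neighbor score and not necessarily the exact KNN or DIS baseline used in the paper.

```python
import math

def knn_outlier_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical toy data: four clustered points and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = knn_outlier_scores(points, k=2)

# Index of the point with the largest score, i.e. the most isolated one.
most_outlying = max(range(len(points)), key=scores.__getitem__)
print(most_outlying, scores)
```

Because the score is built from Euclidean distance, it presupposes numeric attributes, which is the limitation on categorical data that motivates the rough set-based methods.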

Most of the above traditional detection methods require some additional information. Rough set-based detection methods have therefore recently received in-depth study, because they are data-driven and require no additional knowledge. Moreover, the traditional distance-based method relies on distances between objects, so it suits numeric data better than categorical data, whose values have no comparable distance relationship. To address this issue, rough sets have been introduced into outlier detection to handle categorical data (Berna-Martinez, Ortega, 2015, Chen, Miao, Wang, 2008, Jiang, Chen, 2015, Jiang, Sui, Cao, 2008, Jiang, Sui, Cao, 2009, Jiang, Sui, Cao, 2010, Jiang, Sui, Cao, 2011, Shaari, Bakar, Hamdan, 2009); rough sets originate from the study of intelligent systems characterized by insufficient and incomplete information (Pawlak, 1982, Pawlak, 1991), and they have been successfully applied in machine learning, data mining, pattern recognition, etc. However, classical rough set-based detection methods rely only on the basic equivalence relation, so they apply directly to categorical (or nominal) data but not to numeric data; numeric data can be discretized to fit the rough set approach, but this preprocessing usually costs extra time and loses information. Besides purely numeric and purely categorical data, mixed (or hybrid or heterogeneous) data that combine both are ubiquitous in the real world; outlier detection for such data is clearly needed and challenging, yet relevant reports are rare.
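To illustrate why the basic equivalence relation suits categorical data but not raw numeric data, the sketch below partitions a toy table into equivalence classes, i.e. groups of objects with identical values on the chosen attributes. The function `equivalence_classes`, the attribute names, and the data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def equivalence_classes(rows, attrs):
    """Group object indices whose values agree on every attribute in attrs."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    return list(classes.values())

# Hypothetical mixed table: one categorical and one numeric attribute.
rows = [
    {"colour": "red", "weight": 1.02},
    {"colour": "red", "weight": 1.03},
    {"colour": "blue", "weight": 1.01},
    {"colour": "red", "weight": 5.70},
]

by_colour = equivalence_classes(rows, ["colour"])
by_weight = equivalence_classes(rows, ["weight"])
print(by_colour)
print(by_weight)
```

On the categorical attribute the partition groups objects meaningfully, whereas on the numeric attribute every object falls into its own class even though three of the weights are nearly equal. Exact equality therefore carries no notion of closeness, which is why numeric data must either be discretized or handled with a neighborhood relation.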

To improve on classical rough sets, neighborhood rough sets adopt a robust neighborhood relation that accommodates numeric and hybrid data, thus providing a more powerful platform. In early research, neighborhood spaces were treated as more general topological spaces (Lin, 1988, Lin, 2008), neighborhood approximation properties were revealed (Wu, Zhang, 2002, Yao, 1998), and neighborhood rough sets were used for heterogeneous data reduction (Hu, Liu, Yu, 2008, Hu, Yu, Liu, Wu, 2008, Hu, Yu, Xie, 2006). At present, neighborhood rough sets have been applied effectively and in depth to attribute reduction, feature selection, classification, and uncertainty reasoning (Chen, Li, Cai, Luo, Fujita, 2016, Chen, Zhang, Zheng, Ying, Yu, 2017, Kumar, Inbarani, 2016, Liu, Yang, Chen, Tan, Wang, Yan, 2017, Wang, Shao, He, Qian, Qi, 2016). For outlier detection, however, neighborhood rough set-based approaches have not received enough attention, especially regarding mixed data. In fact, existing neighborhood-based detection work is mainly restricted to numeric data (Chen, Miao, Zhang, 2010, Li, Rao, 2012).

Against this background, hybrid data-driven outlier detection based on neighborhood rough sets is a valuable and novel topic, and this paper makes a preliminary study of it by constructing information measures. The information entropy proposed by Shannon (1948) provides a fundamental mechanism for uncertainty measurement; it has been introduced into classical rough sets to represent uncertainty via multiple entropy forms or information measures (Chen, Zhang, Zheng, Ying, Yu, 2017, Düntsch, Gediga, 1998, Liang, Shi, Li, Wierman, 2006, Liang, Wang, Qian, 2009, Wang, Ma, Yu, 2015, Zhang, Mei, Chen, Li, 2016, Zhang, Miao, 2017), and the neighborhood entropy in particular is discussed by Chen, Wu, Chen, Tang, and Zhu (2014), Chen, Xue, Ma, and Xu (2017) and Li and Rao (2012). In this paper, hybrid data-driven outlier detection is investigated through the neighborhood information entropy and its developmental measures, and a corresponding detection algorithm, the neighborhood information entropy-based outlier detection (NIEOD) algorithm, is designed and applied. Concretely, the neighborhood information system is first determined by a heterogeneous distance and a self-adapting radius; the neighborhood information entropy and its three in-depth measures are then derived to describe data objects through uncertainty measurement; and finally the neighborhood entropy-based outlier factor (NEOF) is established by integrating these measures, yielding the outlier detection method and the NIEOD algorithm. In experiments on UCI data, the NIEOD algorithm is compared with six main detection algorithms (NED, IE, SEQ, FindCBLOF, DIS, and KNN), and the results show that the new method generally has better effectiveness and adaptability. Regarding contributions, the new method extends both the traditional distance-based and rough set-based methods, and thus it applies broadly to categorical, numeric, and heterogeneous data.
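The first two steps of this pipeline can be sketched roughly as follows. The paper's own definitions are not reproduced in this excerpt, so everything here is an assumption made for illustration: the distance uses 0/1 overlap on categorical attributes and absolute difference on numeric attributes (taken to be pre-scaled to [0, 1]), a fixed `radius` stands in for the self-adapting radius, and the entropy takes the form -(1/n) Σ log2(|n(x_i)|/n), one form found in the neighborhood rough set literature. All function names are hypothetical, and the NEOF itself is not implemented.

```python
import math

def hetero_distance(x, y, categorical):
    """Overlap distance on categorical attributes, absolute difference on numeric ones."""
    total = 0.0
    for a, (u, v) in enumerate(zip(x, y)):
        if a in categorical:
            d = 0.0 if u == v else 1.0
        else:
            d = abs(u - v)
        total += d * d
    return math.sqrt(total)

def neighborhoods(data, categorical, radius):
    """For each object, list the indices of objects within the given radius."""
    return [
        [j for j, y in enumerate(data) if hetero_distance(x, y, categorical) <= radius]
        for x in data
    ]

def neighborhood_entropy(nbrs):
    """Assumed form: -(1/n) * sum over objects of log2(|neighborhood| / n)."""
    n = len(nbrs)
    return -sum(math.log2(len(nb) / n) for nb in nbrs) / n

# Hypothetical mixed data: attribute 0 is categorical, attribute 1 is numeric in [0, 1].
data = [("a", 0.10), ("a", 0.12), ("a", 0.15), ("b", 0.90)]
nbrs = neighborhoods(data, categorical={0}, radius=0.1)
sizes = [len(nb) for nb in nbrs]
entropy = neighborhood_entropy(nbrs)
print(sizes, entropy)
```

In this toy run the last object has only itself as a neighbor, which is the kind of isolation that a per-object measure built on neighborhoods could pick up; how the paper combines such measures into the NEOF is not shown here.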

The remainder of this paper is organized as follows. Section 2 reviews the neighborhood information system; Section 3 constructs the core outlier detection method by developing information measures, with three subsections covering the theoretical method, the specific algorithm, and an illustrative example; Section 4 presents experiments and analyses on three typical UCI data sets; finally, Section 5 concludes the paper.

Section snippets

Neighborhood information system

Neighborhood rough sets rest on a basic formal structure, the neighborhood information system, which is reviewed in this section following several references (Hu, Liu, Yu, 2008, Hu, Yu, Liu, Wu, 2008, Hu, Yu, Xie, 2008).

Usually, an information system is a basis of data mining, and can be written as a quadruple $IS=(U,A,V,f)$. Herein, universe $U=\{x_1,x_2,\ldots,x_n\}$ is a nonempty finite set of objects; $A$ is a nonempty finite set of attributes; $V=\bigcup_{a\in A}V_a$ is the union of attribute domain V

Hybrid data-driven outlier detection based on neighborhood information entropy and its developmental measures

Based on the neighborhood information (or decision) system, this section develops the neighborhood information entropy and its subsequent measures to implement outlier detection step by step, in three main parts: the theoretical method, the specific algorithm, and an illustrative example.

UCI data experiments and analyses

This section carries out UCI data experiments and analyses to verify the validity of the proposed outlier detection method, especially the NIEOD algorithm (Algorithm 2).

In the concrete experiments, three UCI data sets (Bay, 1999) are mainly chosen, i.e., the Annealing data set (with hybrid attributes), the Lymphography data set (with mostly categorical attributes), and the Wisconsin Breast Cancer data set (with numeric attributes), and the NIEOD algorithm is compared to six main ways of outlier

Conclusion

Outlier detection has extensive applications in expert and intelligent systems. However, the traditional distance-based detection method does not apply effectively to categorical data, while the classical rough set-based method cannot effectively handle numeric data, let alone mixed data. Based on the hybrid data-driven neighborhood information system, this paper studies outlier detection based on the neighborhood information entropy and its developmental measures. The traditional

Acknowledgments

The authors thank all of the editors and reviewers for their valuable suggestions, which have substantially improved this paper.

This work was supported by the National Natural Science Foundation of China (61673285 and 61203285), the Sichuan Youth Science & Technology Foundation of China (2017JQ0046), and the Scientific Research Project of the Sichuan Provincial Education Department of China (15ZB0029).

References (51)

  • Q.H. Hu et al. Neighborhood classifiers. Expert Systems with Applications (2008)
  • F. Jiang et al. Some issues about outlier detection in rough set theory. Expert Systems with Applications (2009)
  • F. Jiang et al. An information entropy-based approach to outlier detection in rough sets. Expert Systems with Applications (2010)
  • F. Jiang et al. A hybrid approach to outlier detection based on boundary region. Pattern Recognition Letters (2011)
  • J.Y. Liang et al. A new measure of uncertainty based on knowledge granulation for rough sets. Information Sciences (2009)
  • Y. Liu et al. Stability analysis of hyperspectral band selection algorithms based on neighborhood rough set theory for classification. Chemometrics & Intelligent Laboratory Systems (2017)
  • C.Z. Wang et al. Feature subset selection based on fuzzy neighborhood rough sets. Knowledge-Based Systems (2016)
  • G.Y. Wang et al. Monotonic uncertainty measures for attribute reduction in probabilistic rough set model. International Journal of Approximate Reasoning (2015)
  • W.Z. Wu et al. Neighborhood operator systems and approximations. Information Sciences (2002)
  • Y.Y. Yao. Relational interpretations of neighborhood operators and rough set approximation operators. Information Sciences (1998)
  • X. Zhang et al. Feature selection in mixed data: A method using a novel fuzzy rough set-based information entropy. Pattern Recognition (2016)
  • X.Y. Zhang et al. Three-layer granular structures and three-way informational measures of a decision table. Information Sciences (2017)
  • C.C. Aggarwal et al. Outlier detection for high dimensional data. ACM Sigmod Record (2001)
  • Bay, S. D. (1999). The uci kdd repository...
  • J.V. Berna-Martinez et al. Algorithm for the detection of outliers based on the theory of rough sets. Decision Support Systems (2015)