Abstract
Detecting outliers in a dataset is an important data mining task with many applications, such as the detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pairwise distances among points. More recently, outlier detection methods have been proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by a well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a δ parameter. Our proposed methods use a far smaller collection of sets than the FI collection to compute an anomaly score for each data point. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in speed and scalability over FI-based outlier detection methods. Significantly, NDI-based methods exhibit similar or better detection accuracy compared to the FI-based methods, which supports our claim that the NDI representation is especially well-suited to the task of detecting outliers. At the same time, the NADI-based approximation scheme exhibits accuracy similar to that of the NDI-based method for various δ values, along with further runtime performance gains. Finally, we offer an in-depth discussion and experimentation regarding the trade-offs of the proposed algorithms and the choice of parameter values.
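To make the idea of an itemset-based anomaly score concrete, the following minimal Python sketch scores each categorical record by the frequent itemsets it contains. The abstract does not state the scoring rule, so the rule used here (sum of the supports of the frequent itemsets contained in a record, divided by the total number of frequent itemsets) is an assumption chosen for illustration, and the names frequent_itemsets, anomaly_scores, min_support and max_len are hypothetical. The brute-force enumeration stands in for a real FI or NDI miner; the point of a condensed representation such as NDIs would be to shrink the collection passed to the scoring step.

```python
from itertools import combinations


def frequent_itemsets(records, min_support, max_len=2):
    """Brute-force enumeration (illustrative only) of itemsets whose
    relative support is at least min_support, up to max_len items."""
    n = len(records)
    counts = {}
    for rec in records:
        items = sorted(rec)  # fixed order so each itemset has one key
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    return {
        frozenset(combo): cnt / n
        for combo, cnt in counts.items()
        if cnt / n >= min_support
    }


def anomaly_scores(records, itemsets):
    """Assumed scoring rule: sum of supports of the itemsets contained
    in the record, normalized by the number of itemsets.
    Lower scores mean the record matches fewer common patterns."""
    if not itemsets:
        return [0.0] * len(records)
    total = len(itemsets)
    return [
        sum(sup for iset, sup in itemsets.items() if iset <= rec) / total
        for rec in records
    ]


# Toy data: each record is a set of (attribute, value) pairs.
records = [
    {("color", "red"), ("size", "S")},
    {("color", "red"), ("size", "S")},
    {("color", "red"), ("size", "S")},
    {("color", "red"), ("size", "S")},
    {("color", "blue"), ("size", "L")},  # the unusual record
]

fis = frequent_itemsets(records, min_support=0.5)
scores = anomaly_scores(records, fis)
most_anomalous = min(range(len(records)), key=scores.__getitem__)
print(scores, most_anomalous)
```

In this sketch the cost of scoring grows with the number of itemsets in the collection, which is why replacing the full FI collection with a smaller one would be expected to reduce runtime.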
Cite this article
Koufakou, A., Secretan, J. & Georgiopoulos, M. Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data. Knowl Inf Syst 29, 697–725 (2011). https://doi.org/10.1007/s10115-010-0343-7
DOI: https://doi.org/10.1007/s10115-010-0343-7