Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pair-wise distances among points. Recently, outlier detection methods were proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by the well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a δ parameter. Our proposed methods use a far smaller collection of sets than the FI collection to compute an anomaly score for each data point. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in speed and scalability over FI-based outlier detection methods. Significantly, NDI-based methods exhibit similar or better detection accuracy compared to the FI-based methods, which supports our claim that the NDI representation is especially well-suited for the task of detecting outliers. At the same time, the NDI approximation scheme, NADIs, is shown to exhibit accuracy similar to that of the NDI-based method for various δ values, along with further runtime gains. Finally, we offer an in-depth discussion and experimentation regarding the trade-offs of the proposed algorithms and the choice of parameter values.
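
The abstract does not spell out the scoring function, so the following is only a minimal sketch of the general idea, not the authors' FNDI-OD implementation. It assumes a score in the spirit of frequent-pattern outlier factors (the supports of the itemsets contained in a row are summed and divided by the collection size, so a lower score means more anomalous) and uses the standard inclusion-exclusion deduction rules to decide whether an itemset is derivable. The toy data, the threshold, and all function names (`support`, `frequent_itemsets`, `is_derivable`, `anomaly_scores`) are hypothetical, and the brute-force enumeration is for illustration only.

```python
from itertools import combinations

def support(itemset, rows):
    """Fraction of rows that contain every item in `itemset`."""
    return sum(itemset <= row for row in rows) / len(rows)

def frequent_itemsets(rows, min_sup, max_len):
    """Brute-force enumeration of frequent itemsets (toy data only)."""
    items = sorted(set().union(*rows))
    found = {}
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            s = support(frozenset(combo), rows)
            if s >= min_sup:
                found[frozenset(combo)] = s
    return found

def is_derivable(itemset, rows):
    """True if inclusion-exclusion bounds pin the itemset's count exactly.

    For each proper subset X of I, the alternating sum over all J with
    X <= J < I is an upper bound on count(I) when |I \\ X| is odd and a
    lower bound when it is even.
    """
    def count(j):
        return sum(j <= row for row in rows)

    lower, upper = 0, len(rows)
    items = sorted(itemset)
    for r in range(len(items)):                # sizes of proper subsets X
        for x in combinations(items, r):
            x = frozenset(x)
            rest = sorted(itemset - x)
            bound = 0
            for m in range(len(rest)):         # J stays a proper subset of I
                for extra in combinations(rest, m):
                    j = x | frozenset(extra)
                    sign = (-1) ** (len(itemset) - len(j) + 1)
                    bound += sign * count(j)
            if len(rest) % 2 == 1:
                upper = min(upper, bound)
            else:
                lower = max(lower, bound)
    return lower == upper

def anomaly_scores(rows, collection):
    """Assumed score: mean support mass of the itemsets each row contains."""
    return [
        sum(s for iset, s in collection.items() if iset <= row) / len(collection)
        for row in rows
    ]

# Hypothetical categorical data: one value per attribute, last row is unusual.
rows = [
    frozenset({"color=red", "shape=round", "size=small"}),
    frozenset({"color=red", "shape=round", "size=small"}),
    frozenset({"color=red", "shape=round", "size=small"}),
    frozenset({"color=red", "shape=round", "size=large"}),
    frozenset({"color=red", "shape=round", "size=large"}),
    frozenset({"color=blue", "shape=square", "size=large"}),
]

fis = frequent_itemsets(rows, min_sup=0.3, max_len=3)
ndis = {iset: s for iset, s in fis.items() if not is_derivable(iset, rows)}

fi_scores = anomaly_scores(rows, fis)
ndi_scores = anomaly_scores(rows, ndis)

print(len(fis), len(ndis))
print(min(range(len(rows)), key=ndi_scores.__getitem__))
```

On this toy data the frequent non-derivable collection is smaller than the full frequent collection, yet both flag the same row as the most anomalous, which is the kind of behavior the abstract reports at scale.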

Author information

Corresponding author

Correspondence to Anna Koufakou.

About this article

Cite this article

Koufakou, A., Secretan, J. & Georgiopoulos, M. Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data. Knowl Inf Syst 29, 697–725 (2011). https://doi.org/10.1007/s10115-010-0343-7

  • DOI: https://doi.org/10.1007/s10115-010-0343-7
