Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data

Koufakou, Anna; Secretan, Jimmy; Georgiopoulos, Michael

doi:10.1007/s10115-010-0343-7

Regular Paper
Published: 08 December 2010

Knowledge and Information Systems, Volume 29, pages 697–725 (2011)

Anna Koufakou¹,
Jimmy Secretan² &
Michael Georgiopoulos²

Abstract

Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pairwise distances among points. Recently, outlier detection methods were proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by the well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a δ parameter. Our proposed methods use a far smaller collection of sets than the FI collection to compute an anomaly score for each data point. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in speed and scalability over FI-based outlier detection methods. Significantly, NDI-based methods exhibit similar or better detection accuracy compared to the FI-based methods, which supports our claim that the NDI representation is especially well-suited for the task of detecting outliers. At the same time, the NDI approximation scheme, NADIs, is shown to exhibit accuracy similar to that of the NDI-based method for various δ values, with further runtime performance gains. Finally, we offer an in-depth discussion and experimentation regarding the trade-offs of the proposed algorithms and the choice of parameter values.
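To make the notion of a non-derivable itemset concrete, the following is a minimal Python sketch, not the FNDI-OD or NBNDI-OD algorithms themselves, whose details are not given in the abstract. It assumes the standard inclusion-exclusion deduction rules from the non-derivable itemset literature: for an itemset I and each subset X of I, the alternating sum of the supports of all J with X ⊆ J ⊊ I is an upper bound on the support of I when |I \ X| is odd and a lower bound when it is even. An itemset is derivable when the tightest lower and upper bounds coincide, so its support need not be stored. The function names `support`, `ndi_bounds`, and `is_derivable` and the toy `transactions` list are hypothetical, and support is counted by brute force purely for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    s = frozenset(itemset)
    return sum(1 for t in transactions if s <= t)


def ndi_bounds(itemset, transactions):
    """Tightest (lower, upper) bounds on support(itemset), computed only
    from the supports of its proper subsets (brute force, illustration)."""
    items = frozenset(itemset)
    lower, upper = 0, len(transactions)
    for k in range(len(items) + 1):
        for x_tuple in combinations(sorted(items), k):
            x = frozenset(x_tuple)
            rest = items - x
            bound = 0
            # Enumerate every J with X <= J < I (J is a proper subset of I).
            for r in range(len(rest)):
                for extra in combinations(sorted(rest), r):
                    j = x | frozenset(extra)
                    # Sign is (-1)^(|I \ J| + 1): positive when |I \ J| is odd.
                    sign = 1 if (len(items) - len(j)) % 2 == 1 else -1
                    bound += sign * support(j, transactions)
            if len(rest) % 2 == 1:
                upper = min(upper, bound)   # odd |I \ X| gives an upper bound
            else:
                lower = max(lower, bound)   # even |I \ X| gives a lower bound
    return lower, upper


def is_derivable(itemset, transactions):
    """An itemset is derivable when its bounds pin the support exactly."""
    lower, upper = ndi_bounds(itemset, transactions)
    return lower == upper


# Toy categorical data: item "a" occurs in every transaction.
transactions = [
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"a", "b", "c"}),
    frozenset({"a"}),
]

print(ndi_bounds({"a", "b"}, transactions))  # bounds coincide: derivable
print(ndi_bounds({"b", "c"}, transactions))  # bounds differ: non-derivable
```

In this toy data, {a, b} is derivable because "a" appears in every transaction, so its support is fully determined by the support of {b}. By contrast, {b, c} is non-derivable and would be kept in the condensed collection. A practical miner would compute supports level by level rather than rescanning the data for every subset.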

Author information

Authors and Affiliations

U.A. Whitaker School of Engineering, Florida Gulf Coast University, Fort Myers, FL, 33965, USA
Anna Koufakou
School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL, USA
Jimmy Secretan & Michael Georgiopoulos

Corresponding author

Correspondence to Anna Koufakou.

About this article

Cite this article

Koufakou, A., Secretan, J. & Georgiopoulos, M. Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data. Knowl Inf Syst 29, 697–725 (2011). https://doi.org/10.1007/s10115-010-0343-7

Received: 07 June 2009
Revised: 30 April 2010
Accepted: 04 September 2010
Published: 08 December 2010
Issue Date: December 2011
DOI: https://doi.org/10.1007/s10115-010-0343-7
