negFIN: An efficient algorithm for fast mining frequent itemsets
Introduction
"Frequent itemset mining" is an important data mining task with numerous applications in other data mining tasks, such as the discovery of association rules (Ceglar & Roddick, 2006), clustering (Wang, Wang, Yang, & Yu, 2002), and classification (Cheng, Yan, Han, & Yu, 2008). The task was first proposed by Agrawal, Imieliński, and Swami (1993), and its original use was market basket analysis. It aims to find items in a customer transaction database that are frequently bought together.
Let I = {i_1, i_2, …, i_nit} be the set of all items in the transactional database; let a transaction T be a set of items (T ⊆ I) with a unique identifier TID; and let a database DB = {T_1, T_2, …, T_nt} be the set of transactions. Each P where P ⊆ I is called an "itemset." P is also called a k-itemset, where |P| = k. A transaction T contains an itemset P if and only if P ⊆ T; the support of P, denoted support(P), is defined as the percentage of transactions in DB containing P. Let min-support be the user-defined minimum support threshold. P is called a frequent itemset if and only if min-support ≤ support(P). Given the database DB and the min-support threshold, the frequent itemset mining task is defined as "discovering all frequent itemsets with their supports." The number of itemsets that have to be checked to discover frequent itemsets is 2^nit, where nit = |I|. The search space therefore grows exponentially with the number of items, and the problem is computationally hard.
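The definitions above can be made concrete with a minimal sketch. It is a naive enumeration of all 2^nit − 1 non-empty itemsets, not any of the algorithms discussed in this paper, and the function names and the toy database are hypothetical.

```python
from itertools import combinations

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in db if itemset <= t) / len(db)

def frequent_itemsets(db, min_support):
    """Brute force: test every non-empty subset of I (exponential in |I|)."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            s = support(candidate, db)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result

# Hypothetical toy database of four transactions over I = {a, b, c}.
db = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
freq = frequent_itemsets(db, min_support=0.5)
```

With a threshold of 0.5, the itemsets {a}, {b}, {c}, {a, b}, and {a, c} are reported, while {b, c} and {a, b, c} fall below the threshold.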
Frequent itemset mining has been a hot research topic in the data mining field for the last two decades (Aliberti et al., 2015; Calders et al., 2014; Deng, 2014; Deng et al., 2011; Lan et al., 2015; Lin et al., 2015; Troiano and Scibelli, 2014; Vo et al., 2015). In recent years, four types of data structures based on the sets of nodes in a "prefix tree" have been presented to enhance the efficiency of mining frequent itemsets. They are: (1) Node-list (Deng & Wang, 2010), (2) N-list (Deng, Wang, & Jiang, 2012), (3) Nodeset (Deng & Lv, 2014), and (4) DiffNodeset (Deng, 2016). All of these data structures employ a prefix tree with encoded nodes and associate a set of nodes with each itemset. The nodes in Node-list and N-list are encoded by the pre-order rank and post-order rank of the node. Two algorithms, PPV (Deng & Wang, 2010) and PrePost (Deng et al., 2012), have been proposed for mining frequent itemsets based on these two data structures, respectively. These two algorithms outperform their predecessors. However, they have a drawback: they consume a lot of memory (Deng & Lv, 2014). To overcome this problem, another data structure, called Nodeset (Deng & Lv, 2014), has been proposed. Unlike N-list and Node-list, the nodes in a Nodeset are encoded only by the pre-order (or post-order) rank of the nodes (Deng & Lv, 2014). The Nodeset of each k-itemset (3 ≤ k) is extracted by the intersection of the Nodesets of two (k-1)-itemsets (Deng & Lv, 2014). The FIN algorithm (Deng & Lv, 2014) has been proposed for frequent itemset mining based on this structure. The disadvantage of Nodeset is that the Nodeset cardinality becomes very large for some datasets (Deng, 2016). To overcome this problem, another data structure, DiffNodeset (Deng, 2016), has been proposed. In contrast to Nodeset, the DiffNodeset of each k-itemset (3 ≤ k) is extracted by the difference between the DiffNodesets of two (k-1)-itemsets (Deng, 2016).
Extensive experiments show that the cardinality of DiffNodeset is smaller than that of Nodeset (Deng, 2016). The dFIN algorithm (Deng, 2016) has been proposed for mining frequent itemsets based on the DiffNodeset data structure. Experimental results show that the dFIN algorithm is faster than its predecessors (Deng, 2016).
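To illustrate the pre-order and post-order encoding used by Node-list and N-list, the following is a minimal sketch of a prefix tree whose nodes carry both ranks. The class and function names, the fixed item order, and the toy database are assumptions made for illustration and are not taken from the cited implementations. The ancestor test relies on a general property of tree traversals: a node a is an ancestor of a node b exactly when a comes earlier in pre-order and later in post-order.

```python
class Node:
    def __init__(self, item=None):
        self.item = item
        self.count = 0        # number of transactions passing through this node
        self.children = {}    # item -> child Node
        self.pre = -1         # pre-order rank
        self.post = -1        # post-order rank

def build_prefix_tree(db, order):
    """Insert each transaction, sorted by the given item order, as a path."""
    root = Node()
    for transaction in db:
        node = root
        for item in sorted(transaction, key=order.index):
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
            node.count += 1
    return root

def assign_ranks(root):
    """Label every node with its pre-order and post-order rank."""
    counters = {"pre": 0, "post": 0}

    def visit(node):
        node.pre = counters["pre"]
        counters["pre"] += 1
        for child in node.children.values():
            visit(child)
        node.post = counters["post"]
        counters["post"] += 1

    visit(root)

def is_ancestor(a, b):
    """True if a is a proper ancestor of b, decided from the ranks alone."""
    return a.pre < b.pre and a.post > b.post

# Hypothetical toy database and item order.
db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"]]
root = build_prefix_tree(db, order=["a", "b", "c"])
assign_ranks(root)
node_a = root.children["a"]
node_abc = node_a.children["b"].children["c"]
node_b = root.children["b"]
```

Here `is_ancestor(node_a, node_abc)` holds, while `is_ancestor(node_b, node_abc)` does not, because the two nodes lie on different branches.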
Despite the advantages of DiffNodeset, we find that calculating the difference between two DiffNodesets takes a long time on some databases. To overcome this problem, we propose a new data structure, NegNodeset, which employs a prefix tree, as do the previous four data structures. Unlike these data structures, NegNodeset employs a new encoding model for nodes. The node-encoding model of NegNodeset is based on the bitmap representation of sets. Consider a universal set U with cardinality n. We can represent each subset of U by a bitmap of size n. Each element of U is assigned to one of the bits in the bitmap. If an element is a member of a subset S (S ⊆ U), then its corresponding bit is 1; otherwise it is 0. Consider the following example: let there be a universal set U = {a3, a2, a1, a0} and subsets A = {a3, a2} and B = {a3, a0}. With two bitmaps of size four, in which each ai (0 ≤ i ≤ 3) is assigned to the ith bit (counting from the right), these subsets are represented as A = 1100 and B = 1001. With this representation of sets, some common set operators can be implemented faster using bitwise operators. For example, to calculate the intersection (union) of two given sets, we can use the bitwise operator AND (OR) on their corresponding bitmaps. Bitwise operators are implemented efficiently in CPUs: a bitwise operation on a machine word typically completes in a single CPU cycle.
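The bitmap example above can be reproduced with a short sketch. It covers only the generic bitmap representation of sets, not the NegNodeset structure itself, and the helper names `to_bitmap` and `to_set` are hypothetical.

```python
UNIVERSE = ["a0", "a1", "a2", "a3"]  # element a_i is assigned to bit i

def to_bitmap(subset):
    """Encode a subset of UNIVERSE as an integer bitmap."""
    bits = 0
    for element in subset:
        bits |= 1 << UNIVERSE.index(element)
    return bits

def to_set(bits):
    """Decode an integer bitmap back into a subset of UNIVERSE."""
    return {e for i, e in enumerate(UNIVERSE) if (bits >> i) & 1}

A = to_bitmap({"a3", "a2"})   # 0b1100
B = to_bitmap({"a3", "a0"})   # 0b1001

intersection = A & B          # 0b1000, i.e. {a3}
union = A | B                 # 0b1101, i.e. {a3, a2, a0}
difference = A & ~B           # 0b0100, i.e. {a2}
```

Each operation is a single integer instruction regardless of how many elements the subsets contain, as long as the universe fits in one machine word.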
Based on the NegNodeset data structure, we propose negFIN, a fast algorithm for mining frequent itemsets. The efficiency of the negFIN algorithm stems from three properties: (1) new NegNodesets are extracted using bitwise operators, which are fast; (2) the complexity of extracting new NegNodesets and counting their supports is reduced to O(n), instead of O(m + n) in previous algorithms, where m and n are the cardinalities of two sets of nodes, and n ≤ m; and (3) it employs a "set-enumeration tree" (Rymon, 1992) to generate frequent itemsets and uses a promotion method to prune the search space in this tree. This pruning strategy sometimes generates frequent itemsets directly, without candidate generation.
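As a rough illustration of how a set-enumeration tree organizes the search space, the sketch below walks such a tree depth-first. It is not negFIN: supports are counted with plain sets of transaction identifiers rather than NegNodesets, pruning uses only the fact that a superset of an infrequent itemset cannot be frequent, and the promotion method is not shown. The function name `mine` and the toy database are hypothetical.

```python
def mine(db, min_count):
    """Depth-first walk of a set-enumeration tree over the items of db."""
    items = sorted(set().union(*db))
    # For each item, the identifiers of the transactions that contain it.
    tidsets = {
        item: {tid for tid, t in enumerate(db) if item in t} for item in items
    }
    frequent = {}

    def expand(prefix, tids, start):
        # A node is extended only with items that come later in the order,
        # so every itemset appears exactly once in the tree.
        for i in range(start, len(items)):
            child_tids = tids & tidsets[items[i]]
            if len(child_tids) >= min_count:
                itemset = prefix + (items[i],)
                frequent[itemset] = len(child_tids)
                expand(itemset, child_tids, i + 1)
            # Otherwise the whole subtree below this child is pruned.

    expand((), set(range(len(db))), 0)
    return frequent

# Hypothetical toy database; an itemset is frequent if it occurs at least twice.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = mine(db, min_count=2)
```

On this database the walk visits ("a",), ("a", "b"), ("a", "c"), ("b",), and ("c",) as frequent nodes and never expands below ("a", "b", "c") or ("b", "c").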
We conducted several experimental studies to evaluate the performance of the negFIN algorithm. We compared negFIN against dFIN (Deng, 2016), Goethals's Eclat (Goethals & Zaki, 2004), and FP-growth* (Grahne & Zhu, 2005), which have been the leading algorithms in the field of frequent itemset mining so far. The experimental results show that negFIN runs at least as fast as each of these algorithms. It runs faster than Goethals's Eclat (Goethals & Zaki, 2004) and FP-growth* (Grahne & Zhu, 2005) on all datasets. It runs faster than dFIN (Deng, 2016) on some datasets and as fast as dFIN on the others.
The rest of this paper is organized as follows: Section 2 discusses background and related work for frequent itemset mining. Section 3 introduces basic definitions and properties relevant to the NegNodeset structure and the negFIN algorithm. Section 4 explains the negFIN algorithm. Section 5 shows experimental results. Section 6 concludes the paper, and Section 7 provides some future research directions.
Related work
Many algorithms have been proposed to discover all frequent itemsets efficiently. These algorithms are divided into two main categories: (1) algorithms that use the "candidate generation" method, and (2) algorithms that use the "pattern growth" method (Ceglar & Roddick, 2006). In the candidate generation method, the candidate itemsets are generated first, and then frequent itemsets are identified from these candidate itemsets. This method employs an anti-monotone property, called Apriori (
Basic terminologies
In addition, a similar data structure named PUN-list (Deng, 2018) has been proposed to discover "high utility itemsets," a new kind of mining task that is different from frequent itemset mining. In this task, each item has a utility value and can occur more than once in a transaction. A high utility itemset is an itemset whose utility is not less than a given minimum threshold. In addition to storing information about frequent itemsets, the PUN-list data structure also stores information
negFIN: the proposed algorithm
negFIN employs a set-enumeration tree (Definition 11) to represent the search space. The framework of negFIN consists of three steps. In the first step, the BMC-tree is constructed, all frequent 1-itemsets and their Nodesets are identified, and level 1 of the set-enumeration tree is constructed. In the second step, all frequent 2-itemsets and their NegNodesets are identified, and level 2 of the set-enumeration tree is constructed. In the third step, all frequent k-itemsets (3 ≤ k) and their
Results of experiment and analysis
In order to evaluate the performance of the negFIN algorithm, we conducted two groups of experiments. The purpose of the first group of experiments is to compare the performance of the negFIN algorithm against the following algorithms: (1) Goethals's Eclat (Goethals & Zaki, 2004), which is the state-of-the-art algorithm in the family of vertical mining algorithms (Deng et al., 2012), and (2) FP-growth* (Grahne & Zhu, 2005), which is the state-of-the-art algorithm in the family of FP-tree-based
Conclusion
In this paper, we presented a new data structure, called NegNodeset, to store essential information about frequent itemsets. Based on NegNodeset, we presented an algorithm, called negFIN, to rapidly discover all frequent itemsets in databases. Compared with dFIN, the key advantages of negFIN are as follows: (1) it employs bitwise operators to generate new sets of nodes; (2) it reduces the time complexity of discovering frequent itemsets to O(ln), instead of O(l(m + n)), where m and n are the
Future research directions
Future research directions are as follows: employing NegNodeset to (1) mine "closed frequent itemsets" (Le and Vo, 2015, Lee et al., 2008, Wang et al., 2003), (2) mine "maximal frequent itemsets" (Burdick et al., 2005, Roberto and Bayardo, 1998), (3) mine "Top-Rank-k frequent itemsets" (Deng, 2014, Huynh-Thi-Le et al., 2015), (4) mine "erasable itemsets" (Le, Vo, & Nguyen, 2014), (5) mine "fuzzy itemsets" (Lan et al., 2015, Lin et al., 2015), (6) mine "frequent disjunctive closed itemsets" (
Acknowledgments
The authors offer their gratitude to Dr. Zhi-Hong Deng of the School of Electronics Engineering and Computer Science, Peking University, Beijing, for providing the implementation code of the FIN algorithm (Deng & Lv, 2014). We used the FIN implementation code to implement the dFIN algorithm (Deng, 2016) as well as our algorithm, negFIN.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
References (44)
- et al. EXPEDITE: EXPress closED ITemset enumeration. Expert Systems with Applications (2015)
- et al. Mining frequent itemsets in a stream. Information Systems (2014)
- Fast mining top-rank-k frequent patterns by using node-lists. Expert Systems with Applications (2014)
- DiffNodesets: An efficient structure for fast mining frequent itemsets. Applied Soft Computing (2016)
- et al. Fast mining frequent itemsets using nodesets. Expert Systems with Applications (2014)
- et al. PrePost+: An efficient N-lists-based algorithm for mining frequent itemsets via children–parent equivalence pruning. Expert Systems with Applications (2015)
- et al. An efficient and effective algorithm for mining top-rank-k frequent patterns. Expert Systems with Applications (2015)
- et al. Fuzzy utility mining with upper-bound measure. Applied Soft Computing (2015)
- et al. An N-list-based algorithm for mining frequent closed patterns. Expert Systems with Applications (2015)
- et al. An efficient algorithm for mining closed inter-transaction itemsets. Data & Knowledge Engineering (2008)
- Mining frequent patterns from network flows for monitoring network. Expert Systems with Applications
- A CMFFP-tree algorithm to mine complete multiple fuzzy frequent itemsets. Applied Soft Computing
- Parallel frequent itemset mining using systolic arrays. Knowledge-Based Systems
- Mining frequent itemsets in data streams within a time horizon. Data & Knowledge Engineering
- Disclosed: An efficient depth-first, top-down algorithm for mining disjunctive closed itemsets in high-dimensional data. Information Sciences
- Fast updated frequent-itemset lattice for transaction deletion. Data & Knowledge Engineering
- Mining association rules between sets of items in large databases. SIGMOD Record
- Fast algorithms for mining association rules in large databases
- MAFIA: A maximal frequent itemset algorithm. IEEE Transactions on Knowledge and Data Engineering
- Association mining. ACM Computing Surveys
- Finding recent frequent itemsets adaptively over online data streams
- Direct discriminative pattern mining for effective classification