An improved frequent pattern growth method for mining association rules
Introduction
Mining association rules is a very important problem in the data mining field. It consists of identifying the frequent itemsets, and then forming conditional implication rules among them. This information is useful in improving the quality of many business decision-making processes, such as customer purchasing behavior analysis, cross-marketing and catalog design.
The task of mining association rules is formally stated as follows: let I = {i1, i2, … , in} be a set of items and D be a multiset of transactions, where each transaction T contains a set of items in I. We call a subset X ⊆ I an itemset and call X a k-itemset if X contains k items. The support of itemset X, denoted as sup(X), is the number of transactions in D that contain all items in X. If sup(X) is not less than min_sup (the user-specified minimum support), we call X a frequent itemset. An association rule is a conditional implication between itemsets, X ⇒ Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
The confidence of an association rule X ⇒ Y, given as sup(X ∪ Y)/sup(X), is the conditional probability that a transaction T containing X also contains Y. The problem of association rule mining is the discovery of all association rules whose support and confidence are not less than min_sup and the user-defined minimum confidence, respectively.
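The definitions above can be made concrete with a small sketch. The following Python fragment (an illustration with a toy database of our own, not the paper's implementation) computes support, confidence, and the frequent itemsets for a given min_sup:

```python
from itertools import combinations

# Toy transaction multiset D over the items {a, b, c, d} (illustrative data).
D = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]

def sup(itemset, transactions):
    """Support of an itemset: number of transactions containing all its items."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

def confidence(x, y, transactions):
    """Confidence of the rule X => Y, i.e. sup(X ∪ Y) / sup(X)."""
    return sup(set(x) | set(y), transactions) / sup(x, transactions)

# Brute-force enumeration of frequent itemsets (exponential; only for illustration).
min_sup = 3
all_items = sorted({i for t in D for i in t})
frequent = [set(c)
            for k in range(1, len(all_items) + 1)
            for c in combinations(all_items, k)
            if sup(c, D) >= min_sup]
```

Here sup({a, b}) = 3, so with min_sup = 3 the rule {a} ⇒ {b} has confidence 3/4. The brute-force enumeration examines every subset of I, which is exactly the worst-case cost that Apriori and FP-growth are designed to avoid.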
Discovering frequent itemsets is the computationally intensive step in the task of mining association rules. The major challenge is that the mining often needs to generate a huge number of candidate itemsets: if there are n items in the database, then in the worst case all 2^n − 1 candidate itemsets need to be generated and examined. The step of rule construction is straightforward and less expensive. Thus, most researchers concentrate on the first phase of finding frequent itemsets (Agrawal et al., 1993, Agrawal and Srikant, 1994, Han et al., 2000, Li and Lee, 2009, Orlando et al., 2003, Park et al., 1997, Zaki et al., 1997).
Agrawal and Srikant (1994) proposed the Apriori algorithm to solve the problem of mining frequent itemsets. Apriori uses a candidate generation method in which the frequent k-itemsets found in one iteration are used to construct the candidate (k + 1)-itemsets for the next iteration. Apriori terminates when no new candidate itemsets can be generated. DHP, proposed by Park et al. (1997), improves the performance of Apriori: it uses a hash table to filter out infrequent candidate 2-itemsets and employs database trimming to lower the cost of database scanning. However, these methods cannot avoid scanning the database many times to verify frequent itemsets.
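Apriori's candidate generation step can be sketched as follows (a minimal Python illustration of the join-and-prune step, with our own function and variable names):

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Generate candidate (k+1)-itemsets from a set of frequent k-itemsets.

    Join step: merge pairs of frequent k-itemsets sharing the same first
    k-1 items. Prune step: discard any candidate that has an infrequent
    k-subset (the Apriori property: all subsets of a frequent itemset
    must themselves be frequent).
    """
    freq = set(frequent_k)                          # frozensets of size k
    sorted_sets = sorted(tuple(sorted(s)) for s in freq)
    candidates = set()
    for a, b in combinations(sorted_sets, 2):
        if a[:-1] == b[:-1]:                        # join: same (k-1)-prefix
            cand = frozenset(a) | frozenset(b)
            # prune: every k-subset of the candidate must be frequent
            if all(frozenset(sub) in freq
                   for sub in combinations(sorted(cand), len(a))):
                candidates.add(cand)
    return candidates
```

For example, from the frequent 2-itemsets {a, b}, {a, c}, {b, c} the join step produces the single candidate {a, b, c}; if {b, c} were missing, the prune step would reject it without a database scan.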
Unlike Apriori, the FP-growth method (Han et al., 2004, Han et al., 2000) uses an FP-tree to store the frequency information of the transaction database. Without candidate generation, FP-growth uses a recursive divide-and-conquer method and a database projection approach to find the frequent itemsets. However, the recursive mining process may decrease mining performance and raise the memory requirement. FPgrowth∗ (Grahne & Zhu, 2005) uses an FP-array technique to reduce the need to traverse FP-trees; nevertheless, it still has to generate conditional FP-trees for recursive mining, and experimental results show that its running time and memory consumption are almost equal to those of FP-growth. Nonordfp (Racz, 2004) improves FP-growth with a refined tree structure that raises mining performance. According to the results in Racz (2004), nonordfp outperforms FPgrowth∗ (Grahne & Zhu, 2005) and eclat (Zaki et al., 1997) in most cases. We review the nonordfp algorithm in Section 2.2.
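The two-scan FP-tree construction that FP-growth relies on can be sketched as follows (a simplified Python illustration of the standard construction, not the paper's C code; the header table is kept as plain lists rather than FP-growth's node-link chains):

```python
from collections import Counter, defaultdict

class Node:
    """FP-tree node: an item label, a count, a parent link, and children
    keyed by item."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    """Build an FP-tree in two scans.

    First scan: count item frequencies. Second scan: insert each
    transaction's frequent items, in descending frequency order, into a
    shared prefix tree, so transactions with common prefixes share nodes.
    Returns the root and a header table mapping each frequent item to the
    list of tree nodes that carry it.
    """
    counts = Counter(i for t in transactions for i in t)
    order = {i: c for i, c in counts.items() if c >= min_sup}
    root = Node(None, None)
    header = defaultdict(list)
    for t in transactions:
        items = sorted((i for i in t if i in order),
                       key=lambda i: (-order[i], i))   # frequency-descending
        node = root
        for i in items:
            if i in node.children:
                node.children[i].count += 1
            else:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
    return root, header
```

Mining then proceeds bottom-up from the header table: for each item, the paths from its nodes to the root form the conditional pattern base from which a conditional FP-tree is built recursively; it is this repeated rebuilding of conditional trees that FPgrowth∗, nonordfp, and the present paper seek to reduce.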
In this paper, we propose the IFP-growth (improved FP-growth) algorithm to improve the performance of FP-growth. First, IFP-growth employs an address-table structure to lower the complexity of mapping frequent 1-itemsets onto an FP-tree. Second, it uses a hybrid FP-tree mining method that reduces the need to rebuild conditional FP-trees, saving memory and lowering the cost of tree re-construction. We also present experimental results and compare our method with several existing algorithms, including FP-growth and nonordfp. Simulation results show that IFP-growth mines frequent itemsets efficiently with a smaller memory space requirement, and that under various minimum supports it outperforms FP-growth and nonordfp in execution time.
The remainder of this paper is organized as follows: Section 2 reviews FP-growth and related work. Section 3 and Section 4 present the IFP-growth mining algorithms and experimental results, respectively. Finally, Section 5 draws conclusions from this study.
Section snippets
Related work
FP-growth is a well-known frequent itemset mining algorithm. It scans the database only twice and finds all frequent itemsets more efficiently than the Apriori algorithm. Section 2.1 reviews the original FP-growth algorithm, and Section 2.2 describes an improved FP-growth algorithm, nonordfp, which can efficiently derive frequent itemsets from a database.
The improved FP-growth (IFP-growth) algorithm
The proposed algorithm utilizes the address-table structure to speed up tree construction and a hybrid FP-tree mining method for frequent itemset generation. We introduce the address-table and the hybrid FP-tree mining method in Sections 3.1 and 3.2, respectively. The entire IFP-growth algorithm is then presented in Section 3.3.
Experimental results
To assess the performance of IFP-growth, we used three algorithms, IFP-growth, FP-growth (Han et al., 2000, Han et al., 2004) and nonordfp (Racz, 2004), to mine frequent itemsets from various databases. The experiments were performed on an Intel Core 2 Duo 1.66 GHz processor with 512 MB of memory, running RedHat AS3.0 GNU/Linux. IFP-growth was coded in C; the FP-growth and nonordfp implementations were downloaded from http://fimi.cs.helsinki.fi/src/ and compared with IFP-growth.
Conclusion
By incorporating the FP-tree+ mining technique and the address-table into FP-growth, we propose the IFP-growth algorithm for frequent itemset generation. The major advantages of FP-tree+ and the address-table are that they reduce the need to rebuild conditional trees and facilitate tree construction. The memory requirement of IFP-growth is also lower than that of FP-growth and nonordfp. Experimental results showed that our algorithm is more than an order of magnitude faster than
References
- Li, H.-F., & Lee, S.-Y. (2009). Mining frequent itemsets over data streams using efficient window sliding techniques. Expert Systems with Applications.
- Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In...
- Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of...
- Grahne, G., & Zhu, J. (2005). Fast algorithms for frequent itemset mining using FP-trees. IEEE Transactions on Knowledge and Data Engineering.
- Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of the...