An improved frequent pattern growth method for mining association rules

https://doi.org/10.1016/j.eswa.2010.10.047

Abstract

Many algorithms have been proposed to mine association rules efficiently. One of the most important is FP-growth, which avoids candidate generation: it compresses the information needed for mining frequent itemsets into an FP-tree and recursively constructs conditional FP-trees to find all frequent itemsets. Performance results have demonstrated that FP-growth performs extremely well. In this paper, we propose the IFP-growth (improved FP-growth) algorithm to improve the performance of FP-growth. IFP-growth has three major features. First, it employs an address-table structure to lower the complexity of forming the entire FP-tree. Second, it uses a new structure, called FP-tree+, to reduce the need for building conditional FP-trees recursively. Third, by using the address-table and FP-tree+, the proposed algorithm requires less memory and performs better than FP-tree based algorithms. The experimental results show that IFP-growth requires relatively little memory during the mining process: even when the minimum support is low, the space needed by IFP-growth is about one half of that of FP-growth and about one fourth of that of the nonordfp algorithm. As to execution time, our method outperforms FP-growth by a factor of one to 300 under different minimum supports, and also outperforms nonordfp in most cases. As a result, IFP-growth is well suited to high-performance applications.

Introduction

Mining association rules is a very important problem in the data mining field. It consists of identifying the frequent itemsets, and then forming conditional implication rules among them. This information is useful in improving the quality of many business decision-making processes, such as customer purchasing behavior analysis, cross-marketing and catalog design.

The task of mining association rules is formally stated as follows: let I = {i1, i2, …, in} be a set of items and D be a multiset of transactions, where each transaction T contains a set of items in I. We call a subset X ⊆ I an itemset, and call X a k-itemset if X contains k items. The support of itemset X, denoted sup(X), is the number of transactions in D that contain all items in X. If sup(X) is not less than min_sup (a user-specified minimum support), we call X a frequent itemset. An association rule is a conditional implication between itemsets, X ⇒ Y, where X ⊆ I, Y ⊆ I, X ≠ ∅, Y ≠ ∅, and X ∩ Y = ∅.

The confidence of the association rule, given as sup(X ∪ Y)/sup(X), is the conditional probability that a transaction T containing X also contains Y. The problem of association rule mining is the discovery of all association rules whose support and confidence are not less than min_sup and the user-defined minimum confidence, respectively.
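
As a concrete illustration of these definitions, the following sketch computes support and confidence over a toy transaction database (the items and transactions are invented for illustration):

```python
# Toy transaction database D; items and transactions are invented.
D = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def sup(X):
    """sup(X): number of transactions in D containing every item of X."""
    return sum(1 for T in D if X <= T)

X, Y = {"diapers"}, {"beer"}
support = sup(X | Y)              # support of the rule X => Y
confidence = sup(X | Y) / sup(X)  # conditional probability of Y given X
```

Here {diapers} appears in three of the four transactions and {diapers, beer} in two, so the rule {diapers} ⇒ {beer} has support 2 and confidence 2/3.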

Discovering frequent itemsets is the computationally intensive step in the task of mining association rules. The major challenge is that the mining often needs to generate a huge number of candidate itemsets: if there are n items in the database, then in the worst case all 2^n − 1 candidate itemsets need to be generated and examined. The step of rule construction is straightforward and less expensive. Thus, most researchers concentrate on the first phase, finding frequent itemsets (Agrawal et al., 1993, Agrawal and Srikant, 1994, Han et al., 2000, Li and Lee, 2009, Orlando et al., 2003, Park et al., 1997, Zaki et al., 1997).
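
The exponential count above can be checked directly; this small sketch enumerates every non-empty candidate itemset over n = 4 illustrative items:

```python
from itertools import chain, combinations

items = ["a", "b", "c", "d"]  # n = 4 illustrative items
n = len(items)

# Every non-empty subset of the item set is a potential candidate itemset.
candidates = list(chain.from_iterable(
    combinations(items, k) for k in range(1, n + 1)))

print(len(candidates))  # 15 == 2**4 - 1
```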

Agrawal and Srikant (1994) proposed the Apriori algorithm to solve the problem of mining frequent itemsets. Apriori uses a candidate generation method in which the frequent k-itemsets found in one iteration are used to construct the candidate (k + 1)-itemsets for the next iteration. Apriori terminates when no new candidate itemsets can be generated. DHP, proposed by Park et al. (1997), improves the performance of Apriori: it uses a hash table to filter out infrequent candidate 2-itemsets and employs database trimming to lower the cost of database scanning. However, these methods cannot avoid scanning the database many times to verify frequent itemsets.
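
The candidate-generation loop described above can be sketched as follows (a minimal, unoptimized rendering of Apriori's join, prune, and count steps, not the authors' implementation):

```python
from itertools import combinations

def apriori(D, min_sup):
    """Minimal Apriori sketch: the frequent k-itemsets of one iteration
    seed the candidate (k+1)-itemsets of the next; the loop stops when
    no candidates survive."""
    def sup(X):
        return sum(1 for T in D if X <= T)

    items = {i for T in D for i in T}
    Lk = [frozenset([i]) for i in sorted(items)
          if sup(frozenset([i])) >= min_sup]
    frequent = list(Lk)
    while Lk:
        k = len(Lk[0]) + 1
        prev = set(Lk)
        # Join: union pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune: every k-subset of a candidate must itself be frequent;
        # then verify the survivors with one pass over the database.
        Lk = [c for c in candidates
              if all(frozenset(s) in prev for s in combinations(c, k - 1))
              and sup(c) >= min_sup]
        frequent.extend(Lk)
    return frequent
```

On the toy database [{a, b}, {a, c}, {a, b, c}, {b, c}] with min_sup = 2, the sketch returns the three frequent items and the three frequent 2-itemsets; {a, b, c} is generated as a candidate but rejected because its support is 1.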

Unlike Apriori, the FP-growth method (Han et al., 2004, Han et al., 2000) uses an FP-tree to store the frequency information of the transaction database. Without candidate generation, FP-growth uses a recursive divide-and-conquer method and a database projection approach to find the frequent itemsets. However, the recursive mining process may decrease mining performance and raise the memory requirement. FPgrowth∗ (Grahne & Zhu, 2005) uses an FP-array technique to reduce the need to traverse FP-trees, but it still has to generate conditional FP-trees for recursive mining; its reported running time and memory consumption are almost equal to those of FP-growth. Nonordfp (Racz, 2004) improves FP-growth by employing a more compact tree structure to raise mining performance. According to the results in Racz (2004), nonordfp outperforms FPgrowth∗ (Grahne & Zhu, 2005) and eclat (Zaki et al., 1997) in most cases. We review the nonordfp algorithm in Section 2.2.

In this paper, we propose the IFP-growth (improved FP-growth) algorithm to improve the performance of FP-growth. First, IFP-growth employs an address-table structure to lower the complexity of mapping frequent 1-itemsets into an FP-tree. Second, it uses a hybrid FP-tree mining method to reduce the need for rebuilding conditional FP-trees, saving memory and avoiding much of the cost of re-constructing them. We also present experimental results comparing our method to several existing algorithms, including FP-growth and nonordfp. The results show that IFP-growth mines frequent itemsets efficiently with a smaller memory requirement, and that under various minimum supports it outperforms both FP-growth and nonordfp in execution time.

The remainder of this paper is organized as follows: Section 2 reviews FP-growth and related work. Section 3 and Section 4 present the IFP-growth mining algorithms and experimental results, respectively. Finally, Section 5 draws conclusions from this study.

Related work

FP-growth is a well-known frequent itemset mining algorithm. It scans the database only twice and, in comparison with the Apriori algorithm, finds all frequent itemsets efficiently. Section 2.1 reviews the original FP-growth algorithm and Section 2.2 describes an improved variant, nonordfp, which can efficiently derive frequent itemsets from a database.
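
The two database scans that FP-growth uses to build its FP-tree can be sketched as follows (a simplified illustration; the node and header-table layouts in the original papers differ in detail):

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(D, min_sup):
    """Two scans, as in FP-growth: scan 1 counts item frequencies; scan 2
    inserts each transaction's frequent items in descending global-frequency
    order, sharing prefixes to compress the database."""
    freq = Counter(i for T in D for i in T)                       # scan 1
    rank = {i: r for r, (i, c) in enumerate(freq.most_common())
            if c >= min_sup}

    root = Node(None, None)
    header = defaultdict(list)  # item -> all tree nodes holding that item
    for T in D:                                                   # scan 2
        node = root
        for item in sorted((i for i in T if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header
```

The header table is what the mining phase later walks to collect each item's conditional pattern base; a real implementation links the nodes instead of keeping Python lists.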

The improved FP-growth (IFP-growth) algorithm

The proposed algorithm utilizes the address-table structure to speed up tree construction and a hybrid FP-tree mining method for frequent itemset generation. We introduce the address-table and the hybrid FP-tree mining method in Sections 3.1 and 3.2, respectively. Then, the entire IFP-growth algorithm is presented in Section 3.3.
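
Since this snippet does not give the exact layout of the address-table, the sketch below only illustrates the general idea of constant-time child lookup during tree construction; the class and field names are hypothetical and not taken from the paper:

```python
# Hypothetical sketch: each node carries a fixed-size array indexed by item
# rank, so locating the child for an item during insertion is one array
# access rather than a scan over a child list. The paper's actual
# address-table layout (Section 3.1) may differ.
class IFPNode:
    def __init__(self, rank, parent, n_items):
        self.rank, self.parent, self.count = rank, parent, 0
        self.addr = [None] * n_items  # address-table: item rank -> child

def insert_path(root, ranked_items, n_items):
    """Insert one transaction whose items are already mapped to
    frequency ranks (0 = most frequent)."""
    node = root
    for r in sorted(ranked_items):
        child = node.addr[r]
        if child is None:             # O(1) lookup via the address-table
            child = IFPNode(r, node, n_items)
            node.addr[r] = child
        child.count += 1
        node = child
    return node
```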

Experimental results

To assess the performance of IFP-growth, we used three algorithms, IFP-growth, FP-growth (Han et al., 2000, Han et al., 2004), and nonordfp (Racz, 2004), to mine frequent itemsets from various databases. The experiments were performed on a 1.66 GHz Intel Core 2 Duo processor with 512 MB of memory, running Red Hat AS 3.0 GNU/Linux. IFP-growth was implemented in C. The FP-growth and nonordfp implementations were downloaded from http://fimi.cs.helsinki.fi/src/ and compared with IFP-growth

Conclusion

By incorporating the FP-tree+ mining technique and the address-table into FP-growth, we propose the IFP-growth algorithm for frequent itemset generation. The major advantages of FP-tree+ and the address-table are that they reduce the need to rebuild conditional trees and facilitate the task of tree construction. The memory requirement of IFP-growth is also lower than that of FP-growth and nonordfp. Experimental results showed that our algorithm is more than an order of magnitude faster than

References

  • Li, H. F., & Lee, S. Y. (2009). Mining frequent itemsets over data streams using efficient window sliding techniques. Expert Systems with Applications.
  • Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In...
  • Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of...
  • Grahne, G., & Zhu, J. (2005). Fast algorithms for frequent itemset mining using FP-trees. IEEE Transactions on Knowledge and Data Engineering.
  • Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of the...
