A new approach to mine frequent patterns using item-transformation methods

doi:10.1016/j.is.2007.01.001

Information Systems

Volume 32, Issue 7, November 2007, Pages 1056-1072

https://doi.org/10.1016/j.is.2007.01.001

Abstract

Mining frequent patterns is a fundamental and crucial task in data mining. The algorithms reported in the literature for mining frequent patterns can be classified into two approaches: the candidate generation-and-test approach (for example, the Apriori algorithm) and the pattern-growth approach (such as the FP-growth algorithm). Both approaches are slow on large databases. This paper proposes a novel and simple approach that belongs to neither of them. The approach treats the database as a stream of data and finds frequent patterns by scanning the database only once. In addition, it can mine frequent patterns incrementally when transactions are subsequently inserted into or updated in the database. Three versions of the approach (mapping-table, transformation-function, and logic-circuit) are provided. The logic-circuit version is the first to mine frequent patterns with simple logic gates, and modeling of this version shows that it is thousands of times faster than the FP-growth algorithm. Analyses and simulations of the approach are also performed. The analyses show that the transformation-function version is much better than the Apriori and FP-growth algorithms in storage complexity. The simulation results show that the mapping-table version is comparable to the FP-growth algorithm in execution time.

Introduction

Data mining has become more important than ever because it can discover valuable but hidden knowledge in databases. Its applications can be found in many areas, such as evaluating the risks of financial investment, detecting credit-card fraud, and healthcare research on the population [1], [2], [3], [4]. A pattern is a specific itemset in a database, which consists of items in the database. A pattern is called a frequent pattern if its frequency of occurrence is larger than or equal to a predefined threshold (called the minimum support threshold).

Mining frequent patterns is a fundamental and crucial step in solving problems in areas such as association rules [5], [6], correlations [7], sequential patterns [8], [9], classification [10], maximal frequent patterns [11], [12], [13], and frequent closed patterns [14], [15], [16]. According to the literature [3], [17], the algorithms for mining frequent patterns can be classified into two approaches, namely candidate generation-and-test and pattern-growth. The representative algorithm of the first approach is the Apriori algorithm [5]. Apriori and the Apriori-like algorithms [4], [18], [19], [20], [21] all generate a candidate set from the database and then evaluate each candidate in the set to check whether it is a frequent pattern. In evaluating these candidates, the approach needs to scan the database as many times as the maximal length of the patterns. Although the approach is simple, it is time-consuming when databases are large. The second approach, which also includes many algorithms [4], [6], [14], [22], [23], [24], was proposed to speed up the mining process by reducing the number of database scans. The FP-growth algorithm [6], the best known of this approach, uses a compact data structure, called an FP-tree, to store the transaction information of the database. Instead of scanning the database, the algorithm scans the FP-tree to find frequent patterns. Compared to the first approach, the second approach is more efficient but needs more space to store the intermediate data structure.
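To make the scan count of the candidate generation-and-test approach concrete, the following is a minimal textbook-style sketch of level-wise Apriori. It is not code from the paper or from [5]; the function name `apriori`, the `scans` counter, and the use of an absolute support count are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise candidate generation-and-test.

    Returns (frequent, scans): a dict mapping each frequent itemset to its
    support count, and the number of full passes made over the database.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, scans, k = {}, 0, 1
    while candidates:
        scans += 1  # one full database pass per candidate length
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:  # candidate is contained in the transaction
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent, scans


db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent, scans = apriori(db, min_support=2)
print(scans, len(frequent))
```

In this toy database the longest frequent pattern has three items, and the sketch makes three passes over the data, which matches the behaviour described above.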

In enterprises, new data is produced continuously, and the influence of obsolete data should be decreased. Managers need to mine databases periodically to reflect the current state of transactions and tune their marketing strategies accordingly. To mine the updated database, the candidate generation-and-test approach needs to run the entire mining process again, while the pattern-growth approach needs to rebuild the intermediate data structure (such as the FP-tree) and then perform the pattern-growth operation again [6].

This paper proposes a simple, new approach that belongs to neither of the above two approaches. It uses a simple method to transform the transactions in a database into their corresponding patterns and accumulates the frequency of occurrence of these patterns. After scanning all the transactions, the approach can immediately find the frequent patterns. In addition, the approach can mine databases incrementally if some transactions are subsequently inserted, deleted, or updated. Three versions of this approach are proposed: one is built upon a mapping table, another upon a transformation function, and the third is implemented with a logic circuit. To the best of our knowledge, the third version is the first to mine frequent patterns with a logic circuit. The logic-circuit version, as expected, is much faster than all the other algorithms. The simulations show that the transformation-function version performs slightly worse than the FP-growth algorithm [6], but the mapping-table version performs better than the FP-growth algorithm provided that the number of items is not large.

Section snippets

Notations and preliminary

Let $Γ$ denote a transaction database consisting of a set of $n$ transactions, and let $I$ denote the set of items that may appear in $Γ$. Assume that there are $m$ items in $I$, denoted as $|I| = m$, and that $I$ is represented as the set $\{i_{0}, i_{1}, \dots, i_{m-1}\}$. For a transaction $t$ in $Γ$, the itemset contained in $t$, denoted as $S_{t}$, is the set of items whose elements are all from $I$, i.e., $S_{t} \subseteq I$. For simplicity, $S_{t}$ can be represented as a vector, as is defined in the following:

Definition 1

Assume that there is a transaction database $Γ$

Item-transformation approach

This paper proposes a new approach, called the item-transformation approach, which scans the database only once. During the scan, the approach transforms the itemset contained in each transaction into its corresponding pattern vector and then accumulates these pattern vectors. After scanning all the transactions, the approach filters the final results of the accumulation according to the value of $ξ$ and then reports the frequent patterns. The following are the steps of the approach:
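The sketch below illustrates the one-scan transform, accumulate, and filter idea described above. It is not the paper's step list or any of its three versions. It rests on assumptions that are not taken from the text: item $i_{j}$ is mapped to bit $j$ of an integer, the "pattern vector" of a transaction is taken to mark every non-empty subset of its itemset, and $ξ$ is treated as an absolute count. The names `to_bitmask`, `submasks`, and `mine_one_scan` are hypothetical.

```python
def to_bitmask(itemset, item_index):
    """Encode an itemset as an integer with bit j set when item i_j is present."""
    mask = 0
    for item in itemset:
        mask |= 1 << item_index[item]
    return mask


def submasks(mask):
    """Yield every non-empty sub-bitmask of mask, i.e. every pattern it contains."""
    sub = mask
    while sub:
        yield sub
        sub = (sub - 1) & mask


def mine_one_scan(transactions, items, xi):
    """Single pass over the transactions, then filter by threshold xi.

    Assumption: acc has one counter per possible pattern (2**m slots),
    so this illustration is only practical when the number of items m is small.
    """
    item_index = {item: j for j, item in enumerate(items)}
    acc = [0] * (1 << len(items))
    for t in transactions:  # the only scan of the database
        for pattern in submasks(to_bitmask(t, item_index)):
            acc[pattern] += 1  # accumulate this transaction's contribution
    return {
        frozenset(items[j] for j in range(len(items)) if (p >> j) & 1): count
        for p, count in enumerate(acc)
        if p and count >= xi
    }


db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for pattern, count in mine_one_scan(db, ["a", "b", "c"], xi=3).items():
    print(sorted(pattern), count)
```

The dense counter array keeps the accumulation step to simple additions, at the cost of storage that grows exponentially with the number of items in this sketch.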


Analyses and simulation

To evaluate the performance of the proposed approach, mathematical analyses are presented first. Two parameters used in the analyses are $n$ and $m$, which are the number of transactions in database $Γ$ and the number of items that may appear in $Γ$, respectively.

Discussion and conclusion

An important characteristic of the proposed approach is that it accumulates the pattern vectors one by one. If some transactions are inserted into the database, their corresponding pattern vectors can easily be added to the results of the accumulation. Similarly, if some transactions are updated, the obsolete pattern vectors are removed, and the new pattern vectors are added. However, for the update in the database, the Apriori algorithm needs to re-generate the candidates and test them
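The incremental behaviour of accumulated pattern vectors can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: items are encoded as bits of an integer, a transaction contributes one count to every non-empty subset of its itemset, and the class name `PatternAccumulator` and its methods are hypothetical.

```python
class PatternAccumulator:
    """Keeps one counter per pattern so single transactions can be added or removed."""

    def __init__(self, items):
        self.items = list(items)
        self.index = {item: j for j, item in enumerate(self.items)}
        self.acc = [0] * (1 << len(self.items))  # 2**m counters: small m only

    def _apply(self, itemset, delta):
        mask = 0
        for item in itemset:
            mask |= 1 << self.index[item]
        sub = mask
        while sub:  # visit every non-empty sub-pattern of the transaction
            self.acc[sub] += delta
            sub = (sub - 1) & mask

    def insert(self, itemset):
        self._apply(itemset, +1)

    def delete(self, itemset):
        self._apply(itemset, -1)

    def update(self, old_itemset, new_itemset):
        # Remove the obsolete contribution, then add the new one.
        self.delete(old_itemset)
        self.insert(new_itemset)

    def frequent(self, xi):
        m = len(self.items)
        return {
            frozenset(self.items[j] for j in range(m) if (p >> j) & 1): count
            for p, count in enumerate(self.acc)
            if p and count >= xi
        }


store = PatternAccumulator(["a", "b", "c"])
for t in [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]:
    store.insert(t)
store.update({"a", "c"}, {"b", "c"})
print(store.frequent(xi=2))
```

Each insert, delete, or update touches only the counters of the affected transaction, so no earlier transaction needs to be rescanned in this sketch.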

References (31)

R.C. Agarwal et al., A tree projection algorithm for generation of frequent item sets, J. Parallel Distributed Comput. (2001)

K. Ng et al., A data mining application: customer retention at the Port of Singapore Authority (PSA)

R.J. Brachman et al., Mining business databases, Commun. ACM (1996)

G. Liu et al., Ascending frequency ordered prefix-tree: efficient mining of frequent patterns

Q. Zou et al., A pattern decomposition algorithm for finding all frequent patterns in large datasets, Knowl. Inf. Syst. (2002)

R. Agrawal et al., Fast algorithms for mining association rules

J. Han et al., Mining frequent patterns without candidate generation

S. Brin et al., Beyond market baskets: generalizing association rules to correlations

R. Agrawal et al., Mining sequential patterns

R. Srikant et al., Mining sequential patterns: generalizations and performance improvements

D. Meretakis et al., Scalable association-based text classification

R.C. Agarwal et al., Depth first generation of long patterns

D. Burdick et al., MAFIA: a maximal frequent itemset algorithm for transactional databases

R.J. Bayardo, Efficiently mining long patterns from databases

J. Pei, J. Han, R. Mao, CLOSET: an efficient algorithm for mining frequent closed itemsets, ACM SIGMOD Workshop on...

