Elsevier

Information Systems

Volume 32, Issue 7, November 2007, Pages 1056-1072

A new approach to mine frequent patterns using item-transformation methods

https://doi.org/10.1016/j.is.2007.01.001

Abstract

Mining frequent patterns is a fundamental and crucial task in data mining. The algorithms reported in the literature for mining frequent patterns can be classified into two approaches: the candidate generation-and-test approach (for example, the Apriori algorithm) and the pattern-growth approach (such as the FP-growth algorithm). Both approaches suffer from slow speed on large databases. This paper proposes a novel and simple approach that does not belong to either of these two approaches. The approach treats the database as a stream of data and finds frequent patterns by scanning the database only once. In addition, the approach can mine frequent patterns incrementally if transactions are subsequently inserted into or updated in the database. Three versions of the approach (i.e., mapping-table, transformation-function, and logic-circuit) are provided. The logic-circuit version is the first to mine frequent patterns with simple logic gates, and modeling of this version shows that it is thousands of times faster than the FP-growth algorithm. Analyses and simulations of the approach are also performed. The analyses show that the transformation-function version is much better than the Apriori and FP-growth algorithms in storage complexity. The simulation results show that the mapping-table version is comparable to the FP-growth algorithm in execution time.

Introduction

Data mining has become more important than ever because it can discover valuable but hidden knowledge from databases. Applications of data mining can be found in many areas, such as evaluating the risks of financial investment, detecting credit-card fraud, and conducting healthcare research on the population [1], [2], [3], [4]. A pattern is a specific itemset in a database, which consists of the items in the database. A pattern is called a frequent pattern if its frequency of occurrence is greater than or equal to a predefined threshold (called the minimum support threshold).
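The definition of a frequent pattern can be illustrated with a minimal Python sketch. The function names, the toy transactions, and the use of an absolute count as the minimum support threshold are assumptions made for illustration only; they are not taken from the paper.

```python
def support(pattern, transactions):
    """Count the transactions that contain every item of `pattern`."""
    pattern = frozenset(pattern)
    return sum(1 for t in transactions if pattern <= frozenset(t))


def is_frequent(pattern, transactions, min_support):
    """A pattern is frequent if its frequency of occurrence is greater
    than or equal to the minimum support threshold (here an absolute count)."""
    return support(pattern, transactions) >= min_support


# Hypothetical toy database of four transactions.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]

ab_support = support({"a", "b"}, transactions)          # contained in 2 transactions
ab_frequent = is_frequent({"a", "b"}, transactions, 2)  # meets the threshold
bc_frequent = is_frequent({"b", "c"}, transactions, 2)  # occurs only once
```

With a threshold of 2, the pattern {a, b} is frequent because two transactions contain it, whereas {b, c} is not because only one does.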

Mining frequent patterns is a fundamental and crucial step in solving problems in areas such as association rules [5], [6], correlations [7], sequential patterns [8], [9], classification [10], maximal frequent patterns [11], [12], [13], and frequent closed patterns [14], [15], [16]. According to the literature [3], [17], the algorithms for mining frequent patterns can be classified into two approaches, namely, candidate generation-and-test and pattern-growth. The representative algorithm of the first approach is the Apriori algorithm [5]. The Apriori and Apriori-like algorithms [4], [18], [19], [20], [21] all need to generate a candidate set from the database and then evaluate each candidate in the set to check whether it is a frequent pattern. In evaluating these candidates, the approach needs to scan the database as many times as the maximal length of the patterns. Although the approach is simple, it is time-consuming when databases are large. The second approach, which also includes many algorithms [4], [6], [14], [22], [23], [24], was proposed to speed up the mining process by reducing the number of database scans. The FP-growth algorithm [6], the best-known algorithm of this approach, uses a compact data structure, called an FP-tree, to store the transaction information of the database. Instead of scanning the database, the algorithm scans the FP-tree to find frequent patterns. Compared to the first approach, the second approach is more efficient but needs more space to store the intermediate data structure.

In enterprises, new data is produced continuously, and the influence of obsolete data should be decreased. Managers need to mine databases periodically to reflect the current state of transactions and adjust their marketing strategies accordingly. To mine the updated database, the candidate generation-and-test approach needs to run the entire mining process again, while the pattern-growth approach needs to rebuild the intermediate data structure (such as the FP-tree) and then perform the pattern-growth operation again [6].

This paper proposes a simple, new approach that does not belong to either of the above two approaches. It uses a simple method to transform the transactions in a database into their corresponding patterns and accumulates the frequency of occurrence of these patterns. After scanning all the transactions, the approach can immediately find the frequent patterns. In addition, the approach can mine databases incrementally if some transactions are subsequently inserted, deleted, or updated. Three versions of this approach are proposed: one is built upon a mapping table, another upon a transformation function, and the third is implemented by a logic circuit. To the best of our knowledge from the literature, the third version is the first to mine frequent patterns by logic circuit. The execution time of the logic-circuit version, as expected, is much shorter than that of all the other algorithms. The simulations show that the transformation-function version performs slightly worse than the FP-growth algorithm [6], but the mapping-table version performs better than the FP-growth algorithm provided that the number of items is not large.

Section snippets

Notations and preliminary

Let Γ denote a transaction database consisting of a set of n transactions, and let I denote the set of items that may appear in Γ. Assume that there are m items in I, denoted as $|I| = m$, and that I is represented as the set $\{i_0, i_1, \ldots, i_{m-1}\}$. For a transaction t in Γ, the itemset contained in t, denoted as $S_t$, is a set whose elements are all from I, i.e., $S_t \subseteq I$. For simplicity, $S_t$ can be represented as a vector, as is defined in the following:

Definition 1

Assume that there is a transaction database Γ

Item-transformation approach

In this paper, a new approach, called the item-transformation approach, which scans the database only once, is proposed. During the scanning, the approach transforms the itemset contained in a transaction into its corresponding pattern vector and then accumulates these pattern vectors. After scanning all the transactions, the approach filters the final results of the accumulation according to the value of ξ and then reports the frequent patterns. The following are the steps of the approach:

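The one-scan idea described in this snippet (transform each transaction into a pattern vector, accumulate, then filter by ξ) can be sketched in Python as follows. The snippet does not give the exact definition of the pattern vector, so this sketch assumes that it marks every non-empty subset of the transaction's itemset, encodes itemsets as bitmasks, and treats ξ as an absolute count named min_support. All function names are hypothetical, and this is not the paper's mapping-table, transformation-function, or logic-circuit version.

```python
from collections import Counter


def itemset_to_mask(itemset, item_index):
    """Encode an itemset as an integer whose bit k is set if item k is present."""
    mask = 0
    for item in itemset:
        mask |= 1 << item_index[item]
    return mask


def submasks(mask):
    """Yield every non-empty subset of `mask` (assumed pattern-vector contents)."""
    sub = mask
    while sub:
        yield sub
        sub = (sub - 1) & mask


def mask_to_itemset(mask, items):
    """Decode a bitmask back into a frozenset of items."""
    return frozenset(item for k, item in enumerate(items) if (mask >> k) & 1)


def mine_frequent(transactions, items, min_support):
    """Scan the transactions once, accumulate pattern counts, then filter."""
    item_index = {item: k for k, item in enumerate(items)}
    counts = Counter()
    for t in transactions:  # the single scan of the database
        for sub in submasks(itemset_to_mask(t, item_index)):
            counts[sub] += 1
    return {mask_to_itemset(mask, items): c
            for mask, c in counts.items() if c >= min_support}


# Hypothetical toy database over the items a, b, c.
items = ["a", "b", "c"]
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
frequent = mine_frequent(transactions, items, min_support=2)
```

Under these assumptions, the work per transaction grows exponentially with the number of items the transaction contains, so the sketch is only practical when the number of items is small.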

Analyses and simulation

To evaluate the performance of the proposed approach, mathematical analyses are presented first. Two parameters used in the analyses are m and n, which are the number of items that may appear in Γ and the number of transactions in database Γ, respectively.

Discussion and conclusion

An important characteristic of the proposed approach is that it accumulates the pattern vectors one by one. If some transactions are inserted into the database, their corresponding pattern vectors can easily be added to the results of the accumulation. Similarly, if some transactions are updated, the obsolete pattern vectors are removed, and the new pattern vectors are added. However, for the update in the database, the Apriori algorithm needs to re-generate the candidates and test them
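The incremental behaviour described in this snippet can be sketched in Python as a running accumulation that supports insertion, deletion, and update. The class and method names are hypothetical, and the sketch assumes that a transaction's pattern vector corresponds to all non-empty subsets of its itemset; the paper's own representation may differ.

```python
from collections import Counter
from itertools import combinations


def patterns_of(itemset):
    """Yield every non-empty subset of `itemset` (assumed pattern-vector contents)."""
    items = sorted(itemset)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


class PatternCounts:
    """Running accumulation of pattern counts (hypothetical helper class)."""

    def __init__(self):
        self.counts = Counter()

    def insert(self, itemset):
        # Add the new transaction's patterns to the accumulation.
        for p in patterns_of(itemset):
            self.counts[p] += 1

    def delete(self, itemset):
        # Remove an obsolete transaction's patterns; assumes that
        # `itemset` was inserted earlier.
        for p in patterns_of(itemset):
            self.counts[p] -= 1
            if self.counts[p] == 0:
                del self.counts[p]

    def update(self, old_itemset, new_itemset):
        # An update removes the obsolete patterns and adds the new ones.
        self.delete(old_itemset)
        self.insert(new_itemset)

    def frequent(self, min_support):
        return {p for p, c in self.counts.items() if c >= min_support}


acc = PatternCounts()
acc.insert({"a", "b"})
acc.insert({"a", "c"})
before = acc.frequent(2)           # only {a} occurs twice
acc.update({"a", "c"}, {"a", "b"})
after = acc.frequent(2)            # {a}, {b}, and {a, b} now occur twice
```

No rescan of earlier transactions is needed in this sketch: each change touches only the patterns of the affected transaction.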

References (31)

  • R.C. Agarwal et al.

    A tree projection algorithm for generation of frequent item sets

    J. Parallel Distributed Comput.

    (2001)
  • K. Ng et al.

    A data mining application: customer retention at the Port of Singapore Authority (PSA)

  • R.J. Brachman et al.

    Mining business databases

    Commun. ACM

    (1996)
  • G. Liu et al.

    Ascending frequency ordered prefix-tree: efficient mining of frequent patterns

  • Q. Zou et al.

    A pattern decomposition algorithm for finding all frequent patterns in large datasets

    Knowl. Inf. Syst.

    (2002)
  • R. Agrawal et al.

    Fast algorithms for mining association rules

  • J. Han et al.

    Mining frequent patterns without candidate generation

  • S. Brin et al.

    Beyond market baskets: generalizing association rules to correlations

  • R. Agrawal et al.

    Mining sequential patterns

  • R. Srikant et al.

    Mining sequential patterns: generalizations and performance improvements

  • D. Meretakis et al.

    Scalable association-based text classification

  • R.C. Agarwal et al.

    Depth first generation of long patterns

  • D. Burdick et al.

    MAFIA: a maximal frequent itemset algorithm for transactional databases

  • R.J. Bayardo

    Efficiently mining long patterns from databases

  • J. Pei, J. Han, R. Mao, CLOSET: an efficient algorithm for mining frequent closed itemsets, ACM SIGMOD Workshop on...
Cited by (15)

    • Efficient single-pass frequent pattern mining using a prefix-tree

      2009, Information Sciences
      Citation Excerpt :

      However, the static nature of the FP-tree still limits its applicability to incremental and interactive frequent pattern mining, and its two database scan requirement prevents its use in frequent pattern mining on a data stream. Many algorithms have been proposed to discover frequent patterns in incremental databases and to maintain association rules in dynamically updated databases [6,9,16,20,26,24,33,23,34,28]. FUP2 [9] and UWEP [30] are based on the framework of the Apriori-algorithm.

    • Evaluation of security risks using Apriori algorithm

      2020, ACM International Conference Proceeding Series
    • The application of Apriori algorithm for network forensics analysis

      2013, Journal of Theoretical and Applied Information Technology