Elsevier

Expert Systems with Applications

Volume 105, 1 September 2018, Pages 129-143

negFIN: An efficient algorithm for fast mining frequent itemsets

https://doi.org/10.1016/j.eswa.2018.03.041

Highlights

  • A fast algorithm for frequent itemset mining is proposed.

  • The experimental results on benchmark datasets indicate that this algorithm is effective.

  • This algorithm is based on a new data structure to store information about itemsets.

  • This data structure employs a novel encoding model for nodes in a prefix tree.

  • This encoding model is based on the bitmap representation of sets.

Abstract

Frequent itemset mining is a basic data mining task and has numerous applications in other data mining tasks. In recent years, some data structures based on sets of nodes in a prefix tree have been presented. These data structures store essential information about frequent itemsets. In this paper, we propose another efficient data structure, NegNodeset. Similar to other such data structures, the basis of NegNodeset is sets of nodes in a prefix tree. NegNodeset employs a novel encoding model for nodes in a prefix tree based on the bitmap representation of sets. Based on the NegNodeset data structure, we propose negFIN, which is an efficient algorithm for frequent itemset mining. The efficiency of the negFIN algorithm rests on three factors: (1) the NegNodesets of itemsets are extracted using bitwise operators, (2) the complexity of calculating NegNodesets and counting supports is reduced to O(n), where n is the cardinality of NegNodeset, and (3) it employs a set-enumeration tree to generate frequent itemsets and uses a promotion method to prune the search space in this tree. Our extensive performance study on a variety of benchmark datasets indicates that negFIN is the fastest algorithm among the previous state-of-the-art algorithms it was compared with, although it runs at the same speed as dFIN on some datasets.

Introduction

"Frequent itemset mining" is one of the important data mining tasks and has numerous applications in other data mining tasks, such as the discovery of association rules (Ceglar & Roddick, 2006), clustering (Wang, Wang, Yang, & Yu, 2002), and classification (Cheng, Yan, Han, & Yu, 2008). The task was originally used for market basket analysis and was first proposed by Agrawal, Imieliński, and Swami (1993). It aims to find items in a customer transaction database that are frequently bought together.

Let I = {i_1, i_2, …, i_{nit}} be the set of all items in the transactional database; let a transaction T be a set of items (T ⊆ I) with a unique identifier TID; and let a database DB = {T_1, T_2, …, T_{nt}} be the set of transactions. Each P with P ⊆ I is called an "itemset." P is also called a k-itemset, where |P| = k. A transaction T contains an itemset P if and only if P ⊆ T; the support of P, denoted support(P), is defined as the percentage of transactions in DB containing P. Let min-support be the user-defined minimum support threshold. P is called a frequent itemset if and only if min-support ≤ support(P). Given the database DB and the min-support threshold, the frequent itemset mining task is defined as "discovering all frequent itemsets with their supports." The number of itemsets that have to be checked to discover frequent itemsets is 2^{nit}, where nit = |I|. Therefore, the problem of discovering frequent itemsets is NP.
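As a concrete illustration of these definitions, the following minimal sketch counts supports by brute force over a toy database. The database contents, the names `support` and `frequent_itemsets`, and the threshold of 0.5 are hypothetical and chosen only for illustration. The exhaustive enumeration tests every non-empty subset of I, which is exactly the exponential search space described above, and it is not the method proposed in this paper.

```python
from itertools import combinations


def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in db if itemset <= t) / len(db)


def frequent_itemsets(db, min_support):
    """Brute force: test every non-empty subset of the item universe I."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            s = support(candidate, db)
            if s >= min_support:  # min-support <= support(P)
                result[frozenset(candidate)] = s
    return result


# Hypothetical toy database with four transactions over I = {a, b, c}.
DB = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
FREQUENT = frequent_itemsets(DB, min_support=0.5)
```

With this toy input, the itemsets {b, c} and {a, b, c} occur in only one of the four transactions and are therefore not reported as frequent.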

Frequent itemset mining has been a hot research topic in the data mining field for the last two decades (Aliberti et al., 2015; Calders et al., 2014; Deng, 2014; Deng et al., 2011; Lan et al., 2015; Lin et al., 2015; Troiano & Scibelli, 2014; Vo et al., 2015). In recent years, four types of data structures based on the sets of nodes in a "prefix tree" have been presented to enhance the efficiency of mining frequent itemsets: (1) Node-list (Deng & Wang, 2010), (2) N-list (Deng, Wang, & Jiang, 2012), (3) Nodeset (Deng & Lv, 2014), and (4) DiffNodeset (Deng, 2016). All of these data structures employ a prefix tree with encoded nodes and associate a set of nodes with each itemset. The nodes in Node-list and N-list are encoded by the pre-order rank and post-order rank of the node. Two algorithms, PPV (Deng & Wang, 2010) and PrePost (Deng et al., 2012), have been proposed for mining frequent itemsets based on these two data structures, respectively. These two algorithms outperform their predecessors, but they have a drawback: they consume a lot of memory (Deng & Lv, 2014). To overcome this problem, another data structure, called Nodeset (Deng & Lv, 2014), has been proposed. Unlike N-list and Node-list, the nodes in a Nodeset are encoded only by the pre-order (or post-order) rank of the nodes (Deng & Lv, 2014). The Nodeset of each k-itemset (3 ≤ k) is extracted by the intersection of the Nodesets of two (k-1)-itemsets (Deng & Lv, 2014). The FIN algorithm (Deng & Lv, 2014) has been proposed for frequent itemset mining based on this structure. The disadvantage of Nodeset is that the Nodeset cardinality becomes very large for some datasets (Deng, 2016). To overcome this problem, another data structure, DiffNodeset (Deng, 2016), has been proposed. In contrast to Nodeset, the DiffNodeset of each k-itemset (3 ≤ k) is extracted by the difference between the DiffNodesets of two (k-1)-itemsets (Deng, 2016). Extensive experiments show that the cardinality of DiffNodeset is smaller than that of Nodeset (Deng, 2016). The dFIN algorithm (Deng, 2016) has been proposed for mining frequent itemsets based on the DiffNodeset data structure. Experimental results show that the dFIN algorithm is faster than its predecessors (Deng, 2016).
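The cost of combining two sets of nodes can be illustrated with a generic linear merge. The sketch below is not the Nodeset or DiffNodeset operation itself, whose details are given in the cited papers. It only assumes that each set is stored as an ascending list of integer node codes, and shows that a merge-style intersection or difference of two such lists with m and n elements needs a single pass of O(m + n) steps. The function names and the sample codes are hypothetical.

```python
def sorted_intersection(xs, ys):
    """Intersect two ascending lists of node codes in O(len(xs) + len(ys))."""
    i = j = 0
    out = []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out


def sorted_difference(xs, ys):
    """Elements of xs that are absent from ys, again in one merge pass."""
    i = j = 0
    out = []
    while i < len(xs):
        if j == len(ys) or xs[i] < ys[j]:
            out.append(xs[i])  # xs[i] cannot appear later in ys
            i += 1
        elif xs[i] == ys[j]:
            i += 1
            j += 1
        else:
            j += 1
    return out


# Hypothetical node codes (assumption: any consistent integer ranking works).
CODES_P = [1, 4, 6, 9, 12]
CODES_Q = [4, 5, 9, 13]
COMMON = sorted_intersection(CODES_P, CODES_Q)
ONLY_P = sorted_difference(CODES_P, CODES_Q)
```

Both functions advance at least one index per loop iteration, which is where the linear bound in the combined length of the two lists comes from.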

Despite the advantages of DiffNodeset, we find that calculating the difference between two DiffNodesets takes a long time on some databases. To overcome this problem, we propose a new data structure, NegNodeset, which, like the previous four data structures, employs a prefix tree. Unlike these data structures, NegNodeset employs a new encoding model for nodes, based on the bitmap representation of sets. Consider a universal set U with cardinality n. We can represent each subset of U by a bitmap of size n. Each element of U is assigned to one of the bits in the bitmap. If an element is a member of a subset S (S ⊆ U), then its corresponding bit is 1; otherwise it is 0. For example, let the universal set be U = {a_3, a_2, a_1, a_0}, with subsets A = {a_3, a_2} and B = {a_3, a_0}. Using two bitmaps of size four, in which each a_i (0 ≤ i ≤ 3) is assigned to the ith bit, these subsets are represented as A = 1100 and B = 1001. With this representation of sets, some common set operators can be implemented faster using bitwise operators. For example, to calculate the intersection (union) of two given sets, we can use the bitwise operator AND (OR) on their corresponding bitmaps. Bitwise operators are implemented efficiently in CPUs; on a bitmap that fits in a machine word, such an operation typically completes in one CPU cycle.
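The bitmap example can be reproduced in a few lines. This is a minimal sketch of the general bitmap-of-sets idea only, not of the NegNodeset encoding; the helper names `to_bitmap` and `from_bitmap` are hypothetical, and the set difference at the end is an extra operation that the text above does not mention.

```python
# Universe U = {a3, a2, a1, a0}; element a_i is assigned to bit i.
UNIVERSE = ["a0", "a1", "a2", "a3"]  # list index == bit position


def to_bitmap(subset):
    """Encode a subset of UNIVERSE as an integer bitmap."""
    bits = 0
    for element in subset:
        bits |= 1 << UNIVERSE.index(element)
    return bits


def from_bitmap(bits):
    """Decode an integer bitmap back into a set of element names."""
    return {name for i, name in enumerate(UNIVERSE) if (bits >> i) & 1}


A = to_bitmap({"a3", "a2"})  # 0b1100
B = to_bitmap({"a3", "a0"})  # 0b1001

INTERSECTION = A & B   # bitwise AND: 0b1000, i.e. {a3}
UNION = A | B          # bitwise OR:  0b1101, i.e. {a3, a2, a0}
DIFFERENCE = A & ~B    # extra, not in the text: 0b0100, i.e. {a2}
```

Each set operation here is a single integer operation regardless of how many elements the subsets contain, as long as the universe fits in one machine word.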

Based on the NegNodeset data structure, we propose negFIN, a fast algorithm for mining frequent itemsets. The efficiency of the negFIN algorithm rests on three factors: (1) new NegNodesets are extracted using bitwise operators, which are fast; (2) the complexity of extracting new NegNodesets and counting their supports is reduced to O(n), instead of O(m + n) in previous algorithms, where m and n are the cardinalities of two sets of nodes, and n ≤ m; and (3) it employs a "set-enumeration tree" (Rymon, 1992) to generate frequent itemsets and uses a promotion method to prune the search space in this tree. This pruning strategy sometimes generates frequent itemsets directly, without candidate generation.
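To make the role of the set-enumeration tree more tangible, here is a minimal depth-first sketch under simplifying assumptions. It counts supports by scanning the transactions directly and prunes only with the anti-monotone property (an infrequent itemset is never extended). It does not implement NegNodesets or the promotion method, and the function name `mine`, the toy database, and the use of an absolute count threshold instead of a percentage are all hypothetical choices.

```python
def mine(db, min_count):
    """Depth-first walk of a set-enumeration tree over the frequent items.

    Each tree node is an itemset; its children extend it with items that come
    later in a fixed order, so every itemset is generated at most once.
    Infrequent nodes are not expanded (anti-monotone pruning).
    """
    def count(itemset):
        return sum(1 for t in db if itemset <= t)

    # Level 1 of the tree: frequent single items in a fixed (sorted) order.
    items = sorted(
        i for i in set().union(*db) if count(frozenset([i])) >= min_count
    )
    found = {}

    def expand(prefix, start):
        for pos in range(start, len(items)):
            candidate = prefix | {items[pos]}
            c = count(candidate)
            if c >= min_count:
                found[candidate] = c
                expand(candidate, pos + 1)  # children use later items only

    expand(frozenset(), 0)
    return found


# Hypothetical toy database; an itemset is frequent if it occurs at least twice.
TOY_DB = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
RESULT = mine(TOY_DB, min_count=2)
```

In this run the node {a, b} is frequent and gets expanded to {a, b, c}, which is rejected, while {b, c} is rejected directly and none of its supersets is ever generated from it.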

We conducted several experimental studies to evaluate the performance of the negFIN algorithm. We compared the performance of negFIN against dFIN (Deng, 2016), Goethals's Eclat (Goethals & Zaki, 2004), and FP-growth* (Grahne & Zhu, 2005), which have been the leading algorithms in the field of frequent itemset mining so far. The experimental results show that negFIN has good performance and, compared to the above mentioned algorithms, runs faster or equally fast. It runs faster than Goethals's Eclat (Goethals & Zaki, 2004) and FP-growth* (Grahne & Zhu, 2005) on all datasets. It still runs faster than dFIN (Deng, 2016) on some datasets, but runs as fast as dFIN (Deng, 2016) on other datasets.

The rest of this paper is organized as follows: Section 2 discusses background and related work for frequent itemset mining. Section 3 introduces basic definitions and properties relevant to the NegNodeset structure and the negFIN algorithm. Section 4 explains the negFIN algorithm. Section 5 shows experimental results. Section 6 concludes the paper, and Section 7 provides some future research directions.

Section snippets

Related work

Many algorithms have been proposed to discover all frequent itemsets efficiently. These algorithms are divided into two main categories: (1) algorithms that use the "candidate generation" method, and (2) algorithms that use the "pattern growth" method (Ceglar & Roddick, 2006). In the candidate generation method, the candidate itemsets are generated first, and then frequent itemsets are identified from these candidate itemsets. This method employs an anti-monotone property, called Apriori (

Basic terminologies

In addition, a similar data structure named PUN-list (Deng, 2018) has been proposed to discover "high utility itemsets," a new kind of mining task that is different from frequent itemset mining. In this task, each item has a utility value and can occur more than once in a transaction. A high utility itemset is an itemset whose utility is not less than a given minimum threshold. In addition to storing information about frequent itemsets, the PUN-list data structure also stores information

negFIN: the proposed algorithm

negFIN employs a set-enumeration tree (Definition 11) to represent the search space. The framework of negFIN consists of three steps. In the first step, the BMC-tree is constructed, all frequent 1-itemsets and their Nodesets are identified, and level 1 of the set-enumeration tree is constructed. In the second step, all frequent 2-itemsets and their NegNodesets are identified, and level 2 of the set-enumeration tree is constructed. In the third step, all frequent k-itemsets (3 ≤ k) and their

Results of experiment and analysis

In order to evaluate the performance of the negFIN algorithm, we conducted two groups of experiments. The purpose of the first group of experiments is to compare the performance of the negFIN algorithm against the following algorithms: (1) Goethals's Eclat (Goethals & Zaki, 2004), which is the state-of-the-art algorithm in the family of vertical mining algorithms (Deng et al., 2012), and (2) FP-growth* (Grahne & Zhu, 2005), which is the state-of-the-art algorithm in the family of FP-tree-based

Conclusion

In this paper, we presented a new data structure, called NegNodeset, to store essential information about frequent itemsets. Based on NegNodeset, we presented an algorithm, called negFIN, to rapidly discover all frequent itemsets in databases. Compared with dFIN, the key advantages of negFIN are as follows: (1) it employs bitwise operators to generate new sets of nodes; (2) it reduces the time complexity of discovering frequent itemsets to O(ln), instead of O(l(m + n)), where m and n are the

Future research directions

Future research directions are as follows: employing NegNodeset to (1) mine "closed frequent itemsets" (Le and Vo, 2015, Lee et al., 2008, Wang et al., 2003), (2) mine "maximal frequent itemsets" (Burdick et al., 2005, Roberto and Bayardo, 1998), (3) mine "Top-Rank-k frequent itemsets" (Deng, 2014, Huynh-Thi-Le et al., 2015), (4) mine "erasable itemsets" (Le, Vo, & Nguyen, 2014), (5) mine "fuzzy itemsets" (Lan et al., 2015, Lin et al., 2015), (6) mine "frequent disjunctive closed itemsets" (

Acknowledgments

The authors offer their gratitude to Dr. Zhi-Hong Deng of the School of Electronics Engineering and Computer Science, Peking University, Beijing, for providing the implementation code of the FIN algorithm (Deng & Lv, 2014). We used the FIN implementation code to implement the dFIN algorithm (Deng, 2016) as well as our algorithm, negFIN.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References (44)

  • X. Li et al. Mining frequent patterns from network flows for monitoring network. Expert Systems with Applications (2010)
  • J.C.-W. Lin et al. A CMFFP-tree algorithm to mine complete multiple fuzzy frequent itemsets. Applied Soft Computing (2015)
  • M.K. Sohrabi et al. Parallel frequent itemset mining using systolic arrays. Knowledge-Based Systems (2013)
  • L. Troiano et al. Mining frequent itemsets in data streams within a time horizon. Data & Knowledge Engineering (2014)
  • R. Vimieiro et al. Disclosed: An efficient depth-first, top-down algorithm for mining disjunctive closed itemsets in high-dimensional data. Information Sciences (2014)
  • B. Vo et al. Fast updated frequent-itemset lattice for transaction deletion. Data & Knowledge Engineering (2015)
  • R. Agrawal et al. Mining association rules between sets of items in large databases. SIGMOD Record (1993)
  • R. Agrawal et al. Fast algorithms for mining association rules in large databases
  • D. Burdick et al. MAFIA: A maximal frequent itemset algorithm. IEEE Transactions on Knowledge and Data Engineering (2005)
  • A. Ceglar et al. Association mining. ACM Computing Surveys (2006)
  • J.H. Chang et al. Finding recent frequent itemsets adaptively over online data streams
  • H. Cheng et al. Direct discriminative pattern mining for effective classification
