MEI: An efficient algorithm for mining erasable itemsets

https://doi.org/10.1016/j.engappai.2013.09.002

Highlights

  • The dPidset concept is defined and its theorems are derived for fast computation of itemset information.

  • Index of gain is used to avoid the duplication of data in erasable itemsets' information.

  • An effective O(m+n) algorithm for subtracting two dPidsets is presented to reduce the mining time.

  • We propose MEI which uses the divide-and-conquer and the difference pidset strategies for fully mining erasable itemsets.

  • The experimental results show that MEI is more efficient than VME and MERIT+ in both mining time and memory usage.

Abstract

Erasable itemset (EI) mining is an interesting variation of frequent itemset mining which allows managers to carefully consider their production plans to ensure the stability of the factory. Existing algorithms for EI mining require a lot of time and memory. This paper proposes an effective algorithm, called mining erasable itemsets (MEI), which uses the divide-and-conquer strategy and the difference pidset (dPidset) concept to mine EIs fully. Some theorems for efficiently computing itemset information, which reduce mining time and memory usage, are also derived. Experimental results show that MEI outperforms existing approaches in terms of both mining time and memory usage. Moreover, the proposed algorithm is capable of mining EIs with higher thresholds than those obtained using existing approaches.

Introduction

Data mining is the process of discovering interesting patterns in large databases, using methods at the intersection of artificial intelligence, machine learning, and statistics. Many problems in data mining have attracted research attention, including association rule mining (Agrawal and Srikant,, Grahne and Zhu, 2005, Han et al.,, Lin et al., 2011, Lucchese et al., 2006, Luna et al.,, Luna et al., 2012, Luna et al., 2013, Vo and Le, 2011, Vo et al., 2013, Wang et al.,, Zaki et al.,, Zaki and Hsiao, 2005), applications of association rule mining (Abdi and Giveki, 2013, Kang et al., 2012), cluster analysis (Agrawal et al., 1998), and classification (Liu et al.,, Nguyen et al., 2012, Özbakir and Delice, 2011). One of the issues in solving these problems is the mining of itemsets (Agrawal et al., 1993). Many methods for frequent itemset mining have been proposed, such as the Apriori algorithm (Agrawal and Srikant, 1994), the FP-tree algorithm (Han et al., 2000), methods based on IT-tree (Zaki and Gouda,, Vo et al., 2012), and methods for mining frequent itemsets in incremental databases (Hong et al., 2009, Le et al.,). Studies related to pattern mining include those on high-utility pattern mining (Hu and Mojsilovic, 2007, Le et al., 2011), the mining of discriminative and essential frequent patterns (Fan et al., 2008), approximate frequent pattern mining (Gupta et al., 2008), concise representation of frequent itemsets (Jin et al., 2009), proportional fault-tolerant frequent itemset mining (Poernomo and Gopalkrishnan, 2009), frequent pattern mining of uncertain data (Aggarwal et al.,, Bernecker et al.,), frequent weighted itemset mining (Yun et al., 2012, Vo et al., 2013), emerging pattern mining (Dong and Li,, Dong and Bailey, 2013), and erasable itemset mining (Deng et al.,, Deng and Xu,, Deng and Xu, 2012, Deng, 2013).

Deng et al. (2009) defined EIs for pattern mining. The problem originates from production planning associated with a factory which produces many types of products. Each product is created from a number of components (items) and creates profit. In order to produce all the products, the factory has to purchase and store these items. In a financial crisis, the factory cannot afford to purchase all the necessary items as usual; therefore, the managers should consider their production plans to ensure the stability of the factory. The problem is to find the itemsets which can be eliminated but do not greatly affect the factory's profit. In other words, the goal is to find the sets of itemsets which can be eliminated (erased), allowing managers to create a new production plan.

As another application of EI mining, assume that a factory produces n products. The managers know the prospects of a new product; however, producing it requires a financial investment, and the factory does not want to expand its current production. In this situation, the managers can use EI mining to locate the EIs and then replace the eliminated products with new products while keeping control of the factory's profit. With these EIs, the managers can determine which new products are beneficial for the factory without causing financial instability.

Several algorithms have been proposed to solve the EI mining problem, such as META (mining erasable itemsets with the anti-monotone property) (Deng et al., 2009), VME (vertical-format-based algorithm for mining erasable itemsets) (Deng and Xu, 2010), and MERIT (fast mining erasable itemsets) (Deng and Xu, 2012). The most notable of these is MERIT. However, MERIT loses a large number of EIs. Therefore, the present study proposes a revised algorithm, called MERIT+, which is capable of mining EIs fully. A second algorithm, called mining erasable itemsets (MEI), which is also capable of mining EIs fully, is then proposed. MEI scans the product database only once. It uses the divide-and-conquer strategy and the dPidset concept. MEI is more efficient than VME and MERIT+ in terms of mining time and memory usage. It is capable of mining EIs with higher thresholds than those obtained using VME and MERIT+.

The rest of the paper is organized as follows: Section 2 presents research related to EI mining. Section 3 formally states the problem. Section 4 introduces MEI for mining EIs fully. Section 5 presents experiments on the performance of the VME, MERIT+, and MEI algorithms. Conclusions and suggestions for future research are given in Section 6.

Section snippets

Related work

In 2009, Deng et al. defined EIs, the problem of mining EIs, and META, an Apriori-based algorithm to solve this problem. The results of META are all EIs. However, the mining time of this algorithm is long because:

  • (1)

    META scans the database once to determine the total profit of the factory and then k times to determine the information associated with each EI, where k is the maximum level of the resulting EIs. For example, META finds one or more erasable 5-itemsets, which is the maximum level of

Problem statements and definitions

Let I = {i1, i2, …, im} be a set of all items, which are the abstract representations of components of products. A product database is denoted by DB = {P1, P2, …, Pn}, where Pi (1 ≤ i ≤ n) is a product presented in the form of 〈Items, Val〉, where Items are the items (or components) that constitute Pi and Val is the profit that the factory obtains by selling the product Pi. A set X ⊆ I is also called an itemset, and an itemset with k items is called a k-itemset.
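The notation above can be illustrated with a minimal Python sketch. It assumes the definition commonly used in the EI mining literature, which this excerpt does not reproduce: the gain of an itemset X is the total Val of the products that contain at least one item of X, and X is erasable if its gain does not exceed the total profit multiplied by a user-given threshold. The names `Product`, `gain`, `is_erasable`, and the small database `DB` are hypothetical and are not the paper's Table 1.

```python
from typing import FrozenSet, List, NamedTuple


class Product(NamedTuple):
    """One product <Items, Val> of the product database."""
    items: FrozenSet[str]
    val: int


# Hypothetical product database for illustration only (not the paper's Table 1).
DB: List[Product] = [
    Product(frozenset({"a", "b"}), 100),
    Product(frozenset({"b", "c"}), 200),
    Product(frozenset({"c", "d"}), 300),
    Product(frozenset({"e"}), 400),
]


def total_profit(db: List[Product]) -> int:
    """Sum of Val over all products."""
    return sum(p.val for p in db)


def gain(itemset: FrozenSet[str], db: List[Product]) -> int:
    """Assumed definition: profit lost if every item in `itemset` is erased,
    i.e. the summed Val of products sharing at least one item with it."""
    return sum(p.val for p in db if p.items & itemset)


def is_erasable(itemset: FrozenSet[str], db: List[Product], threshold: float) -> bool:
    """Assumed definition: erasable if gain <= total profit * threshold."""
    return gain(itemset, db) <= total_profit(db) * threshold


if __name__ == "__main__":
    for x in (frozenset({"a"}), frozenset({"a", "e"})):
        print(sorted(x), gain(x, DB), is_erasable(x, DB, 0.3))
```

This direct version rescans the whole database for every itemset, which is the cost that vertical, pidset-based representations are meant to avoid.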

The example database in Table 1 is used

MEI algorithm

In this section, an effective algorithm for mining EIs is proposed based on the index of gain and dPidset concepts. The index of gain concept is first introduced. Then, the pidset concept and two theorems for determining the pidset associated with k-itemsets based on the pidset associated with (k−1)-itemsets are proposed. The dPidset concept is then proposed. Five theorems associated with dPidset are used to determine the itemset information. The effectiveness of dPidset is compared to that of
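The highlights mention an O(m+n) algorithm for subtracting two dPidsets, but this excerpt does not show it. The following Python sketch is one way such a subtraction could be done, assuming each dPidset is stored as an ascending list of product identifiers without duplicates; the function name `subtract_sorted` and this storage layout are assumptions, not the paper's code.

```python
from typing import List


def subtract_sorted(a: List[int], b: List[int]) -> List[int]:
    """Return the elements of `a` that are not in `b`.

    Both inputs are assumed to be sorted in ascending order with no
    duplicates. Each index only moves forward, so the running time is
    O(m + n) for m = len(a) and n = len(b).
    """
    result: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            # a[i] cannot appear later in b, so it belongs to the difference.
            result.append(a[i])
            i += 1
        elif a[i] > b[j]:
            # b[j] is smaller than every remaining element of a; skip it.
            j += 1
        else:
            # Common identifier: drop it from the result.
            i += 1
            j += 1
    # Whatever is left in a has no counterpart in b.
    result.extend(a[i:])
    return result


if __name__ == "__main__":
    print(subtract_sorted([1, 2, 4, 7], [2, 3, 7, 9]))
```

Keeping the lists sorted lets the output of one subtraction serve directly as the input to the next without re-sorting.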

Experimental environment

All experiments presented in this section were performed on a laptop with an Intel Core i3-3110M 2.4-GHz CPU and 4 GB of RAM. The operating system was Microsoft Windows 8. All the programs were coded in C# using Microsoft Visual Studio 2012 and run on Microsoft .NET Framework version 4.5.50709.

Experimental databases

The experiments are conducted on databases such as Accidents, Chess, Connect, Mushroom, Pumsb, and T10I4D100K which were downloaded from http://fimi.cs.helsinki.fi/data/. To make these databases look like

Conclusion and future work

This paper proposed the pidset concept and two theorems to determine the pidset associated with k-itemsets based on the pidset associated with (k−1)-itemsets. Then, dPidset and its theorems were defined for quickly determining EI information. These theorems are applied in the proposed algorithm along with a divide-and-conquer strategy for mining EIs fully. In order to demonstrate the effectiveness of MEI, experiments were conducted to compare VME, MERIT+, and MEI in terms of mining time and

References (40)

  • B. Vo et al.

    DBV-Miner: a dynamic bit-vector approach for fast mining frequent closed itemsets

    Expert Systems with Applications

    (2012)
  • B. Vo et al.

    A new method for mining frequent weighted itemsets based on WIT-trees

    Expert Systems with Applications

    (2013)
  • B. Vo et al.

    A lattice-based approach for mining most generalization association rules

    Knowledge-Based Systems

    (2013)
  • U. Yun et al.

    An efficient mining algorithm for maximal weighted frequent patterns in transactional databases

    Knowledge-Based Systems

    (2012)
  • Aggarwal, C.C., Li, Y., Wang, J., Wang, J., 2009. Frequent pattern mining with uncertain data. In: SIGKDD'09, pp....
  • R. Agrawal et al.

    Database mining: a performance perspective

    IEEE Transactions on Knowledge and Data Engineering

    (1993)
  • Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: VLDB'94, pp....
  • Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P., 1998. Automatic subspace clustering of high dimensional data for...
  • Bernecker, T., Kriegel, H., Renz, M., Verhein, F., Zuefle, A., 2009. Probabilistic frequent itemset mining in uncertain...
  • Z.H. Deng

    Mining top-rank-k erasable itemsets by PID_lists

    International Journal of Intelligent Systems

    (2013)