Elsevier

Knowledge-Based Systems

Volume 111, 1 November 2016, Pages 283-298
FHN: An efficient algorithm for mining high-utility itemsets with negative unit profits

https://doi.org/10.1016/j.knosys.2016.08.022

Abstract

High utility itemset mining is an emerging data mining task, which consists of discovering highly profitable itemsets (called high utility itemsets) in very large transactional databases. Many algorithms have been proposed to efficiently discover high utility itemsets, but most of them assume that items may only have positive unit profits. However, in real-world transactional databases, items (products) often have positive or negative unit profits. Mining high utility itemsets in a transactional database where items have positive or negative unit profits is a computationally expensive task, and it is thus desirable to design more efficient algorithms. To address this issue, we propose an efficient algorithm named FHN (Faster High-Utility itemset miner with Negative unit profits). It relies on a novel PNU-list (Positive-and-Negative Utility-list) structure to efficiently mine high utility itemsets, while considering both positive and negative unit profits. Moreover, several pruning strategies are introduced in FHN to reduce the number of candidate itemsets, and thus enhance the performance of FHN. Extensive experimental results on both real-life and synthetic datasets show that the proposed FHN algorithm is in general two to three orders of magnitude faster and can use up to 200 times less memory than the state-of-the-art algorithm HUINIV-Mine. Moreover, it is shown that FHN performs especially well on dense datasets.

Introduction

Frequent Itemset Mining (FIM) [1], [11], [12], [22], [29] is a core data mining task that is essential to a wide range of applications. Given a transactional database containing a large number of transactions and a user-specified minimum support threshold, FIM aims at discovering frequent itemsets, that is, sets of items having occurrence frequencies no less than the minimum support threshold [1]. However, an important limitation of FIM is that each item cannot appear more than once in a transaction and that all items are assumed to have the same importance (i.e., weight, cost, risk, unit profit or value). But this assumption does not hold in real-world applications. For example, transactions made at retail stores usually contain information about the purchase quantities of items, and not all items have the same unit profit. If traditional FIM algorithms are applied to such a database, they discard this information and may thus discover many frequent itemsets that generate a low profit.

To address this issue, the problem of FIM has been redefined as High-Utility Itemset Mining (HUIM). HUIM considers both the purchase quantities of items in transactions, and the unit profit of items, to discover the items/itemsets in a database that generate a high profit (have a utility that is no less than a minimum utility threshold). The discovered patterns are called high utility itemsets (HUIs). HUIM has many real-life applications such as website click stream analysis, cross-marketing in retail stores, and biomedical applications [2], [17], [21]. Many research issues related to HUIM have also emerged such as high-utility stream data mining [19], high-utility episode mining [24], high-utility sequential pattern mining [27], [28], and high-utility sequential rule mining [30].
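As a concrete illustration of the utility measure described above, the following minimal sketch computes the utility of an itemset as the sum, over the transactions that contain it, of purchase quantity times unit profit. The toy database, the profit table and the names `itemset_utility` and `high_utility_itemsets` are assumptions made for illustration. The exhaustive enumeration is only meant to make the definition tangible; it is not how any of the cited algorithms search.

```python
from itertools import combinations

# Hypothetical toy database: each transaction maps an item to its purchase quantity.
TRANSACTIONS = [
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "c": 3},
    {"b": 1, "d": 4},
]
# Hypothetical unit profits (all positive in this example).
UNIT_PROFIT = {"a": 5, "b": 2, "c": 1, "d": 3}


def itemset_utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit over every transaction containing the whole itemset."""
    total = 0
    for transaction in transactions:
        if all(item in transaction for item in itemset):
            total += sum(transaction[item] * unit_profit[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, unit_profit, min_utility):
    """Brute force: test every non-empty itemset against the minimum utility threshold."""
    items = sorted(unit_profit)
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            utility = itemset_utility(candidate, transactions, unit_profit)
            if utility >= min_utility:
                result[candidate] = utility
    return result


huis = high_utility_itemsets(TRANSACTIONS, UNIT_PROFIT, min_utility=14)
```

With a minimum utility threshold of 14, this toy database yields three high utility itemsets: {a} with utility 15, {a, c} with utility 19 and {b, d} with utility 14.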

The problem of HUIM is widely recognized as more difficult than the problem of FIM. In FIM, the well-known downward-closure property states that the support of an itemset is anti-monotonic; that is, all supersets of an infrequent itemset are infrequent and all subsets of a frequent itemset are frequent. This property is very powerful for pruning the search space in FIM. In HUIM, however, the utility of an itemset is neither monotonic nor anti-monotonic, which means that a high utility itemset may have supersets or subsets with lower, equal or higher utility [2], [18], [21]. Thus, techniques to prune the search space developed for FIM cannot be directly applied in HUIM.
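The contrast between support and utility can be checked on a small example. The sketch below uses a hypothetical toy database and profit table (both are assumptions, as are the function names `support` and `utility`) and reports the two measures for an itemset and some of its supersets.

```python
# Hypothetical toy database: item -> purchase quantity per transaction.
TRANSACTIONS = [
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "c": 3},
    {"b": 1, "d": 4},
]
# Hypothetical unit profits.
UNIT_PROFIT = {"a": 5, "b": 2, "c": 1, "d": 3}


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if all(item in t for item in itemset))


def utility(itemset, transactions, unit_profit):
    """Total quantity * unit profit of the itemset over the transactions containing it."""
    return sum(
        sum(t[item] * unit_profit[item] for item in itemset)
        for t in transactions
        if all(item in t for item in itemset)
    )


# Compare an itemset with some of its supersets.
for itemset in [("a",), ("a", "b"), ("a", "c"), ("b",), ("b", "d")]:
    print(itemset, support(itemset, TRANSACTIONS), utility(itemset, TRANSACTIONS, UNIT_PROFIT))
```

On this data, extending {a} to {a, b} lowers the utility from 15 to 9, while extending it to {a, c} raises it to 19; the support never increases in either case. Similarly, {b} has a utility of only 6 but its superset {b, d} reaches 14, so discarding {b} for having a low utility would wrongly discard {b, d} as well.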

A popular approach for HUIM is to discover high-utility itemsets in two phases by using the Transaction-Weighted Utilization (TWU) downward closure model [2], [18], [21]. This approach has been adopted by numerous algorithms such as Two-Phase [18], IHUP [2], UP-Growth and UP-Growth+ [21]. The approach consists of first generating a set of candidate high-utility itemsets by overestimating their utility in Phase I. After that, the algorithms perform an additional database scan in Phase II to calculate the exact utility of the discovered candidates and filter out low-utility itemsets.
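The two-phase idea can be sketched as follows, assuming that the TWU of an itemset is the sum of the utilities of the transactions that contain it. The toy data and all function names are hypothetical, and Phase I enumerates itemsets exhaustively instead of level by level, so this is an illustration of the model rather than of any specific algorithm cited above.

```python
from itertools import combinations

# Hypothetical toy database and unit profits (all positive).
TRANSACTIONS = [
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "c": 3},
    {"b": 1, "d": 4},
]
UNIT_PROFIT = {"a": 5, "b": 2, "c": 1, "d": 3}


def contains(transaction, itemset):
    return all(item in transaction for item in itemset)


def transaction_utility(transaction, unit_profit):
    """Utility of a whole transaction: sum of quantity * unit profit of its items."""
    return sum(qty * unit_profit[item] for item, qty in transaction.items())


def twu(itemset, transactions, unit_profit):
    """Upper bound: sum of the utilities of all transactions containing the itemset."""
    return sum(transaction_utility(t, unit_profit) for t in transactions if contains(t, itemset))


def exact_utility(itemset, transactions, unit_profit):
    return sum(
        sum(t[item] * unit_profit[item] for item in itemset)
        for t in transactions
        if contains(t, itemset)
    )


def two_phase(transactions, unit_profit, min_utility):
    items = sorted(unit_profit)
    # Phase I: keep every itemset whose overestimated utility (TWU) reaches the threshold.
    candidates = [
        c
        for size in range(1, len(items) + 1)
        for c in combinations(items, size)
        if twu(c, transactions, unit_profit) >= min_utility
    ]
    # Phase II: compute exact utilities and filter out low-utility candidates.
    huis = {}
    for c in candidates:
        u = exact_utility(c, transactions, unit_profit)
        if u >= min_utility:
            huis[c] = u
    return candidates, huis


candidates, huis = two_phase(TRANSACTIONS, UNIT_PROFIT, min_utility=14)
```

With a threshold of 14, Phase I retains six candidates ({a}, {b}, {c}, {d}, {a, c} and {b, d}), of which only three survive Phase II. Even on this tiny example, half of the candidates are low-utility itemsets that had to be generated and then checked against the database.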

Although the TWU model has been widely used in HUIM, it suffers from an important drawback. It considers a huge number of low-utility itemsets as candidates, since the TWU model uses a loose upper bound called the transaction utility to overestimate the utility of candidates. Recently, a more efficient approach named HUI-Miner [17] was proposed to directly mine high-utility itemsets using a single database scan. Based on a vertical data structure (the utility-list), HUI-Miner uses the actual utility and the remaining utility of an itemset in a database to calculate a tighter upper bound, and thus prunes the search space more effectively. Experimental results have shown that the HUI-Miner algorithm outperforms previous HUIM algorithms and is thus the current state-of-the-art algorithm for HUIM [17]. However, the task of high-utility itemset mining remains very costly in terms of execution time. Therefore, it remains an important challenge to design more efficient algorithms.
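A utility-list in the spirit of the description above can be sketched as a per-itemset list of entries holding a transaction identifier, the utility of the itemset in that transaction, and the utility of the items that follow it in a fixed processing order. The field names `iutil` and `rutil`, the alphabetical item order and the toy data are assumptions made for illustration; the join shown only covers combining two single items.

```python
from dataclasses import dataclass

# Hypothetical toy database and unit profits (all positive).
TRANSACTIONS = [
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "c": 3},
    {"b": 1, "d": 4},
]
UNIT_PROFIT = {"a": 5, "b": 2, "c": 1, "d": 3}


@dataclass
class Element:
    tid: int    # transaction identifier
    iutil: int  # utility of the itemset in this transaction
    rutil: int  # utility of the items that follow the itemset in the processing order


def build_utility_lists(transactions, unit_profit):
    """Build the utility-list of every single item in one pass over the database."""
    lists = {}
    for tid, transaction in enumerate(transactions):
        ordered = sorted(transaction)  # assumed processing order: alphabetical
        remaining = sum(transaction[i] * unit_profit[i] for i in ordered)
        for item in ordered:
            u = transaction[item] * unit_profit[item]
            remaining -= u
            lists.setdefault(item, []).append(Element(tid, u, remaining))
    return lists


def join(list_x, list_y):
    """Utility-list of {x, y} for single items x and y, with x before y in the order."""
    by_tid = {e.tid: e for e in list_y}
    return [
        Element(ex.tid, ex.iutil + by_tid[ex.tid].iutil, by_tid[ex.tid].rutil)
        for ex in list_x
        if ex.tid in by_tid
    ]


def exact_utility(utility_list):
    return sum(e.iutil for e in utility_list)


def upper_bound(utility_list):
    """Actual plus remaining utility: bounds the utility of the itemset and its extensions."""
    return sum(e.iutil + e.rutil for e in utility_list)


lists = build_utility_lists(TRANSACTIONS, UNIT_PROFIT)
ac = join(lists["a"], lists["c"])
```

On this data, the bound for {c} is 4, whereas the sum of the utilities of the transactions containing c is 23. With a threshold of 14, {c} and its extensions can therefore be discarded from the lists alone, without being kept as candidates.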

Besides, although many studies have been carried out to develop efficient algorithms for HUIM (e.g. Two-Phase [18], IHUP [2], UP-Growth [21], HUI-Miner [17], FHM [9], BAHUI [20] and HUP-Miner [10]), they are designed under the assumption that all items in transactional databases have positive weights/unit profits. Thus, most algorithms developed for HUIM cannot be directly applied to mine HUIs while considering items having negative weights/unit profits, which occur in many real-life transaction databases. For example, if a customer buys three units of an item A in a supermarket, (s)he may receive one unit of item B for free as a promotion for product B. Now suppose that each unit of item A yields a profit of five dollars, and each unit of item B that is given away costs two dollars. Although giving away a unit of item B results in a loss of two dollars for the supermarket, selling three units of A that are cross-promoted with item B generates 15 dollars. Thus, the supermarket has a net gain of 13 dollars each time that this promotion is applied.

It was shown that if classical HUIM algorithms are applied to databases containing items with negative unit profits, they can generate an incomplete set of HUIs [4]. The reason is that these algorithms overestimate the utilities of itemsets to prune the search space. But when items with negative unit profits are considered, these estimations may become underestimations, and numerous HUIs may be pruned. Recently, the HUINIV-Mine algorithm [4] was developed to handle the problem of HUIM with both positive and negative unit profits. The TS-HOUN algorithm [13] was then proposed, which considers both the on-shelf time periods of items and negative unit profits. But the state-of-the-art algorithm for mining HUIs while considering negative unit profits remains HUINIV-Mine [4]. However, mining HUIs with negative unit profits remains very costly in terms of execution time and memory [4], [13]. Therefore, it is an important challenge to design more efficient algorithms to address these limitations.
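The promotion example with items A and B makes this failure easy to reproduce. In the sketch below, the single transaction and the function names are hypothetical. The `positive_only` variant, which sums only positive utilities when computing the transaction utility, is shown as one possible remedy and is an assumption, not a definition taken from the text above.

```python
# Promotion example: three units of A (profit 5 each) and one free unit of B (cost 2).
UNIT_PROFIT = {"A": 5, "B": -2}
TRANSACTIONS = [{"A": 3, "B": 1}]


def utility(itemset, transactions, unit_profit):
    """Exact utility of the itemset over the transactions that contain it."""
    return sum(
        sum(t[item] * unit_profit[item] for item in itemset)
        for t in transactions
        if all(item in t for item in itemset)
    )


def twu(itemset, transactions, unit_profit, positive_only=False):
    """Sum of transaction utilities; optionally ignore items with negative utility."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            for item, qty in t.items():
                u = qty * unit_profit[item]
                if u > 0 or not positive_only:
                    total += u
    return total


u_a = utility(("A",), TRANSACTIONS, UNIT_PROFIT)                    # 3 * 5
naive_bound = twu(("A",), TRANSACTIONS, UNIT_PROFIT)                # 3 * 5 - 2
safe_bound = twu(("A",), TRANSACTIONS, UNIT_PROFIT, positive_only=True)
```

Here the utility of {A} is 15, but the naive transaction-based estimate is only 13 because the loss on B is subtracted. With a minimum utility threshold of 14, pruning on the naive estimate would discard {A} even though it is a high utility itemset. Ignoring negative utilities in the estimate gives 15, which is again an upper bound on the utilities of both {A} and {A, B}.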

In this paper, we address the challenge of designing a more efficient algorithm for discovering high utility itemsets from a transactional database by considering both positive and negative unit profits. We present a novel algorithm named FHN (Fast High-utility itemset miner with Negative unit profits) to mine HUIs. Based on the designed vertical PNU-list data structure and several pruning strategies, FHN can efficiently handle negative unit profits. Experimental results on both real-life and synthetic datasets show that the proposed FHN algorithm is in general two to three orders of magnitude faster than the state-of-the-art HUINIV-Mine algorithm and performs well on dense datasets. The key contributions of this paper are as follows:

  1. A vertical list structure, called PNU-list (positive-and-negative utility-list), is designed to maintain all the information required for mining HUIs without performing multiple time-consuming database scans. The designed PNU-list structure allows FHN to directly mine HUIs without generating candidates.

  2. Two efficient pruning strategies named remaining utility pruning and EUCP pruning are further proposed to reduce the search space when using the PNU-list structure, and thus speed up the mining process for obtaining HUIs.

  3. A modified LA-Prune strategy is adopted in FHN to prune numerous unpromising candidates early when constructing PNU-lists.

  4. An extensive experimental study is carried out on several real-life datasets. Results show that the proposed algorithm outperforms the state-of-the-art HUINIV-Mine algorithm in terms of runtime, memory consumption and scalability.
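To give an intuition for the first two contributions, the sketch below shows one way a positive-and-negative utility-list could be organised: each entry keeps the positive utility, the negative utility and the remaining positive utility of an itemset in a transaction. The field names (`putil`, `nutil`, `rputil`), the item order (positive-profit items before negative-profit ones), the pruning bound and the toy data are all assumptions made for illustration and should not be read as the definitions given in Section 4.

```python
from dataclasses import dataclass

# Hypothetical toy data: item b has a negative unit profit.
UNIT_PROFIT = {"a": 5, "b": -2, "c": 1}
TRANSACTIONS = [
    {"a": 3, "b": 1},
    {"a": 1, "b": 2, "c": 2},
    {"c": 4},
]


@dataclass
class PNUElement:
    tid: int     # transaction identifier
    putil: int   # positive part of the itemset's utility in this transaction
    nutil: int   # negative part of the itemset's utility (<= 0)
    rputil: int  # positive utility of the items that follow in the processing order


def build_pnu_lists(transactions, unit_profit):
    """Build one list per item; assumed order: positive-profit items first, then negative."""
    def order_key(item):
        return (unit_profit[item] < 0, item)

    lists = {}
    for tid, t in enumerate(transactions):
        ordered = sorted(t, key=order_key)
        remaining_positive = sum(max(t[i] * unit_profit[i], 0) for i in ordered)
        for item in ordered:
            u = t[item] * unit_profit[item]
            remaining_positive -= max(u, 0)
            lists.setdefault(item, []).append(
                PNUElement(tid, max(u, 0), min(u, 0), remaining_positive)
            )
    return lists


def join(list_x, list_y):
    """List of {x, y} for single items x and y, with x before y in the assumed order."""
    by_tid = {e.tid: e for e in list_y}
    return [
        PNUElement(ex.tid, ex.putil + by_tid[ex.tid].putil,
                   ex.nutil + by_tid[ex.tid].nutil, by_tid[ex.tid].rputil)
        for ex in list_x
        if ex.tid in by_tid
    ]


def exact_utility(pnu_list):
    return sum(e.putil + e.nutil for e in pnu_list)


def pruning_bound(pnu_list):
    """Assumed remaining-utility bound: negative utilities are ignored so it never underestimates."""
    return sum(e.putil + e.rputil for e in pnu_list)


lists = build_pnu_lists(TRANSACTIONS, UNIT_PROFIT)
ab = join(lists["a"], lists["b"])
```

On this data, {a, b} has an exact utility of 14 (20 from a minus 6 lost on b), and the bound for {a} is 22, which is no less than the utility of {a} or of any of its extensions. Keeping the negative part separate is what allows the exact utility to be read from the list while the bound stays an overestimate.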

The rest of this paper is organized as follows. Related work is discussed in Section 2. The preliminaries and problem definition are given in Section 3. The proposed FHN algorithm is described in Section 4. An extensive experimental evaluation is presented in Section 5. Finally, the conclusion and future work are discussed in Section 6.

Section snippets

Related work

In this section, related work is discussed. The section reviews (1) the main approaches for frequent itemset mining, (2) previous work on high-utility itemset mining, and (3) state-of-the-art algorithms for mining high utility itemsets with negative values.

Preliminaries and problem definition

In this section, we introduce some important preliminary definitions relative to high utility itemset mining and formalize the problem of HUIM while considering negative unit profit values.

Definition 1 Transaction database

Let $I$ be a set of items (symbols). An itemset is a group of items $X \subseteq I$, and is said to be of length $k$ or to be a $k$-itemset if it contains $k$ items. A transaction database is a set of transactions $D = \{T_1, T_2, \ldots, T_n\}$ such that for each transaction $T_c$, $T_c \subseteq I$ and $T_c$ has a unique identifier $c$ called its tid. Each

Proposed FHN algorithm

In this section, we propose a Faster High-Utility itemset miner with Negative unit profits (FHN) algorithm based on a newly designed Positive-and-Negative Utility list (PNU-list) structure. Several pruning strategies are also designed to prune the search space early, thus speeding up the mining process. The PNU-list structure is inspired by the utility-list structure from HUI-Miner [17] but also has some key differences. Some properties of the designed approach for handling the negative item unit

Experimental study

The goal of this paper is to propose a more efficient algorithm for mining HUIs when considering items having either positive or negative utilities. In this section, we thus compare the performance of the proposed FHN algorithm against the state-of-the-art algorithm for this task, named HUINIV-Mine. Experiments were done in Java and performed on a computer with a third generation 64 bit Core i5 processor running the Windows 7 operating system and with 4 GB of free RAM. We compared the performance

Conclusion

In this paper, we have studied the problem of mining high utility itemsets from transactional databases with negative unit profits. Specifically, we have presented a novel Fast High-utility itemset miner with Negative unit profits (FHN) algorithm for mining high utility itemsets in databases where item unit profits may be positive or negative. A vertical list structure, called Positive-and-Negative Utility (PNU)-list, is designed for FHN so that it mines high utility itemsets without generating

Acknowledgement

This work is financed by a Natural Sciences and Engineering Research Council (NSERC) of Canada research grant and by the Tencent Project under grant CCF-TencentRAGR20140114.

References (30)

  • P. Fournier-Viger et al. VMSP: efficient vertical mining of maximal sequential patterns. Proc. 27th Canadian Conf. on Artificial Intelligence, Springer, LNAI (2014)
  • P. Fournier-Viger et al. Novel concise representations of high utility itemsets using generator patterns. Proc. 10th Int. Conf. on Advanced Data Mining and Applications (2014)
  • P. Fournier-Viger et al. FHM: faster high-utility itemset mining using estimated utility co-occurrence pruning. Proc. 21st Intern. Symp. Methodologies Intell. Systems (2014)
  • J. Han et al. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min. Knowl. Discov. (2004)