Elsevier

Information Sciences

Volume 484, May 2019, Pages 44-70
EHNL: An efficient algorithm for mining high utility itemsets with negative utility value and length constraints

https://doi.org/10.1016/j.ins.2019.01.056

Highlights

  • We address the problem of mining high utility itemsets with negative item values and length constraints.

  • We propose a high utility itemset mining algorithm that handles negative utility values using a pattern-growth approach.

  • We introduce a minimum length constraint to remove the numerous tiny itemsets and a maximum length constraint to exclude overly long itemsets.

  • We use two upper bounds to prune the search space.

  • We use transaction merging and dataset projection to reduce the cost of dataset scans. A memory-efficient array-based technique is also used to speed up utility counting.

  • Comparative experiments show that the proposed algorithm performs better on real and synthetic datasets.

Abstract

High utility itemsets (HUIs) mining is an emerging topic in frequent itemset mining (FIM), and it provides more informative and actionable results than FIM. Although many HUIs mining algorithms have been proposed in recent years, they generate a large number of candidate itemsets, and most of the generated itemsets are tiny, which degrades mining performance and actionability. In addition, most of these algorithms work only with positive utility values. To overcome these issues, we propose an algorithm named EHNL (Efficient High utility itemsets mining with Negative utility and Length constraints). Although negative utility and constraint-based mining are common in real-world applications, mining HUIs with both negative utility and length constraints has not yet been proposed in the literature. Most traditional algorithms also suffer from multiple dataset scans. To reduce the scanning cost, we use dataset projection and transaction merging techniques. To further improve performance, we use a sub-tree based pruning technique. To evaluate the individual techniques, we introduce two variants of the proposed algorithm, EHNL(RSUP) and EHNL(TM). The experimental results show that the variants of the proposed algorithm mine HUIs efficiently.

Introduction

Mining high utility itemsets (HUIs) is a subfield of frequent itemset mining (FIM), which is a fundamental research topic in data mining. The aim of FIM is to discover itemsets that frequently occur together: it mines from a transactional dataset the itemsets that satisfy the minimum support (min_sup) threshold, where min_sup is a parameter set by the user. The concept of FIM was introduced by Agrawal and Srikant [1]. Later, many algorithms [2], [3] were developed for mining frequent itemsets. The main challenge in FIM is to develop a fast and efficient algorithm that can handle large volumes of data, scan the dataset as few times as possible, and find rules quickly. Level-wise algorithms for mining frequent itemsets waste a lot of time generating candidate itemsets. The FP-Growth algorithm [4] is also very useful for finding frequent itemsets. It mines frequent itemsets quickly and consumes less memory than level-wise algorithms; because it generates fewer candidate itemsets, it also takes less time. However, it still has limitations with respect to space and time. FIM algorithms assume that an item cannot appear more than once in a transaction and that all items have the same importance (weight, unit profit, etc.). Ignoring the importance and quantity of an item may hide relevant information. Hence, FIM not only loses valuable information about the itemsets but also generates many irrelevant and unimportant frequent itemsets. In real life, however, retailers are interested in finding important itemsets rather than merely frequent ones. To address the issue of item quantities, a multi-frequency based FIM algorithm has been proposed [5], but mining based on quantity alone loses the importance of the items. To overcome these problems, high utility itemsets mining algorithms have been proposed [6], [7], [8].

HUIs mining considers two characteristics of each item: quantity (internal utility) and importance (external utility). It finds more meaningful and actionable itemsets than FIM. HUIs mining can be used in application areas such as market basket analysis [7], [9], [10], website click-stream analysis [11], [12], cross-marketing in retail stores [13], [14], [15], biomedical applications [6], and mobile commerce applications [16]. Mining HUIs is a difficult task because utility does not follow the downward closure property [1] that is widely used in FIM to reduce the search space. Therefore, pruning the search space is very difficult in HUIs mining. To address this problem, Liu et al. proposed an algorithm named Two-phase, which introduces the TWU (transaction weighted utility) strategy to prune the search space in HUIs mining [17]. TWU serves as a downward closure property for mining HUIs. All algorithms based on the two-phase model use the TWU based pruning strategy [9], [10], [18], [19], [20], [21], [22]. The two-phase model uses a join-and-prune technique to mine HUIs and therefore suffers from a huge number of generated candidates and multiple dataset scans.
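The following minimal Python sketch illustrates why utility lacks the downward closure property while TWU has it. The item names, unit profits, and transactions are hypothetical and chosen only for illustration; the function names are assumptions, not identifiers from any of the cited algorithms.

```python
# Hypothetical unit profits (external utility); not taken from the paper.
PROFIT = {"A": 5, "B": 2, "C": 1}

# Hypothetical transactions: item -> purchased quantity (internal utility).
TRANSACTIONS = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 6},
    {"B": 4, "C": 3},
]


def contains(transaction, itemset):
    """True if every item of the itemset occurs in the transaction."""
    return all(item in transaction for item in itemset)


def total_utility(itemset):
    """Sum of quantity * unit profit over all transactions containing the itemset."""
    return sum(
        sum(t[item] * PROFIT[item] for item in itemset)
        for t in TRANSACTIONS
        if contains(t, itemset)
    )


def transaction_utility(transaction):
    """Utility of a whole transaction."""
    return sum(qty * PROFIT[item] for item, qty in transaction.items())


def twu(itemset):
    """Transaction weighted utility: sum of the utilities of the
    transactions that contain the itemset."""
    return sum(transaction_utility(t) for t in TRANSACTIONS if contains(t, itemset))


# Utility is not anti-monotone: the superset {A, C} beats its subset {A}.
print(total_utility(("A",)), total_utility(("A", "C")))  # 15 16
# TWU is anti-monotone and overestimates utility, so it can be used for pruning.
print(twu(("A",)), twu(("A", "C")))  # 25 16
```

With these numbers, an itemset whose TWU falls below the user's utility threshold can be discarded together with all of its supersets, which is the pruning idea attributed to Two-phase above.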

Later, Tseng et al. proposed a tree-based algorithm named UP-Growth to overcome the limitations of join-and-prune based algorithms [10]. More recently, utility-list based algorithms [23], [24], [25], [26] have been proposed to further enhance the performance of HUIs mining; they show significant improvements in runtime and memory usage. Most HUIs mining algorithms do not consider negative utility. However, items with negative utility often occur in real-life transaction datasets. For example, retail stores and supermarkets may promote certain products to attract customers and increase sales. In this scenario, a supermarket may give away a free product whenever a customer buys a specific product. Giving a product away is a loss, or negative utility, for the supermarket. However, the supermarket may earn a higher profit from the other products that are cross-promoted with the free product. This practice is common among supermarkets. For example, if a customer buys three units of item A, he or she gets one unit of item B for free. Suppose the supermarket earns 5 dollars of profit on each unit of item A and loses 2 dollars on each unit of item B. Although the supermarket loses 2 dollars by giving item B away, it earns 15 dollars from the 3 units of item A. Thus, the supermarket has a net gain of 13 dollars each time this promotion is applied. This example shows that negative utility items occur frequently in the real world, and therefore mining them has many applications.
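The arithmetic of the promotion example can be checked with a few lines of Python. The unit profits come from the example in the text; the dictionary layout and the function name `net_utility` are assumptions made for this sketch.

```python
# Unit profits from the promotion example: A earns 5 dollars per unit,
# B (the free gift) costs the store 2 dollars per unit.
UNIT_PROFIT = {"A": 5, "B": -2}


def net_utility(quantities):
    """Net utility of one transaction: sum of quantity * unit profit."""
    return sum(qty * UNIT_PROFIT[item] for item, qty in quantities.items())


promotion = {"A": 3, "B": 1}  # three units of A bought, one unit of B given away
print(net_utility(promotion))  # 13
print(net_utility({"B": 1}))   # -2, item B alone is a loss
```

Item B has negative utility by itself, yet the itemset {A, B} is profitable, which is why negative items cannot simply be dropped before mining.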

Most traditional HUIs mining algorithms tend to find itemsets containing many items, as these are more likely to have a high utility. However, HUIs containing many items are generally less useful than itemsets containing fewer and more relevant items, because they represent situations that are more specific and thus rarer. For example, suppose a retail store finds two HUIs, {mapleSyrup, pancake} and {mapleSyrup, pancake, orange, cheese, cereal} [27]. A retailer who wants to increase overall profit will select the first itemset rather than the second, because the first contains only two very common items whereas the second contains five items and is rarer. A similar situation occurs in FIM, where it has been addressed by changing the task from mining frequent patterns to constraint-based mining of frequent itemsets, using minimum length and maximum length constraints. Unfortunately, most of the techniques developed for constraint-based FIM cannot be directly applied to HUIs mining, because testing an itemset requires utility checking, which is more difficult than frequency testing. Moreover, the search space of HUIs mining is much larger than that of FIM. Therefore, length constraint based HUIs mining is desirable for mining more useful HUIs.

In this paper, we address some key challenges in HUIs mining and then introduce several novel ideas to improve performance. The key challenges are as follows. First, most HUIs mining algorithms mine datasets containing only positive utility values, and traditional HUIs algorithms may lose candidate itemsets when negative utility values are present. Second, utility-list based algorithms consider candidates that may not appear in the dataset, and they also mine many tiny itemsets that are not actionable. Third, HUIs mining algorithms scan the dataset more than once, so techniques that reduce the dataset scanning cost are needed. Mining HUIs with negative utility items is therefore a computationally very expensive task. Finally, summation based techniques for overestimating utility are not sufficiently effective. As a result, the mined itemsets become long, which makes the results difficult to analyze. Hence, fewer but more meaningful itemsets are required.
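As a minimal sketch of what length constraints mean in practice, the Python code below filters the two example itemsets by size. The parameter names `min_length` and `max_length` and both function names are assumptions for this illustration; the way EHNL enforces the constraints during the search is not described in this excerpt.

```python
def filter_by_length(itemsets, min_length, max_length):
    """Keep only itemsets whose size lies in [min_length, max_length]."""
    return [x for x in itemsets if min_length <= len(x) <= max_length]


def can_extend(prefix, max_length):
    """In a pattern-growth search, a prefix that has already reached the
    maximum length does not need to be extended any further."""
    return len(prefix) < max_length


huis = [
    ("mapleSyrup", "pancake"),
    ("mapleSyrup", "pancake", "orange", "cheese", "cereal"),
]
print(filter_by_length(huis, min_length=2, max_length=3))
# [('mapleSyrup', 'pancake')]
```

Filtering after mining, as `filter_by_length` does, only trims the output. A check such as `can_extend` applied during the search is what would reduce the work done, since whole branches of longer itemsets are never explored.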

To address the above challenges, we propose an efficient algorithm named EHNL (Efficient High utility itemsets mining with Negative utility and Length constraints). The key contributions of this paper are summarized as follows:

  • We propose an efficient algorithm for mining HUIs with negative utility items using a pattern-growth approach, which considers only candidate itemsets that appear in the dataset.

  • We introduce a minimum length constraint to remove the numerous tiny itemsets and a maximum length constraint to exclude overly long itemsets.

  • To reduce the dataset scanning cost, we use dataset projection and transaction merging techniques. A memory-efficient array-based utility counting technique is also used to speed up utility counting.

  • To prune the search space and speed up the mining process, we use the sub-tree based pruning strategy introduced in EFIM [26].
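The dataset projection and transaction merging mentioned in the contributions can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure from the paper: each transaction is a hypothetical dictionary mapping an item to its utility in that transaction, items follow a fixed total order, and the function names are invented for this example.

```python
from collections import defaultdict


def project(transactions, item, order):
    """For each transaction containing `item`, keep only the items that
    follow it in `order`. Empty projections are dropped."""
    rank = {x: r for r, x in enumerate(order)}
    projected = []
    for t in transactions:
        if item in t:
            rest = {x: u for x, u in t.items() if rank[x] > rank[item]}
            if rest:
                projected.append(rest)
    return projected


def merge_identical(transactions):
    """Replace transactions with the same set of items by a single
    transaction whose item utilities are summed."""
    merged = defaultdict(lambda: defaultdict(int))
    for t in transactions:
        key = frozenset(t)
        for x, u in t.items():
            merged[key][x] += u
    return [dict(m) for m in merged.values()]


# Hypothetical dataset: item -> utility of that item in the transaction.
DB = [
    {"A": 5, "B": 4, "C": 1},
    {"A": 10, "B": 2, "C": 3},
    {"A": 5, "C": 2},
    {"B": 6, "C": 1},
]

projected = project(DB, "A", order="ABC")
print(merge_identical(projected))  # [{'B': 6, 'C': 4}, {'C': 2}]
```

After projecting on A, two of the three remaining transactions contain exactly the same items and collapse into one, so later scans of the projected dataset touch fewer transactions.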

In this regard, Pei and Han presented constrained frequent pattern mining together with its advantages and applications [28], [29], [30], [31]. They classified constraints according to their applications into item constraints, length constraints, super-pattern constraints (model-based constraints), aggregate constraints, regular expression constraints, duration constraints, and gap constraints. We adopted this idea and applied length constraints to HUIs mining. For more details about length constraints in FIM, see [28], [29], [30], [32], [33], [34].

The rest of this paper is organized as follows. Section 2 introduces the background and related work on HUIs mining. Section 3 gives preliminary definitions. Section 4 describes the proposed algorithm in detail. Section 5 presents the experimental results, and Section 6 concludes the paper.

Section snippets

Related literature

The HUIs mining problem is widely recognized as harder than the FIM problem because utility does not satisfy the downward closure property. In 2005, Liu et al. proposed an algorithm named Two-phase to mine HUIs [17]. This algorithm presents a TWU based downward closure property to prune the search space. Later, other two-phase algorithms such as [9], [18], [20], [35] followed this TWU based strategy to enhance performance. Later, Tseng et al. proposed a tree-based method

Preliminary definitions

A transaction dataset D is a set of transactions D = {T1, T2, …, Tm}, where m is the number of transactions. For example, in Table 1, dataset D includes seven transactions (T1 to T7). I = {x1, x2, …, xn} is the set of items that may appear in transactions. An itemset X is a set of items where X ⊆ I. If itemset X contains k distinct items, it is called a k-itemset. For example, the 2-itemset AB contains two items, A and B. Transaction T1 indicates that items A, B, D and E appear

The proposed algorithm

In this section, we give a step-by-step analysis of the proposed algorithm, EHNL. Section 4.1 describes the techniques used to reduce the cost of dataset scans. Section 4.2 describes the pruning strategies. Section 4.3 introduces the array-based utility counting technique. Finally, Section 4.4 gives the pseudo-code and an explanation of the proposed algorithm.
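The details of the array-based utility counting technique are not included in this excerpt. As a rough illustration of the general idea, the Python sketch below accumulates one upper bound (TWU) for every item in a single pass, using a plain array indexed by item id instead of per-itemset structures. The integer item ids, the data, and the function name are hypothetical, and how EHNL treats negative utilities inside its bounds is not shown here.

```python
def count_twu_per_item(transactions, n_items):
    """Single-scan counting: bins[i] accumulates the utility of every
    transaction that contains item i (the TWU of item i)."""
    bins = [0] * n_items  # one slot per item id, reused across the scan
    for t in transactions:
        tu = sum(u for _, u in t)  # utility of the whole transaction
        for item, _ in t:
            bins[item] += tu
    return bins


# Hypothetical dataset: each transaction is a list of (item_id, utility) pairs.
DATASET = [
    [(0, 5), (1, 4)],
    [(0, 10), (2, 6)],
    [(1, 8), (2, 3)],
]

print(count_twu_per_item(DATASET, n_items=3))  # [25, 20, 27]
```

Because the array has a fixed size equal to the number of items, its memory footprint does not grow with the number of candidate itemsets, and it can be reset and reused for each projected dataset.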

Experimental evaluation

In this section, we evaluate the performance of the proposed algorithm (EHNL). We implemented the proposed algorithm by extending the open-source Java library SPMF [46]. Experiments were performed on a PC with a 3.40 GHz Intel Core i7-6700 CPU and 8 GB of memory, running Windows 10 Pro (64-bit).

For the performance test, the following benchmark datasets from SPMF [46] were chosen: accidents, chess, mushroom, T10I4D100K, and T40I10D100K. The accidents, chess, and

Conclusion

In this paper, we proposed an algorithm for mining HUIs with negative utility and length constraints. Most traditional HUIs mining algorithms mine datasets that have only positive utility values, but negative utility is very important in real life. In the literature, only the HUINIV-Mine [22] and FHN [38] algorithms have been proposed to solve the problem of mining itemsets with negative utility, and the itemsets mined by both of these algorithms include a very large number of tiny itemsets. We

Compliance with ethical standards

The authors declare no conflicts of interest. The article mines high utility itemsets with negative utility and length constraints.

References (46)

  • T. Uno et al., LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets, IEEE ICDM Workshop on Frequent Itemset Mining Implementations (2004)

  • J. Han et al., Mining frequent patterns without candidate generation, ACM SIGMOD Record (2000)

  • K. Singh, H.K. Shakya, B. Biswas, Discovery of Multi-frequent Patterns Using Directed Graph, Springer India, New Delhi,...

  • R. Chan et al., Mining high utility itemsets, Proceedings of the Third IEEE International Conference on Data Mining (2003)

  • H. Yao et al., A foundational approach to mining itemset utilities from databases, Proceedings of the Third SIAM International Conference on Data Mining (2004)

  • Y. Liu et al., A fast high utility itemsets mining algorithm, Proceedings of the 1st International Workshop on Utility-based Data Mining (2005)

  • C.F. Ahmed et al., Efficient tree structures for high utility pattern mining in incremental databases, IEEE Trans. Knowl. Data Eng. (2009)

  • V.S. Tseng et al., UP-Growth: an efficient algorithm for high utility itemset mining, Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2010)

  • H.F. Li et al., Fast and memory efficient mining of high utility itemsets in data streams, 2008 Eighth IEEE International Conference on Data Mining (2008)

  • M. Zihayat et al., Mining top-k high utility patterns over data streams, Inf. Sci. (2014)

  • A. Erwin, R.P. Gopalan, N.R. Achuthan, Efficient Mining of High Utility Itemsets from Large Datasets, Springer Berlin...

  • B.-E. Shie, H.-F. Hsiao, V.S. Tseng, P.S. Yu, Mining High Utility Mobile Sequential Patterns in Mobile Commerce...

  • Y. Liu et al., A two-phase algorithm for fast discovery of high utility itemsets, Proceedings of the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (2005)