EHNL: An efficient algorithm for mining high utility itemsets with negative utility value and length constraints

doi:10.1016/j.ins.2019.01.056

Information Sciences

Volume 484, May 2019, Pages 44-70

https://doi.org/10.1016/j.ins.2019.01.056

Highlights

• We study the problem of mining high utility itemsets with negative item values and length constraints.
• We propose a high utility itemset mining algorithm that handles negative utility values using a pattern-growth approach.
• We introduce a minimum length constraint to remove the numerous tiny itemsets and a maximum length constraint to exclude overly long itemsets.
• We use two upper bounds to prune the search space.
• We use transaction merging and dataset projection to reduce the cost of dataset scans, and a memory-efficient array-based utility counting technique to speed up utility counting.
• Comparative experiments show that the proposed algorithm performs better than the compared algorithms on real and synthetic datasets.

Abstract

High utility itemsets (HUIs) mining is an emerging topic within frequent itemset mining (FIM) that provides more informative and actionable results than FIM. Although many HUIs mining algorithms have been proposed in recent years, they tend to generate a large number of candidate itemsets, and most of the generated itemsets are tiny, which degrades both mining performance and actionability. In addition, most of these algorithms work only with positive utility values. To overcome these issues, we propose an algorithm named EHNL (Efficient High utility itemsets mining with Negative utility and Length constraints). Although negative utility and constraint-based mining are common in real-world applications, mining HUIs with both negative utility and length constraints has not yet been proposed in the literature. Most traditional algorithms also suffer from multiple dataset scans. To reduce the scanning cost, we use dataset projection and transaction merging techniques. To further improve performance, we use a sub-tree based pruning technique. To assess the contribution of each technique, we introduce two variants of the proposed algorithm, EHNL(RSUP) and EHNL(TM). The experimental results show that the variants of the proposed algorithm mine HUIs efficiently.

Introduction

Mining high utility itemsets (HUIs) is a subfield of frequent itemset mining (FIM), which is a fundamental research topic in data mining. The aim of FIM is to discover itemsets that frequently occur together: it mines from a transactional dataset the itemsets whose support meets the minimum support ($min\_sup$) threshold, where $min\_sup$ is a parameter set by the user. The concept of FIM was introduced by Agrawal and Srikant [1]. Since then, many algorithms [2], [3] have been developed for mining frequent itemsets. The main challenge in FIM is to develop a fast and efficient algorithm that can handle large volumes of data, scan the dataset as few times as possible, and find rules quickly. Level-wise algorithms for mining frequent itemsets spend a lot of time generating candidate itemsets. The FP-Growth algorithm [4] is also widely used for finding frequent itemsets. It generates fewer candidate itemsets than level-wise algorithms, so it finds frequent itemsets faster and consumes less memory, but it still has limitations in terms of space and time. FIM algorithms assume that an item cannot appear more than once in a transaction and that all items have the same importance (weight, unit profit, etc.). Ignoring the importance and quantity of an item may hide important or relevant information. Hence, FIM not only loses valuable information about the itemsets but also generates many irrelevant and unimportant frequent itemsets. In real life, however, retailers are interested in finding important itemsets rather than merely frequent ones. To address the issue of item quantity, a multi-frequency based FIM algorithm has been proposed [5], but mining based on quantity alone still ignores the importance of the items. To overcome these problems, high utility itemsets mining algorithms have been proposed [6], [7], [8].

HUIs mining considers two characteristics of each item: quantity (internal utility) and importance (external utility). It therefore mines more meaningful and actionable itemsets than FIM. HUIs mining is used in application areas such as market basket analysis [7], [9], [10], website click-stream analysis [11], [12], cross-marketing in retail stores [13], [14], [15], biomedical applications [6] and mobile commerce applications [16]. Mining HUIs is a hard task because utility does not satisfy the downward closure property [1] that is widely used in FIM to reduce the search space. Pruning the search space is therefore very difficult in HUIs mining. To address this problem, Liu et al. proposed an algorithm named Two-phase, which introduces the TWU (transaction weighted utility) strategy to prune the search space in HUIs mining [17]. TWU satisfies the downward closure property and is used in its place when mining HUIs. All algorithms based on the two-phase model use the TWU-based pruning strategy [9], [10], [18], [19], [20], [21], [22]. The two-phase model uses a join-and-prune technique to mine HUIs, and therefore suffers from generating a huge number of candidates and from multiple dataset scans.
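As a rough illustration of the TWU idea, the following Python sketch computes transaction utilities and the TWU of each item on a toy dataset. The transactions, item names and unit profits are invented for illustration, and only positive utilities are used, as in the classic TWU definition; it is a sketch under these assumptions, not the implementation of any cited algorithm.

```python
from collections import defaultdict

# Hypothetical toy data: each transaction maps item -> purchased quantity.
transactions = [
    {"A": 1, "B": 2, "C": 1},
    {"A": 2, "C": 3},
    {"B": 1, "D": 1},
]
# Hypothetical unit profits (external utilities), all positive here.
unit_profit = {"A": 5, "B": 2, "C": 1, "D": 4}


def transaction_utility(tx):
    """Sum of quantity * unit profit over all items of one transaction."""
    return sum(qty * unit_profit[item] for item, qty in tx.items())


def twu_per_item(txs):
    """TWU of an item: sum of the utilities of the transactions containing it."""
    twu = defaultdict(int)
    for tx in txs:
        tu = transaction_utility(tx)
        for item in tx:
            twu[item] += tu
    return dict(twu)


def promising_items(txs, min_util):
    """Items whose TWU reaches min_util; all other items can be discarded."""
    return {item for item, value in twu_per_item(txs).items() if value >= min_util}
```

With positive utilities, the TWU of an item is never smaller than the utility of any itemset containing it, so an item whose TWU is below the threshold cannot be part of any HUI. In the toy data, item D has a TWU of 6 and would be pruned for a threshold of 15.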

Later, Tseng et al. proposed a tree-based algorithm named UP-Growth [10] to overcome the limitations of join-and-prune based algorithms. More recently, utility-list based algorithms [23], [24], [25], [26] have been proposed to further enhance the performance of HUIs mining; they show significant improvements in runtime and memory usage. Most HUIs mining algorithms do not consider negative utility. However, items with negative utility often occur in real-life transaction datasets. For example, retail stores and supermarkets may promote certain products to attract customers and increase sales, giving away a free product whenever a customer buys a specific product. The free product is a loss, that is, a negative utility, for the supermarket, but the supermarket may earn a higher profit from the other products that are cross-promoted with it. This practice is common. For example, suppose a customer who buys three units of item A gets one unit of item B for free, and suppose the supermarket earns 5 dollars of profit on each unit of item A and loses 2 dollars on each unit of item B. Although the supermarket loses 2 dollars by giving away item B, it earns 15 dollars from the 3 units of item A, for a net gain of 13 dollars each time the promotion is applied. This example shows that negative utility items occur frequently in the real world, so mining with negative utility has many applications.
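The arithmetic of the promotion example can be written out as a small Python sketch. The function name `itemset_utility` and the dictionary layout are illustrative choices, not taken from the paper.

```python
def itemset_utility(quantities, unit_profit):
    """Utility of a purchase: sum of quantity * unit profit over its items."""
    return sum(qty * unit_profit[item] for item, qty in quantities.items())


# Unit profits from the example: +5 dollars per unit of A, -2 dollars per unit of B.
unit_profit = {"A": 5, "B": -2}

# One application of the promotion: three units of A plus one free unit of B.
promotion = {"A": 3, "B": 1}

net_gain = itemset_utility(promotion, unit_profit)       # 3*5 + 1*(-2)
gain_without_b = itemset_utility({"A": 3}, unit_profit)  # 3*5
```

Here `net_gain` is 13 and `gain_without_b` is 15: adding a negative utility item lowers the utility of the itemset, which is why bounds designed for positive utilities only cannot be reused unchanged.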

Most traditional HUIs mining algorithms tend to find itemsets containing many items, as these are more likely to have a high utility. However, HUIs containing many items are generally less useful than itemsets containing fewer and more relevant items, because they represent situations that are more specific and thus rarer. For example, suppose a retail store finds two HUIs, {mapleSyrup, pancake} and {mapleSyrup, pancake, orange, cheese, cereal} [27]. To increase its overall profit, the store will act on the first itemset rather than the second, because the first contains only two very common items, whereas the second contains five items and is rarer. A similar situation occurs in FIM, where it has been addressed by changing the task from mining frequent patterns to constraint-based frequent itemset mining with minimum length and maximum length constraints. Unfortunately, most of the techniques developed for constraint-based FIM cannot be applied directly to HUIs mining, because testing an itemset requires utility checking, which is more difficult than frequency testing. Moreover, the search space of HUIs mining is much larger than that of FIM. Length constraint based HUIs mining is therefore desirable for mining more useful HUIs. In this paper, we address some key challenges in HUIs mining and then introduce several contributions to improve performance. The key challenges are as follows. First, most HUIs mining algorithms mine itemsets from datasets containing only positive utility values, and they may miss itemsets when applied to datasets with negative utility values. Second, utility-list based algorithms consider candidates that may not appear in the dataset, and they also mine many tiny itemsets that are not actionable. Third, HUIs mining algorithms scan the dataset more than once, so techniques for reducing the dataset scanning cost are needed; mining HUIs with negative utility items is therefore computationally very expensive. Last, summation-based utility overestimation techniques are not tight enough, so the resulting itemsets become long and difficult to analyze. Fewer but more meaningful itemsets are required.
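To make the task concrete, the following Python sketch enumerates all itemsets within given length bounds and keeps those whose total utility reaches a threshold, with negative unit profits allowed. It is a deliberately naive exhaustive search on invented data, meant only to illustrate the problem statement; it is not the pattern-growth approach of EHNL, and the names `mine_hui_bruteforce`, `min_len` and `max_len` are hypothetical.

```python
from itertools import combinations


def utility_in_transaction(itemset, tx, unit_profit):
    """Utility of itemset in tx, or None if tx does not contain every item."""
    if not all(item in tx for item in itemset):
        return None
    return sum(tx[item] * unit_profit[item] for item in itemset)


def mine_hui_bruteforce(txs, unit_profit, min_util, min_len, max_len):
    """Return {itemset: utility} for itemsets with min_len <= size <= max_len
    that occur in the data and whose total utility is at least min_util."""
    items = sorted({item for tx in txs for item in tx})
    result = {}
    for size in range(min_len, max_len + 1):
        for itemset in combinations(items, size):
            total, found = 0, False
            for tx in txs:
                u = utility_in_transaction(itemset, tx, unit_profit)
                if u is not None:
                    total += u
                    found = True
            if found and total >= min_util:
                result[itemset] = total
    return result


# Hypothetical toy data: item -> quantity per transaction; B has a negative profit.
txs = [{"A": 3, "B": 1}, {"A": 1, "C": 2}, {"A": 2, "B": 1, "C": 1}]
unit_profit = {"A": 5, "B": -2, "C": 3}

pairs_only = mine_hui_bruteforce(txs, unit_profit, min_util=15, min_len=2, max_len=2)
```

Setting `min_len=2` discards the single item A even though its utility (30) exceeds the threshold, and `max_len` caps how deep the enumeration goes. A practical algorithm would push both constraints into the search rather than enumerate every combination.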

To address the above challenges, we propose an efficient algorithm named EHNL (Efficient High utility itemsets mining with Negative utility and Length constraints). The key contributions of this paper are summarized as follows:

• We propose an efficient algorithm for mining HUIs with negative utility items using a pattern-growth approach, which only considers candidate itemsets that appear in the dataset.
• We introduce a minimum length constraint to remove the numerous tiny itemsets and a maximum length constraint to exclude overly long itemsets.
• To reduce the dataset scanning cost, we use dataset projection and transaction merging techniques. A memory-efficient array-based utility counting technique is also used to speed up utility counting.
• To prune the search space and speed up the mining process, we use the sub-tree based pruning strategy proposed in EFIM [26].

In this regard, Pei and Han presented constrained frequent pattern mining along with its advantages and applications [28], [29], [30], [31]. They classified constraints according to their application, including item constraints, length constraints, super-pattern constraints (model-based constraints), aggregate constraints, regular expression constraints, duration constraints and gap constraints. We took the idea from this work and applied length constraints to HUIs mining. For more details about length constraints in FIM, see [28], [29], [30], [32], [33], [34].

The rest of this paper is organized as follows. In Section 2, we introduce the background and related work for HUIs mining. In Section 3, we give preliminary definitions. In Section 4, the proposed algorithm is described in detail. Experimental results are shown in Section 5, and conclusions are given in Section 6.

Section snippets

Related literature

The HUIs mining problem is widely recognized as harder than the FIM problem because the utility of an itemset does not satisfy the downward closure property. In 2005, Liu et al. proposed an algorithm named Two-phase to mine HUIs [17]. This algorithm presents a TWU-based downward closure property to prune the search space. Later on, other two-phase algorithms such as [9], [18], [20], [35] followed this TWU-based strategy to enhance performance. Later Tseng et al. proposed a tree-based method

Preliminary definitions

A transaction dataset D is a set of transactions $D = \{T_{1}, T_{2}, \dots, T_{m}\}$, where m is the number of transactions. For example, in Table 1, dataset D includes seven transactions (T₁ to T₇). $I = \{x_{1}, x_{2}, \dots, x_{n}\}$ is the set of items which may appear in transactions. An itemset X is a set of items where X ⊆ I. If itemset X contains k distinct items, itemset X is called a k-itemset. For example, a 2-itemset AB contains two items A and B. Transaction T₁ indicates that items A, B, D and E appear

The proposed algorithm

In this section, we give a step-by-step analysis of the proposed algorithm, EHNL. Section 4.1 describes the techniques used to reduce the cost of dataset scans. Section 4.2 describes the pruning strategies. Section 4.3 introduces the array-based utility counting technique. Finally, Section 4.4 gives the pseudo-code and an explanation of the proposed algorithm.
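The two scan-reduction ideas named above, dataset projection and transaction merging, can be sketched in Python as follows. This is a simplified illustration in the spirit of EFIM [26], assuming each transaction is stored as a mapping from item to its utility in that transaction and that items follow a fixed total order. The function names and data are hypothetical, and bookkeeping such as the utility of the prefix itemset is omitted.

```python
from collections import defaultdict


def project(txs, item, order):
    """Project transactions on `item`: for each transaction containing it,
    keep only the items that come after `item` in the processing order."""
    rank = {x: r for r, x in enumerate(order)}
    projected = []
    for tx in txs:
        if item in tx:
            rest = {x: u for x, u in tx.items() if rank[x] > rank[item]}
            if rest:
                projected.append(rest)
    return projected


def merge_identical(txs):
    """Merge transactions that contain exactly the same items
    by summing their per-item utilities."""
    merged = defaultdict(lambda: defaultdict(int))
    for tx in txs:
        key = frozenset(tx)
        for x, u in tx.items():
            merged[key][x] += u
    return [dict(m) for m in merged.values()]


# Hypothetical toy data: item -> utility of that item in the transaction.
txs = [
    {"A": 5, "B": -2, "C": 3},
    {"A": 10, "B": -4, "C": 6},
    {"A": 5, "C": 3},
    {"B": -2, "D": 4},
]
order = ["A", "B", "C", "D"]

projected_on_a = project(txs, "A", order)
compact = merge_identical(projected_on_a)
```

After projection on A, the first two transactions contain the same items {B, C} and collapse into one, so later scans for extensions of A touch two transactions instead of three. Projection tends to make transactions shorter and therefore more likely to be identical, which is why the two techniques are usually combined.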

Experimental evaluation

In this section, we evaluate the performance of the proposed algorithm (EHNL). We implemented it by extending the open-source Java library SPMF [46]. Experiments were performed on a PC with an Intel Core i7-6700 CPU at 3.40 GHz and 8 GB of memory, running Windows 10 Pro (64-bit).

For the performance test, the following benchmark datasets in SPMF [46] were chosen: Accidents, chess, mushroom, T10I4D100K, and T40I10D100K. The accidents, chess, and

Conclusion

In this paper, we proposed an algorithm for mining HUIs with negative utility and length constraints. Most traditional HUIs mining algorithms mine itemsets from datasets that have only positive utility, although negative utility is important in real-life applications. In the literature, only the HUINIV-Mine [22] and FHN [38] algorithms have been proposed to solve the problem of mining HUIs with negative utility, and the results mined by both of these algorithms include a very large number of tiny itemsets. We

Compliance with ethical standards

The authors declare no conflicts of interest. The article mines high utility itemsets with negative utility and length constraints.

References (46)

D. Lee et al., Utility-based association rule mining: a marketing solution for cross-selling, Expert Syst. Appl. (2013)
S.-J. Yen, Y.-S. Lee, Mining High Utility Quantitative Association Rules, Springer Berlin Heidelberg, Berlin,...
W. Song et al., BAHUI: fast and memory efficient mining of high utility itemsets based on bitmap, Int. J. Data Warehous. Min. (2014)
S. Krishnamoorthy, Pruning strategies for mining high utility itemsets, Expert Syst. Appl. (2015)
S. Zida et al., EFIM: a fast and memory efficient algorithm for high-utility itemset mining, Knowl. Inf. Syst. (2017)
H.-F. Li et al., Fast and memory efficient mining of high-utility itemsets from data streams: with and without negative item profits, Knowl. Inf. Syst. (2011)
G.-C. Lan et al., On-shelf utility mining with negative item values, Expert Syst. Appl. (2014)
K. Singh et al., Mining of high utility itemsets with negative utility, Expert Syst. (2018)
R. Agrawal et al., Fast algorithms for mining association rules in large databases, Proceedings of the 20th International Conference on Very Large Data Bases (1994)
M.J. Zaki, Scalable algorithms for association mining, IEEE Trans. Knowl. Data Eng. (2000)
T. Uno et al., LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets, IEEE ICDM Workshop on Frequent Itemset Mining Implementations (2004)
J. Han et al., Mining frequent patterns without candidate generation, ACM SIGMOD Record (2000)
K. Singh, H.K. Shakya, B. Biswas, Discovery of Multi-frequent Patterns Using Directed Graph, Springer India, New Delhi,...
R. Chan et al., Mining high utility itemsets, Proceedings of the Third IEEE International Conference on Data Mining (2003)
H. Yao et al., A foundational approach to mining itemset utilities from databases, Proceedings of the Third SIAM International Conference on Data Mining (2004)
Y. Liu et al., A fast high utility itemsets mining algorithm, Proceedings of the 1st International Workshop on Utility-based Data Mining (2005)
C.F. Ahmed et al., Efficient tree structures for high utility pattern mining in incremental databases, IEEE Trans. Knowl. Data Eng. (2009)
V.S. Tseng et al., UP-Growth: an efficient algorithm for high utility itemset mining, Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2010)
H.F. Li et al., Fast and memory efficient mining of high utility itemsets in data streams, 2008 Eighth IEEE International Conference on Data Mining (2008)
M. Zihayat et al., Mining top-k high utility patterns over data streams, Inf. Sci. (2014)
A. Erwin, R.P. Gopalan, N.R. Achuthan, Efficient Mining of High Utility Itemsets from Large Datasets, Springer Berlin...
B.-E. Shie, H.-F. Hsiao, V.S. Tseng, P.S. Yu, Mining High Utility Mobile Sequential Patterns in Mobile Commerce...
Y. Liu et al., A two-phase algorithm for fast discovery of high utility itemsets, Proceedings of the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (2005)

