
1 Introduction

In the last decade, there has been a rapid increase in the generation of large-scale temporal data in various real-world domains such as retail transactions, the Internet of Things, and social media. This has placed renewed emphasis on the importance of pattern analysis for a wide range of applications. Over the years, High Utility Itemset Mining (HUIM) has gained ground and become a classical data mining problem in the research community. The HUIM problem asks: given a user-defined minimum utility threshold and a transaction dataset, which itemsets have a utility greater than or equal to that threshold? Due to the interesting challenges that HUIM poses, numerous mining algorithms [1,2,3] have been developed to address them. These challenges include, but are not limited to, the absence of the downward closure property and the large number of candidates generated. Despite these challenges, HUIM has notable applications such as market basket analysis, decision making and planning in retail, and product catalog design. The challenges of HUIM have also led to several branches [4,5,6] within the HUIM community.

An important research direction extended from HUIM is the discovery of high utility itemsets in data stream environments, owing to its wide applications in various domains. For time-variant data streams, there is a strong demand for efficient and effective methods to mine various temporal patterns. Previous work [7, 8] used the growth rate as the main indicator of emergence. An emerging itemset is an itemset i over two datasets \(D_1\) and \(D_2\) whose growth rate in favor of \(D_1\), defined as \(\frac{support_{D_1}(i)}{support_{D_2}(i)}\), exceeds a user-specified threshold. In other words, an itemset whose support increases significantly from \(D_2\) to \(D_1\) is considered an emerging itemset if the user-specified threshold is satisfied. Mining temporal emerging high utility itemsets is without doubt equally insightful and, in some cases, yields patterns preferred over the traditional HUI output. For instance, if a user plans to purchase stock items in the coming days, the HUI output alone may not lead to the most informed decision; emerging high utility itemsets are the useful and interesting patterns to consider in this case. However, most methods designed for traditional databases cannot be directly applied to mining temporal high utility itemsets in data streams.

In this paper, we address the problem of mining temporal emerging high utility itemsets over streaming databases. There are two major issues in this problem: (1) how to mine temporal emerging high utility itemsets correctly and efficiently, and (2) how to set a minimum utility threshold that is satisfied across all windows. To deal with these two issues, we propose a novel method named EFTemHUI (Efficient Framework for Temporal Emerging HUI mining), which offers high accuracy in mining temporal emerging high utility itemsets. To enhance the efficiency of the mining process, we devise a new mechanism to identify high utility itemsets that will emerge in the future, which captures and stores information about potential high utility itemsets.

The contributions of this paper are the following:

  1. We propose a novel method that efficiently mines emerging high utility itemsets over streaming transaction databases.

  2. To capture emerging high utility itemsets efficiently, we define a new class of itemsets, called Nearest High Utility Itemsets (NHUI).

  3. To improve the accuracy of mining emerging high utility itemsets, we devise a novel mechanism that incorporates a regression-based predictive model into our method.

  4. Through a series of experiments, we demonstrate the excellent performance of our proposed method. To the best of our knowledge, this is the first work that considers mining emerging high utility itemsets over streaming databases.

The rest of this paper is organized as follows: Sect. 2 highlights the related work of this research. In Sect. 3, we formally present the problem and other definitions. Section 4 gives detailed information on our proposed method. Finally, Sects. 5 and 6 present the results of our experimental evaluation and the summary, respectively.

2 Related Work

The background of this research encompasses a number of key research directions in data mining: streaming data mining, high utility itemset mining, emerging patterns, and regression models, all of which are well-studied techniques in data mining and machine learning. Since this is the first work that considers mining emerging high utility itemsets, our related work centers around and highlights the advancements in both HUIM and emerging pattern mining research. We also introduce some challenges of working with streaming data, which is the type of transaction data used in this research.

Research on high utility itemset (HUI) mining [1,2,3, 6, 9] has focused on developing efficient algorithms that identify itemsets with high utility values based on a user-specified threshold. HUIM algorithms can be categorized into two main groups depending on whether they require candidate generation (two-phase) or not (one-phase). The latter group performs best in terms of execution time and memory management [1]. [10] highlights some of the challenges encountered in mining over streaming data: storage limitations, data inconsistency, the arrival of data at high speed, and a few others. As a result of these challenges, traditional data mining algorithms are not sufficient for this type of data. Nevertheless, there have been pioneering [4, 11] as well as state-of-the-art algorithms that address these challenges in other directions [5] of high utility itemset mining over streaming data. Algorithms designed to mine HUIs over stream data typically fall into two paradigms: (1) the time-fading [11] paradigm and (2) the sliding window [4] paradigm.

In 1999, Dong et al. [7] first introduced the idea of emerging patterns (EPs), describing them as patterns that show a significant increase in support from one dataset to another, and proposed a border-based algorithm to solve this problem. Over the years there have been numerous publications on emerging patterns. Depending on the strategy adopted, an EP algorithm may fall into one of four groups [12]: constraint-based [13], border-based [7], tree-based [8], and evolutionary fuzzy system-based [14].

3 Problem Definitions

In this section, we formally define the problem statement. In addition, we introduce some preliminary definitions that are essential to the problem statement. For clarity, we use Table 1, which shows our streaming data, and Table 2, which shows the external utility values, as our running example.

3.1 Definitions

Our streaming data is processed using the sliding window method. According to the running example in Table 1, the table's annotation is as follows. A window (W) represents the portion of the data that has already been captured and is being processed. In Table 1, a window's size is measured in batches; \(W_1\) contains two batches (\(B_1\) and \(B_2\)). A batch is a set of transactions that arrive in a window together; \(B_1\) contains three transactions (\(T_1, T_2,\) and \(T_3\)). Note that the window slides one batch at a time.

Table 1. Streaming data

Definition 1

Utility:

The utility of an item i in a given transaction T from a streaming dataset \(S_D\), denoted \(u(i,T_{S_D})\), is defined as the product of its internal utility (quantity), denoted \(q(i,T_{S_D})\), and its external utility (unit profit), denoted p(i); the utility of an itemset in a transaction is the sum of the utilities of its items in that transaction. For example, \(u(a,T_7) = 4 \times 6 = 24\).

Definition 2

Utility of an Itemset in a Window:

The utility of an itemset i in a given window w is the sum of the itemset's utility over all transactions in w: \(U(i_w) = \sum_{T \in w} u(i,T)\). For example, \(U(\{c,d\}_{w_1}) = u(\{c,d\},t_1) + u(\{c,d\},t_2)+ u(\{c,d\},t_5) + u(\{c,d\},t_6) = 40 + 40 + 105 + 85 = 270.\)
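To make Definitions 1 and 2 concrete, the following minimal Python sketch computes an itemset's utility in a transaction and in a window. It assumes transactions are stored as dictionaries mapping items to purchased quantities and that `profit` holds the external utilities; the item names and numbers are illustrative and are not taken from Tables 1 and 2.

```python
def itemset_utility_in_transaction(itemset, transaction, profit):
    """u(i, T): sum of quantity * unit profit, or 0 if i is not contained in T."""
    if not set(itemset).issubset(transaction):
        return 0
    return sum(transaction[item] * profit[item] for item in itemset)


def itemset_utility_in_window(itemset, window, profit):
    """U(i_w): sum of u(i, T) over all transactions T in the window w."""
    return sum(itemset_utility_in_transaction(itemset, t, profit) for t in window)


# Illustrative values only (not the values of Tables 1 and 2).
profit = {"a": 6, "c": 10, "d": 5}
window = [{"c": 2, "d": 4}, {"a": 1, "c": 3}]
print(itemset_utility_in_window({"c", "d"}, window, profit))  # 2*10 + 4*5 = 40
```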

Table 2. External utility values

Definition 3

High Utility Itemset in a Window:

Given a window w and a user-specified minimum utility threshold mu, an itemset is considered a high utility itemset iff its utility in the current w is greater than or equal to mu. For example, if \(mu=100\) and the window is \(w_1\), \(\{c,d\}\) is an HUI whereas \(\{a,d\}\) is not.

Definition 4

Nearest High Utility Itemset (NHUI):

Given a current window w, a user-specified minimum utility threshold mu, and a tolerance threshold expressed as a percentage, \(m\%\), a nearest HUI is an itemset whose utility value is less than mu but greater than \(m\%\) of mu. For example, following Definition 3, \(\{d\}\) in \(w_1\) is not an HUI but is an NHUI if \(m\% = 75\%\) and \(mu = 130\).

Definition 5

Growth Rate:

Following [15], we redefine the growth rate as the utility of an itemset x in window \(W_i\) over its utility in window \(W_{i-1}\), as shown in Eq. 1.

$$\begin{aligned} \qquad GR(x) ={\left\{ \begin{array}{ll} 0, &{} \text {if } Util_{W_{i-1}}(x)=Util_{W_i}(x) = 0, \\ \infty , &{} \text {if } Util_{W_{i-1}}(x) = 0 \wedge Util_{W_i}(x) \ne 0, \\ \frac{Util_{W_i}(x)}{Util_{W_{i-1}}(x)}, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(1)
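As a minimal sketch, the growth rate of Eq. 1 (with the utility in window \(W_i\) divided by the utility in \(W_{i-1}\)) can be computed as follows; `util_prev` and `util_curr` stand for \(Util_{W_{i-1}}(x)\) and \(Util_{W_i}(x)\), and the function name is illustrative.

```python
import math


def growth_rate(util_prev, util_curr):
    """Eq. 1: growth rate of an itemset between two consecutive windows."""
    if util_prev == 0 and util_curr == 0:
        return 0.0
    if util_prev == 0:            # the itemset appears only in the current window
        return math.inf
    return util_curr / util_prev  # utility in window i over utility in window i-1


print(growth_rate(130, 195))  # 1.5: the itemset's utility grew by 50%
```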

3.2 Problem Statement

Given streaming data \(S_D\), a user-specified minimum utility threshold mu, and a tolerance threshold \(m\%\), the goal is to identify all high utility itemsets that will emerge in a given future window \(w_f\).

4 Proposed Method

Figure 1 illustrates EFTemHUI (Efficient Framework for Temporal Emerging HUI mining), which incorporates a regression model into the mining of emerging high utility itemsets. The following subsections highlight the different components of the EFTemHUI method.

Fig. 1. EFTemHUI (Efficient Framework for Temporal Emerging HUI mining)

4.1 Method Input - Streaming Transaction Data

Our datasets are streaming data, drawn from the benchmark datasets used in HUIM research. The key features of streaming data are: (1) they are continuous (streaming), (2) data arrive at high speed, which limits the time available to process each element, and (3) storage and memory consumption must be handled carefully. In today's world, there are numerous sources of streaming data, including RFID data, web clicks, telecommunication network data, media data, sensor networks, retail transactions, and many others.

At this stage, we pre-process the dataset into the format required by the HUIM algorithms in the next stage of the method. Standard HUIM algorithms take input in a form such as abc : 150 : 100, 25, 25, where the items are followed by the transaction utility and the individual item utilities. This format also applies to HUIM over streaming transaction data. Table 3 shows a quick summary of some of the benchmark datasets used in HUIM research [16].
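As a minimal illustration, the following sketch parses one transaction line in the format shown above; the exact separators and field meanings (items, transaction utility, per-item utilities) are assumptions about the benchmark format and may differ slightly from the files distributed with [16].

```python
def parse_transaction(line):
    """Parse a line such as 'abc : 150 : 100, 25, 25' into its three fields."""
    items_part, tu_part, utils_part = [p.strip() for p in line.split(":")]
    items = list(items_part)                                   # 'abc' -> ['a', 'b', 'c']
    transaction_utility = int(tu_part)                         # 150
    item_utilities = [int(u) for u in utils_part.split(",")]   # [100, 25, 25]
    return dict(zip(items, item_utilities)), transaction_utility


print(parse_transaction("abc : 150 : 100, 25, 25"))
# ({'a': 100, 'b': 25, 'c': 25}, 150)
```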

The parameters required for mining emerging high utility itemsets are the minimum utility threshold (mu), the tolerance threshold (\(m\%\)), and the buffer window size (w). The parameter w indicates the amount of data captured for analysis at any given time.

Table 3. Dataset summary

4.2 Itemsets Mining

This stage of the method is subdivided into three main components: the buffer transaction constructor, the high utility itemsets miner, and the nearest high utility itemsets miner. The two mining components use the minimum utility threshold and the tolerance threshold to mine two types of itemsets: (1) HUIs and (2) Nearest HUIs (see Definition 4).

Buffer Transaction Constructor: The buffered transaction constructor component is used to identify the portion of the streaming data that should be captured for analysis by the method. Table 1 illustrates the window size and batch size mechanisms used in this component.
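A minimal sketch of this component is given below, assuming a window holds a fixed number of batches (two in the running example of Table 1) and slides one batch at a time; the class and method names are illustrative, not the framework's actual implementation.

```python
from collections import deque


class SlidingWindowBuffer:
    """Keeps the most recent batches; the oldest batch drops out automatically."""

    def __init__(self, batches_per_window):
        self.window = deque(maxlen=batches_per_window)

    def push_batch(self, batch):
        """Append a newly arrived batch (a list of transactions)."""
        self.window.append(batch)

    def transactions(self):
        """All transactions currently covered by the window."""
        return [t for batch in self.window for t in batch]


buffer = SlidingWindowBuffer(batches_per_window=2)
buffer.push_batch(["T1", "T2", "T3"])  # B1
buffer.push_batch(["T4", "T5", "T6"])  # B2 -> window W1 = {T1..T6}
buffer.push_batch(["T7", "T8", "T9"])  # B3 -> slides to W2, B1 is discarded
print(buffer.transactions())           # ['T4', 'T5', 'T6', 'T7', 'T8', 'T9']
```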

High Utility Itemsets (HUI) Miner: In short, a high utility itemset is a group of items that are sold together and whose combined utility meets a minimum utility threshold set by the user. Below is a definition from [1].

The problem of high-utility itemset mining is defined as follows. Let I be a finite set of items (symbols). An itemset X is a finite set of items such that \(X \subseteq I\). A transaction database is a multiset of transactions \( D = \{T_1, T_2, ..., T_n\}\) such that for each transaction \(T_c, T_c \subseteq I\) and \(T_c\) has a unique identifier c called its TID (Transaction ID). Each item \(i \in I\) is associated with a positive number p(i), called its external utility (e.g. unit profit). Every item i appearing in a transaction \(T_c\) has a positive number \(q(i, T_c)\), called its internal utility (e.g. purchase quantity).

The correctly mined HUIs are the direct input of the Invalid High Utility Detector in the next stage of the method.

Nearest High Utility Itemsets Miner: Nearest HUI is a new term introduced by this framework. A nearest HUI is an itemset that does not meet the minimum utility requirement set by the user but is close enough to satisfy \(m\%\) of that minimum threshold.

For example, given a minimum threshold of 100 and a tolerance threshold of \(75\%\), any itemset whose utility is greater than or equal to 100 is considered an HUI, while any itemset whose utility is greater than or equal to 75 but less than 100 is considered an NHUI (see Definition 4). The NHUIs are the input for the Emerging High Utility Itemsets Predictor; since their utility does not meet mu, the predictor component is used to predict their value in the next window.
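A minimal sketch of this split, using the thresholds of the example above (mu = 100, tolerance 75%); the boundary handling follows that example, with utilities at or above \(m\% \cdot mu\) counted as NHUIs, and the function name is illustrative.

```python
def classify(utility, mu, m):
    """Return 'HUI', 'NHUI', or 'neither' for an itemset utility in the current window."""
    if utility >= mu:
        return "HUI"           # meets the minimum utility threshold
    if utility >= m * mu:
        return "NHUI"          # close enough: within the tolerance band below mu
    return "neither"


print(classify(120, mu=100, m=0.75))  # HUI
print(classify(80, mu=100, m=0.75))   # NHUI (>= 75 but < 100)
print(classify(60, mu=100, m=0.75))   # neither
```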

4.3 Emerging Itemsets Prediction

Two main operations are performed at this stage. The NHUIs are quite important at this level of the framework because they are the input to the regression model. The two main operations of this stage are (1) the purge of invalid HUIs and (2) the regression model implementation.

Invalid High Utility Detector: Since the window will move on, it is essential to remove the oldest batch of transactions in preparation for the newer batch arriving from the transaction stream. The Invalid High Utility Detector is used to filter out itemsets that will no longer meet the mu threshold in the next window; the itemsets that survive this check form a class we call Purge HUIs (P-HUIs). An itemset is considered a P-HUI if its utility is greater than or equal to the minimum utility after the utility of the oldest batch has been removed but before newer transactions are added. For example, \(U(\{c\}_{w_1}) = u(\{c\},t_1) + u(\{c\},t_2) + u(\{c\},t_3)+ u(\{c\},t_4)+ u(\{c\},t_5)+ u(\{c\},t_6) = 30 + 20 + 120 + 70 + 80 + 40 = 360\). After the purge in the transition to \(w_2\), the retained utility of the P-HUI \(\{c\}\) becomes \(u(\{c\},t_4)+ u(\{c\},t_5)+ u(\{c\},t_6)=70 + 80 + 40 = 190\); with \(mu=100\), \(\{c\}\) is therefore still a high utility itemset in the next window.
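A minimal sketch of the purge check, assuming `batch_utilities` lists the utility an itemset contributes in each batch of the current window, ordered from oldest to newest; the function name and data layout are illustrative.

```python
def is_purge_hui(batch_utilities, mu):
    """An itemset is a P-HUI if it stays at or above mu after the oldest batch is dropped."""
    retained = sum(batch_utilities[1:])  # utility left once the oldest batch is purged
    return retained >= mu


# Running example for {c}: B1 (t1-t3) contributes 30 + 20 + 120 = 170 and
# B2 (t4-t6) contributes 70 + 80 + 40 = 190; after purging B1, 190 >= 100.
print(is_purge_hui([170, 190], mu=100))  # True
```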

Emerging High Utility Itemsets Predictor: We tested and compared the performance of three different regression models: linear regression, lasso regression, and random forest regression (see Fig. 3). The component of the framework responsible for utility prediction is the Emerging High Utility Itemsets Predictor, which takes the NHUIs as input. The task of the prediction model is to estimate the utility of each NHUI for the upcoming time window. When the estimated value is greater than or equal to the minimum utility threshold, the itemset is considered a potential HUI that could emerge in the next time window.
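The sketch below illustrates one way such a predictor could be wired up with scikit-learn, assuming each NHUI is described by its utility in a few recent windows and the model estimates its utility in the next window; the feature layout and the toy numbers are illustrative, not the framework's actual implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy training data: per-NHUI utilities observed over three recent windows,
# paired with the utility observed one window later.
history = np.array([[60, 75, 90],
                    [80, 70, 85],
                    [50, 65, 95]])
next_utility = np.array([110, 90, 120])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(history, next_utility)

mu = 100
candidate = np.array([[70, 85, 96]])      # an NHUI in the current window
predicted = model.predict(candidate)[0]
print("predicted utility:", predicted,
      "-> potential emerging HUI" if predicted >= mu else "-> not emerging")
```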

4.4 Emerging Itemsets

With 100% certainty, all outputs of the Invalid High Utility Detector will emerge in the next time window: their utility already satisfies the minimum requirement, so they are high utility itemsets. However, the outputs of the Emerging High Utility Itemsets Predictor component are estimates, so they must be confirmed at this stage to evaluate the accuracy of the method. The overall accuracy of the framework therefore depends on the regression model used for utility estimation.

5 Experimental Evaluation

We conduct a series of experiments to evaluate the performance of our framework using semi-synthetic data under various system conditions. All experiments were performed on an Intel Xeon CPU E5-2630 2.20 GHz machine with 128 GB of memory running Ubuntu 14.04. Two programming languages (Java and Python) were used to implement our framework, each handling a section of the framework.

5.1 Experimental Setup

Our evaluation is separated into two parts: internal observation and external comparison. In the first part, we adjust the minimum utility threshold and observe its effect on the execution time of our framework. Moreover, we analyze the effect of window size on the accuracy of three different regression models, which are used to predict which NHUIs will emerge as HUIs in the next time window. In the external comparison, we developed a baseline algorithm (see Algorithm 1) that implements the basic rule for emerging itemset identification, and compared the accuracy of both the baseline and our framework's method of itemset identification.

Dataset: In the experiments, we use three different datasets. The first is chainstore (Table 3), a real dataset used in HUIM research. The accidents dataset (Table 3), which is semi-real, is also obtained from the HUIM benchmark datasets; it is considered semi-real because its utility values are synthetic, generated following the same normal distribution used in [1]. The third dataset is fully simulated data generated using the SPMF [16] toolkit; to ensure trendiness in the dataset, we used the same technique as [17]. Table 4 describes the datasets used in our experiments.

Table 4. Description of datasets used for experiment.

Growth Rate-Based Algorithm: Our growth rate algorithm is implemented using the growth rate evaluation in Definition 5. The algorithm takes the streaming database D, window W, and minimum utility \(\delta \) as input. The output of this algorithm is a list of high utility itemsets that, according to Eq. 1, could be emerging itemsets. The growth rate function has three main outcomes: (1) \(GR = 0 \) if \( Util_{W_{i-1}}(x)=Util_{W_i}(x) = 0\); (2) \(GR = \infty \) if \(Util_{W_{i-1}}(x) = 0 \wedge Util_{W_i}(x) \ne 0\); (3) otherwise, GR is a finite non-negative value, with values greater than 1 indicating growth.
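A minimal sketch of this baseline, assuming `prev_utils` and `curr_utils` map each candidate itemset to its utility in windows \(W_{i-1}\) and \(W_i\) and that an itemset is flagged as emerging when its growth rate exceeds 1; the minimum utility filtering step of the actual algorithm is omitted and the example values are illustrative.

```python
import math


def growth_rate(util_prev, util_curr):
    """Eq. 1: growth rate between two consecutive windows."""
    if util_prev == 0 and util_curr == 0:
        return 0.0
    if util_prev == 0:
        return math.inf
    return util_curr / util_prev


def gr_based_emerging(prev_utils, curr_utils, min_growth=1.0):
    """Flag itemsets whose utility grew between the previous and current window."""
    candidates = set(prev_utils) | set(curr_utils)
    return [x for x in candidates
            if growth_rate(prev_utils.get(x, 0), curr_utils.get(x, 0)) > min_growth]


prev_utils = {frozenset("cd"): 270, frozenset("d"): 100}
curr_utils = {frozenset("cd"): 300, frozenset("d"): 90}
print(gr_based_emerging(prev_utils, curr_utils))  # only {c, d} emerges (300/270 > 1)
```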

5.2 Internal Observation

We evaluated our framework based on several parameters that must be considered carefully when mining emerging high utility itemsets. The most obvious parameter is the minimum utility: its value determines how many itemsets are generated and, consequently, how fast the entire framework performs.

Fig. 2. The execution time for the mining of NHUI

The minimum utility value for the chainstore dataset is 52 times the average transaction utility per window; for accidents and spmf150k_200di_75ipt, the multipliers are 400 and 350, respectively. As shown in Fig. 2, the value of the minimum utility contributes to longer execution times. The second factor contributing to higher execution time is the number of windows: a smaller number of windows takes more time to process. This trend occurs because as the number of windows decreases, the size of each window increases, so more NHUIs can be generated from each window.

Fig. 3. Performance of regression models with respect to window size

We also tested and compared the performance of the three regression models: linear regression, lasso regression, and random forest regression. Each model consumes a significant amount of time depending on the average number of NHUIs. On average, random forest performed well on almost all datasets, achieving the highest accuracy, as shown in Fig. 3.

As described above, we used three different datasets to evaluate our framework for mining emerging high utility itemsets. It is crucial to set the right minimum utility value, as this determines whether a very high or a very low number of NHUIs is generated. For instance, if an absolute minimum utility were set for all windows, the minimum utility issue mentioned above would surface. To circumvent this, we used a relative minimum utility: in all three datasets, the minimum utility is set as the product of n (an arbitrarily chosen value) and the average utility of the given window.
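A minimal sketch of the relative minimum utility, assuming `transaction_utilities` are the transaction utilities of the current window and `n` is the arbitrarily set multiplier; the numbers are illustrative only.

```python
def relative_min_utility(transaction_utilities, n):
    """mu = n * (average transaction utility of the given window)."""
    avg = sum(transaction_utilities) / len(transaction_utilities)
    return n * avg


print(relative_min_utility([150, 120, 180, 90], n=52))  # 52 * 135.0 = 7020.0
```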

Fig. 4. The average number of NHUIs mined per window

In Fig. 4, we also observe that the number of NHUIs generated increases as the number of windows decreases. Again, this is because the window sizes become significantly larger as the number of windows decreases; therefore many more NHUIs and HUIs can be mined from such large windows. We can therefore conclude that both the execution time and the number of NHUIs generated are inversely related to the number of windows.

5.3 External Comparison

In Table 5, we compare the performance of our framework with the baseline algorithm. We used three different datasets, each with two window-count settings. On all the datasets used, our proposed framework outperformed the baseline algorithm. One of the key advantages of our framework over the GR-based algorithm is its ability to keep the utility values of the NHUI list over several windows, whereas the GR-based algorithm only uses information from the current window and the immediately preceding window. Using only the immediately preceding window causes the GR-based algorithm to lose many essential patterns, as is the case for accidents 500 in Table 5.

Table 5. Performance of our proposed framework in comparison to GR-Based method. Accuracy is measured in percentage (%)

It is also worth noting that even though our regression-based method outperformed the baseline on all the datasets used, in some cases the difference is not significant, or both achieve the same performance. This observation is attributed to the dataset used. The weakness of the baseline model is that it misses itemsets when the immediately preceding window had a utility of 0. On the other hand, the weakness of our model appears when there are several previous windows with 0 values; several 0 or missing values cause the regression model to make less accurate predictions. The strength of our framework depends on the regression model used as well as the dataset.

6 Conclusion

In this work, we introduced a promising research problem, named mining Temporal Emerging High Utility Itemsets over streaming databases, which extends the classical high utility itemset mining problem. To solve this problem, we designed and implemented a novel method to unearth these interesting itemsets correctly and efficiently. To ensure the efficiency of our method, we devised a new mechanism that uses a proven predictive model to evaluate the high utility itemsets that will emerge, and that captures and stores information about these potential high utility itemsets. We evaluated our proposed method on three large public datasets, and the experimental results show that EFTemHUI outperforms the GR-based algorithm in terms of accuracy. In conclusion, EFTemHUI is validated as a promising solution that offers high accuracy and efficiency for mining temporal emerging high utility itemsets.

Temporal emerging high utility itemsets are essential and insightful patterns to mine, as they provide information about what might happen in the future. This type of pattern is particularly useful for applications that require prior planning, such as the stock market and retail store inventory management. For future work, we plan to incorporate and experiment with other powerful predictive models for mining emerging high utility itemsets. We also hope to obtain more real datasets that can help us expand the applications of emerging HUIs in domains such as biomedicine, the retail market, and the stock market.