Mining cost-effective patterns in event logs

doi:10.1016/j.knosys.2019.105241

Knowledge-Based Systems

Volume 191, 5 March 2020, 105241

https://doi.org/10.1016/j.knosys.2019.105241

Abstract

High Utility Pattern Mining is a popular task for analyzing data. It consists of discovering patterns having a high importance in databases. A popular application of high utility pattern mining is to identify high utility (profitable) patterns in customer transaction data. Though such analysis can be useful to understand data, it does not consider the cost (e.g. effort, resources, money or time) required for obtaining the utility (benefits). In this paper, we argue that to discover interesting patterns in event sequences, it is useful to consider both a utility model and a cost model. For example, to identify cost-effective ways of treating patients from medical pathways data, it is desirable to consider not only the ability of treatments to inhibit symptoms or cure a disease (utility) but also the resources consumed and the time spent (cost) to provide these treatments. Based on this perspective, this paper defines a novel task of discovering Cost-Effective Event Sequences in event logs. In this task, cost is modeled as numeric values, while utility is represented either as binary or numeric values. Measures are proposed to evaluate the trade-off and correlation between cost and utility of patterns to identify cost-effective patterns (patterns having a low cost but providing a high utility). Three efficient algorithms called CEPB, corCEPB and CEPN are designed to extract these patterns. They rely on a tight lower-bound on the cost and a memory buffering technique to find patterns efficiently. Experiments show that the proposed algorithms achieve high efficiency, that proposed optimizations improve efficiency, and that insightful cost-effective patterns are found in real-life e-learning data.

Introduction

Discovering patterns in symbolic data is an important research area in data mining. Early studies in this area mainly focused on discovering frequent patterns. For instance, in frequent itemset mining [1], [2], the goal is to identify sets of items that appear at least a minimum number of times in a transaction database. For numerous applications, frequent patterns reveal important information that can help to understand the data or make decisions. For instance, discovering frequent itemsets in medical data can reveal that some symptoms frequently appear together, which provides useful information for disease diagnosis. However, a key limitation of frequent itemset mining is that the time dimension is ignored. To address this limitation, a more general problem called sequential pattern mining (SPM) was proposed [3], [4], [5], which consists of identifying subsequences that appear frequently in a sequence database. A sequence is an ordered list of itemsets and can be used to represent various types of data such as protein sequences, text, click streams and event logs. Many studies have been devoted to SPM. The first SPM algorithms were AprioriAll and GSP [4], [5]. They apply a breadth-first search to count the support of sequences in a database and output all frequent sequences. Subsequently, more efficient SPM algorithms were designed, such as SPADE [6], FreeSpan [7], PrefixSpan [8], CM-SPADE [9], VMSP [10] and FCloSM [11]. These algorithms adopt various strategies, such as using a vertical database representation or a pattern-growth approach, to find frequent sequences efficiently. To meet the requirements of various domains, several SPM extensions have been proposed, which make it possible to take into account constraints and more complex data types [3]. One of the most popular SPM extensions in recent years is High Utility Sequential Pattern Mining (HUSPM) [12], [13], [14], [15]. It consists of finding sequences having a high utility (e.g. sequences of purchases that yield a high profit). HUSPM generalizes SPM and is a much more difficult problem because the utility measure is neither anti-monotonic nor monotonic. That is, the utility of a sequence may be equal to, greater than or smaller than those of its supersequences or subsequences. For this reason, HUSPM cannot be performed using traditional SPM techniques, which require anti-monotonic or monotonic measures to reduce the search space [12]. HUSPM makes it possible to find important patterns in data, where importance is assessed using a numerical utility measure. For example, in customer transaction analysis, the utility measure can model the profit yielded by sequences of purchases to identify profitable patterns. Another example is website click-stream analysis, where the utility can model the time spent on webpages, to find sequences of webpages where people spend a lot of time. Identifying such patterns can be useful for decision-making [12], [13], [14].
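The notion of support used by SPM can be illustrated with a minimal sketch. It assumes, for simplicity, that each sequence is a list of single events rather than a list of itemsets; the helper names `is_subsequence` and `support` and the toy database are hypothetical and are not taken from any of the cited algorithms.

```python
def is_subsequence(pattern, sequence):
    """Return True if the events of `pattern` occur in `sequence`
    in the same order (gaps between events are allowed)."""
    remaining = iter(sequence)
    # `event in remaining` advances the iterator until the event is found,
    # so each later event must be matched after the previous one.
    return all(event in remaining for event in pattern)


def support(pattern, database):
    """Number of sequences of `database` that contain `pattern`."""
    return sum(1 for sequence in database if is_subsequence(pattern, sequence))


# Toy sequence database (illustrative data only).
database = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a", "c"],
    ["c", "a"],
]

print(support(["a", "c"], database))
print(support(["a", "b", "c"], database))
```

Because every sequence containing a pattern also contains all of its subsequences, support can only stay equal or decrease when a pattern is extended. This is the anti-monotonic property that frequent pattern miners exploit for pruning and that the utility measure lacks.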

Although HUSPM is an emerging research problem with several applications, a major limitation is that utility alone is often insufficient to truly evaluate the usefulness of a pattern. In fact, traditional HUSPM algorithms assess the utility or benefits that each pattern provides, but ignore the resources, effort, time or cost required to apply these patterns. This problem is illustrated with an example. Consider medical pathway data indicating the various treatments received by patients of a hospital. Applying HUSPM to this data makes it possible to discover high utility patterns, where the utility can represent whether patients are cured or not after receiving treatments. For example, a pattern $\langle treatmentA, treatmentB, treatmentC, cured \rangle$ may be discovered by traditional HUSPM algorithms, indicating that many people who have received these treatments are cured, where $cured$ is a high utility item. Although such patterns may seem useful, a major problem is that HUSPM ignores the cost of applying these patterns, that is, the money, time or resources spent to cure each patient using these treatments. As a result, a HUSPM algorithm may find many patterns that have a high utility but also a very high cost, which is undesirable. Similarly, a HUSPM algorithm would miss all patterns that do not have a very high utility but are still useful because they provide a good trade-off between cost and utility.

A second example of this issue can be found in the domain of e-learning. Consider the analysis of sequences of learning activities performed by learners in an e-learning environment, to identify sequences of learning activities that help obtain high scores. A HUSPM algorithm could find many sequences of activities leading to high scores (scores would be modeled as the utility values of items). But HUSPM algorithms would ignore the time spent (cost) on these activities to achieve these scores. Thus, many patterns could be found that have a high utility but also a very high cost (requiring a lot of time), which may represent ineffective ways of studying. Moreover, patterns having a better trade-off between cost and utility may be missed.

To address this limitation, this paper proposes to find patterns by considering not only a utility model indicating the benefits provided by patterns, but also a cost model to assess the resources used or time spent to apply the patterns. Combining the concepts of cost and utility is desirable but challenging. One reason is that there are many ways of measuring utility and cost. Cost can, for example, be measured in terms of time or money, while utility may represent time, profit, evaluation scores or failure/success. Because utility and cost can be measured using different units, it is not possible to simply combine the two concepts by subtracting the cost from the utility and then applying a traditional HUSPM algorithm. Another problem of this approach is that it would not make it possible to evaluate how good the trade-off is between cost and utility for each pattern, and how strong the relationship is between resources spent and utility. For applications such as analyzing medical pathways, it is generally desirable to find patterns that not only have a high utility but also offer an excellent trade-off between utility and cost. Thus, cost and utility must be modeled separately and a tailored solution needs to be designed.
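The unit problem described above can be made concrete with a small sketch. The two patterns, their costs and utilities, and the function names are hypothetical. The cost-to-utility ratio is used here only as one plausible illustration of a trade-off; it is not claimed to be the measure defined in the paper.

```python
# Hypothetical patterns: cost in minutes of study time, utility as an exam score.
patterns = {
    "P": {"cost_min": 120.0, "utility": 80.0},
    "Q": {"cost_min": 30.0, "utility": 40.0},
}


def best_by_difference(patterns, cost_scale):
    """Pick the pattern maximizing utility minus cost, with the cost
    converted to another unit by multiplying it by `cost_scale`."""
    return max(
        patterns,
        key=lambda name: patterns[name]["utility"]
        - patterns[name]["cost_min"] * cost_scale,
    )


def best_by_ratio(patterns, cost_scale):
    """Pick the pattern minimizing cost per unit of utility."""
    return min(
        patterns,
        key=lambda name: patterns[name]["cost_min"] * cost_scale
        / patterns[name]["utility"],
    )


# Subtracting cost from utility gives a ranking that depends on the cost unit.
print(best_by_difference(patterns, 1.0))       # cost in minutes
print(best_by_difference(patterns, 1.0 / 60))  # cost in hours
# A ratio keeps the same ranking whichever unit is chosen.
print(best_by_ratio(patterns, 1.0), best_by_ratio(patterns, 1.0 / 60))
```

With costs in minutes the difference favors Q, while with costs in hours it favors P, although nothing about the data has changed. Keeping the two quantities separate avoids this arbitrariness.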

This article addresses this challenge by defining the novel problem of extracting Cost-Effective Patterns (CEP). The aim is to find patterns that provide a good trade-off between cost and utility from sequences containing utility and cost information. The main contributions of this study are threefold.

• The task of discovering cost-effective patterns in sequences is defined. In particular, three variations of the problem are defined to address the needs of different applications. In the first problem, information about the utility is encoded as a binary label for each sequence. The utility represents a desirable or undesirable outcome (e.g. passing or failing an exam, being cured or not after receiving some medical treatments). In the second problem, utility is encoded as a positive number (e.g. a score obtained after completing a course). In the third problem, the utility is a binary value and it is assumed that records are available only for the positive class. Properties of the three proposed problems are studied. Moreover, statistical measures are designed to assess the correlation between utility and cost for these three problems.
• Three pattern-growth algorithms are presented to find cost-effective patterns for the three problem variations. The algorithms are named CEPB, corCEPB and CEPN, respectively. To efficiently find patterns, it is necessary to design a strategy to avoid exploring all possible patterns. However, a challenge is that the average cost measure cannot be used to reduce the search space because it is neither anti-monotonic nor monotonic. To address this issue and prune the search space efficiently, two tight lower bounds on the average cost of patterns, named ASC and AMSC, are proposed.
• Moreover, to reduce memory usage and speed up the algorithms, a technique called Projected Database Buffer (PDB) is integrated into the designed algorithms.
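The claim that the average cost is neither anti-monotonic nor monotonic can be checked on a toy example. The sketch below rests on assumptions: the cost of a pattern in a sequence is taken to be the sum of the costs of the events in its leftmost occurrence, and the average is taken over the sequences that contain the pattern. The paper's exact definitions may differ, and the ASC and AMSC bounds are not reproduced here. All names and data are hypothetical.

```python
def pattern_cost(pattern, sequence):
    """Cost of the leftmost occurrence of `pattern` in `sequence`, where
    `sequence` is a list of (event, cost) pairs. Returns None if the
    pattern does not occur. (Assumed definition, for illustration.)"""
    total, matched = 0.0, 0
    for event, cost in sequence:
        if matched < len(pattern) and event == pattern[matched]:
            total += cost
            matched += 1
    return total if matched == len(pattern) else None


def average_cost(pattern, database):
    """Mean cost of `pattern` over the sequences that contain it."""
    costs = [pattern_cost(pattern, sequence) for sequence in database]
    costs = [cost for cost in costs if cost is not None]
    return sum(costs) / len(costs) if costs else None


# Toy event log: each event carries a cost (illustrative values).
database = [
    [("a", 100.0), ("c", 5.0)],
    [("a", 1.0), ("b", 1.0)],
]

print(average_cost(["a"], database))       # baseline
print(average_cost(["a", "b"], database))  # extension lowers the average
print(average_cost(["a", "c"], database))  # extension raises the average
```

Extending the pattern ⟨a⟩ with b drops the expensive first sequence from the average, while extending it with c keeps only that sequence. Since an extension can move the average in either direction, the measure itself gives no safe pruning condition, which is why a lower bound is needed.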

Several experiments have been performed to assess the performance of the proposed algorithms on various benchmark datasets. Results indicate that the proposed optimizations and the pruning strategy based on the lower-bound decrease runtimes of the proposed algorithms by up to 10 times. Moreover, an analysis of patterns found in real-life e-learning data shows that insightful patterns are found using the proposed algorithms.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 defines the proposed problems of discovering cost-effective patterns. Section 4 describes the proposed algorithms. Section 5 presents the experimental evaluation. Finally, Section 6 draws the conclusion.

Section snippets

Related work

This section is divided into five subsections. The first subsection introduces pattern mining and relevant concepts. The second and third subsections discuss frequent and high utility sequential pattern mining, respectively. Finally, the fourth and fifth subsections discuss early work toward the consideration of cost in pattern mining, and related work on emerging pattern mining.

Problem definition

This section proposes the novel problem of discovering Cost-Effective Patterns (CEP). Three cases (variations of this problem) are presented to address the needs of different applications. The type of database that is considered is Sequential Event Logs (SEL) where events are annotated with cost values and each sequence has utility information either encoded as a binary or a numeric value. This section first introduces the type of data and important concepts. Then the proposed problem is
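As a rough sketch of the kind of data described in this snippet, the following types model events annotated with a cost and sequences carrying either a binary or a numeric utility. The class names, field names and sample values are assumptions made for illustration and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Event:
    name: str
    cost: float  # e.g. minutes spent or money consumed (assumed unit)


@dataclass(frozen=True)
class LoggedSequence:
    events: Tuple[Event, ...]
    utility: Union[bool, float]  # binary outcome or numeric score


def total_cost(sequence: LoggedSequence) -> float:
    """Sum of the costs of all events of a sequence."""
    return sum(event.cost for event in sequence.events)


# Binary utility: did the learner pass? (illustrative data)
binary_log = [
    LoggedSequence((Event("lesson1", 30.0), Event("quiz1", 10.0)), utility=True),
    LoggedSequence((Event("lesson1", 45.0), Event("lesson2", 60.0)), utility=False),
]

# Numeric utility: final score obtained (illustrative data)
numeric_log = [
    LoggedSequence((Event("lesson1", 20.0), Event("quiz1", 15.0)), utility=82.5),
]

print(total_cost(binary_log[0]), binary_log[0].utility)
print(total_cost(numeric_log[0]), numeric_log[0].utility)
```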

Proposed algorithms

This section presents algorithms to efficiently discover Cost-Effective Patterns (CEP) for the three cases introduced in the previous section. The algorithms rely on the basic search procedure of the PrefixSpan algorithm [8] to explore the search space of sequences in a database. This procedure starts by considering patterns containing single events and then recursively grows these patterns by appending items one at a time. To reduce the cost of scanning the database to calculate the measures,
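The pattern-growth search mentioned in this snippet can be sketched as follows. This is a simplified PrefixSpan-style enumeration that uses only a minimum support threshold and treats each sequence as a list of single events. It does not implement CEPB, corCEPB or CEPN, their cost measures, or the projected database buffer. The function name, parameters and data are hypothetical.

```python
from collections import defaultdict


def pattern_growth(database, min_support):
    """Enumerate frequent sequential patterns by recursive prefix projection.

    `database` is a list of sequences (lists of events). Returns a dict
    mapping each frequent pattern (a tuple of events) to its support.
    """
    results = {}

    def grow(prefix, projected):
        # `projected` holds one (sequence index, start position) pair per
        # sequence containing `prefix`; only events from the start position
        # onward can extend the prefix.
        first_position = defaultdict(dict)  # event -> {sequence index: position}
        for sid, start in projected:
            sequence = database[sid]
            for pos in range(start, len(sequence)):
                first_position[sequence[pos]].setdefault(sid, pos)
        for event in sorted(first_position):
            occurrences = first_position[event]
            if len(occurrences) >= min_support:
                pattern = prefix + (event,)
                results[pattern] = len(occurrences)
                # Project past the earliest match and keep growing the pattern.
                grow(pattern, [(sid, pos + 1) for sid, pos in occurrences.items()])

    # Start from the empty prefix, which yields the single-event patterns first.
    grow((), [(sid, 0) for sid in range(len(database))])
    return results


database = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a", "c"],
]

for pattern, count in pattern_growth(database, min_support=2).items():
    print(pattern, count)
```

Each recursive call scans only the projected suffixes rather than the whole database, which is the main idea behind the pattern-growth approach.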

Experimental evaluation

This section first reports results of experiments to assess the performance of the proposed CEPB, corCEPB and CEPN algorithms. Then, a case study is presented, where the proposed algorithms have been applied to e-learning data to identify interesting cost-effective patterns. All algorithms were implemented in Java and experiments were carried out on a computer having a 64-bit Xeon E3-1270 3.6 GHz CPU, running the Windows 10 operating system and equipped with 64 GB of RAM.

Conclusion

This article presented a novel problem of discovering cost-effective patterns in event sequences by considering both a utility and a cost model. Three versions of the problem have been defined, to be applied in different real-life scenarios. A performance evaluation performed on four real-life datasets has shown that search space pruning using the average cost measure is effective and that the projected database buffer technique improves performance. A case study on data from an e-learning

References (55)

Zhang C. et al., An up-to-date comparison of state-of-the-art classification algorithms, Expert Syst. Appl. (2017)

Fournier-Viger P. et al., Mining local and peak high utility itemsets, Inform. Sci. (2019)

Fournier-Viger P. et al., Efficient algorithms to identify periodic patterns in multiple sequences, Inform. Sci. (2019)

Lan G.-C. et al., Applying the maximum utility measure in high utility sequential pattern mining, Expert Syst. Appl. (2014)

Ryang H. et al., High utility pattern mining over data streams with sliding window technique, Expert Syst. Appl. (2016)

Yun U. et al., An efficient algorithm for mining high utility patterns from incremental databases with one database scan, Knowl.-Based Syst. (2017)

Van der Aalst W.M. et al., Process mining: a research agenda, Comput. Ind. (2004)

Dalmas B. et al., Twincle: A constrained sequential rule mining algorithm for event logs

Han J. et al., Mining frequent patterns without candidate generation: A frequent-pattern tree approach, Data Min. Knowl. Discov. (2004)

Fournier-Viger P. et al., A survey of itemset mining, WIREs Data Min. Knowl. Discov. (2017)

Fournier-Viger P. et al., A survey of sequential pattern mining, Data Sci. Pattern Recognit. (2017)

Agrawal R. et al., Mining sequential patterns

Srikant R. et al., Mining sequential patterns: Generalizations and performance improvements

Zaki M.J., Spade: An efficient algorithm for mining frequent sequences, Mach. Learn. (2001)

Han J. et al., Freespan: frequent pattern-projected sequential pattern mining

Pei J. et al., Mining sequential patterns by pattern-growth: The prefixspan approach, IEEE Trans. Knowl. Data Eng. (2004)

P. Fournier-Viger, A. Gomariz, M. Campos, R. Thomas, Fast vertical mining of sequential patterns using co-occurrence...

P. Fournier-Viger, C. Wu, A. Gomariz, V.S. Tseng, VMSP: efficient vertical mining of maximal sequential patterns, in:...

Le B. et al., Fclosm, fgensm: two efficient algorithms for mining frequent closed and generator sequences using the local pruning strategy, Knowl. Inf. Syst. (2017)

Yin J. et al., Uspan: an efficient algorithm for mining high utility sequential patterns

Zihayat M. et al., Mining significant high utility gene regulation sequential patterns, BMC Syst. Biol. (2017)

O.K. Alkan, P. Karagoz, Crom and huspext: Improving efficiency of high utility sequential pattern extraction, in: Proc....

Truong-Chi T. et al., A survey of high utility sequential pattern mining

U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, et al. Knowledge discovery and data mining: Towards a unifying framework....

Aggarwal C.C., Data Mining: The Textbook (2015)

Maimon O. et al., Introduction to knowledge discovery and data mining

Novak P.K. et al., Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining, J. Mach. Learn. Res. (2009)

^☆: No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.105241.
