Mining cost-effective patterns in event logs☆
Introduction
Discovering patterns in symbolic data is an important research area in data mining. Early studies in this area mainly focused on discovering frequent patterns. For instance, in frequent itemset mining [1], [2], the goal is to identify sets of items that appear at least a minimum number of times in a transaction database. For numerous applications, frequent patterns reveal important information that can help to understand the data or make decisions. For instance, discovering frequent itemsets in medical data can reveal that some symptoms frequently appear together, which provides useful information for disease diagnosis. However, a key limitation of frequent itemset mining is that the time dimension is ignored. To address this limitation, a more general problem called sequential pattern mining (SPM) was proposed [3], [4], [5], which consists of identifying subsequences that appear frequently in a sequence database. A sequence is an ordered list of itemsets and can be used to represent various types of data such as protein sequences, text, click streams and event logs.
Many studies have been devoted to SPM. The first SPM algorithms are AprioriAll and GSP [4], [5]. They apply a breadth-first search to count the support of sequences in a database and output all frequent sequences. More efficient SPM algorithms were later designed, such as SPADE [6], FreeSpan [7], PrefixSpan [8], CM-SPADE [9], VMSP [10] and FCloSM [11]. These algorithms adopt various strategies, such as using a vertical database representation or a pattern-growth approach, to find frequent sequences efficiently. To meet the requirements of various domains, several SPM extensions have been proposed, which take into account constraints and more complex data types [3].
One of the most popular SPM extensions in recent years is High Utility Sequential Pattern Mining (HUSPM) [12], [13], [14], [15]. It consists of finding sequences having a high utility (e.g. sequences of purchases that yield a high profit). HUSPM generalizes SPM and is a much more difficult problem because the utility measure is neither anti-monotonic nor monotonic. That is, the utility of a sequence may be equal to, greater than or smaller than those of its supersequences or subsequences. For this reason, HUSPM cannot be performed using traditional SPM techniques, which require anti-monotonic or monotonic measures to reduce the search space [12]. HUSPM finds important patterns in data, where importance is assessed using a numerical utility measure. For example, in customer transaction analysis, the utility measure can model the profit yielded by sequences of purchases to identify profitable patterns. Another example is website click-stream analysis, where the utility can model the time spent on webpages to find sequences of webpages where people spend a lot of time. Identifying such patterns can be useful for decision-making [12], [13], [14].
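To make the notion of a frequent subsequence concrete, the following minimal sketch counts the support of a pattern in a small sequence database, where each sequence is an ordered list of itemsets. The toy database, the helper names (`contains`, `support`, `seq`) and the minimum support of 2 are illustrative assumptions and are not taken from any of the cited algorithms.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]
Seq = List[Itemset]


def contains(sequence: Seq, pattern: Seq) -> bool:
    """Return True if `pattern` is a subsequence of `sequence`.

    Each itemset of the pattern must be a subset of a distinct itemset of
    the sequence, and the matched itemsets must appear in the same order.
    A greedy left-to-right scan is sufficient to decide this.
    """
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def support(database: Sequence[Seq], pattern: Seq) -> int:
    """Number of sequences of the database that contain the pattern."""
    return sum(1 for sequence in database if contains(sequence, pattern))


def seq(*itemsets: str) -> Seq:
    """Hypothetical helper: seq("ab", "c") builds the sequence <{a, b}, {c}>."""
    return [frozenset(s) for s in itemsets]


# Toy sequence database (assumed data, for illustration only).
database = [
    seq("a", "bc", "d"),
    seq("ab", "c"),
    seq("b", "a", "d"),
    seq("a", "d"),
]

minsup = 2  # assumed minimum support threshold
pattern = seq("a", "d")
is_frequent = support(database, pattern) >= minsup
```

Here the pattern <{a}, {d}> is contained in the first, third and fourth sequences, so it is frequent for the assumed threshold, whereas <{d}, {a}> is contained in none because the order of itemsets matters.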
Although HUSPM is an emerging research problem with several applications, a major limitation is that utility is often insufficient to truly evaluate the usefulness of a pattern. In fact, traditional HUSPM algorithms assess the utility or benefits that each pattern provides, but ignore the resources, effort, time or cost required to apply these patterns. This problem is illustrated with an example. Consider medical pathway data indicating the various treatments received by patients of a hospital. Applying HUSPM to this data allows discovering high utility patterns, where the utility can represent whether patients are cured or not after receiving treatments. For example, traditional HUSPM algorithms may discover a pattern indicating that many people who have received some treatments are cured, where being cured is a high utility item. Although such patterns may seem useful, a major problem is that HUSPM ignores the cost of applying these patterns, that is, the money, time or resources spent to cure each patient using these treatments. As a result, a HUSPM algorithm may find many patterns that have a high utility but also a very high cost, which is undesirable. Similarly, a HUSPM algorithm would miss all patterns that do not have a very high utility but are still useful because they provide a good trade-off between cost and utility.
A second example of this issue can be found in the domain of e-learning. Consider the analysis of sequences of learning activities performed by learners in an e-learning environment, to identify sequences of learning activities that help obtain high scores. A HUSPM algorithm could find many sequences of activities leading to high scores (scores would be modeled as the utility values of items). But HUSPM algorithms would ignore the time spent (cost) on these activities to achieve these scores. Thus, many patterns could be found that have a high utility but also a very high cost (requiring a lot of time), which may represent ineffective ways of studying. Moreover, patterns having a better trade-off between cost and utility may be missed.
To address this limitation, this paper proposes to find patterns by considering not only a utility model indicating the benefits provided by patterns, but also a cost model to assess the resources used or time spent to apply the patterns. Combining the concepts of cost and utility is desirable but challenging. One reason is that there are many ways of measuring utility and cost. Cost can for example be measured in terms of time or money, while utility may represent time, profit, evaluation scores or failure/success. Because utility and cost can be measured using different units, it is not possible to simply combine them by subtracting the cost from the utility and then applying a traditional HUSPM algorithm. Another problem of this approach is that it would not allow evaluating how good the trade-off between cost and utility is for each pattern, and how strong the relationship is between resources spent and utility. For applications such as analyzing medical pathways, it is generally desirable to find patterns that not only have a high utility but also offer an excellent trade-off between utility and cost. Thus, cost and utility must be modeled separately and a tailored solution needs to be designed.
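The unit problem mentioned above can be illustrated with a small numerical sketch. The two patterns, their average utilities (exam points) and their average costs (study time) are invented for illustration, and the cost-per-utility ratio is only one assumed way of keeping the two quantities separate; it is not claimed to be the measure defined later in this paper.

```python
from typing import Dict, List, Tuple

# Hypothetical statistics: pattern name -> (average utility in points,
# average cost in minutes). These numbers are assumed, not experimental data.
Stats = Dict[str, Tuple[float, float]]
stats_minutes: Stats = {"A": (80.0, 120.0), "B": (60.0, 30.0)}


def to_hours(stats: Stats) -> Stats:
    """Express the same costs in hours instead of minutes."""
    return {name: (u, c / 60.0) for name, (u, c) in stats.items()}


def rank_by_difference(stats: Stats) -> List[str]:
    """Rank patterns by utility minus cost, best first."""
    return sorted(stats, key=lambda n: stats[n][0] - stats[n][1], reverse=True)


def rank_by_ratio(stats: Stats) -> List[str]:
    """Rank patterns by cost per unit of utility, best (lowest) first."""
    return sorted(stats, key=lambda n: stats[n][1] / stats[n][0])


stats_hours = to_hours(stats_minutes)

# Subtraction: the ranking depends on the unit chosen for the cost.
diff_minutes = rank_by_difference(stats_minutes)  # B before A
diff_hours = rank_by_difference(stats_hours)      # A before B

# Ratio: the ranking is the same whatever the unit of the cost.
ratio_minutes = rank_by_ratio(stats_minutes)
ratio_hours = rank_by_ratio(stats_hours)
```

With costs in minutes, subtraction prefers pattern B, but with the same costs in hours it prefers pattern A, although the data has not changed. A measure that treats cost and utility as separate quantities, such as the assumed ratio, gives the same ranking in both cases.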
This article addresses this challenge by defining the novel problem of extracting Cost-Effective Patterns (CEP). The aim is to find patterns that provide a good trade-off between cost and utility from sequences containing utility and cost information. The main contributions of this study are threefold.
- The task of discovering cost-effective patterns in sequences is defined. In particular, three variations of the problem are defined to address the needs of different applications. In the first problem, information about the utility is encoded as a binary label for each sequence. The utility represents a desirable or undesirable outcome (e.g. passing or failing an exam, being cured or not after receiving some medical treatments). In the second problem, utility is encoded as a positive number (e.g. a score obtained after completing a course). In the third problem, the utility is a binary value and it is assumed that records are available only for the positive class. Properties of the three proposed problems are studied. Moreover, statistical measures are designed to assess the correlation between utility and cost for these three problems.
- Three pattern-growth algorithms are presented to find cost-effective patterns for the three problem variations. The algorithms are named CEPB, corCEPB and CEPN, respectively. To efficiently find patterns, it is necessary to design a strategy to avoid exploring all possible patterns. However, a challenge is that the average cost measure cannot be used to reduce the search space because it is neither anti-monotonic nor monotonic. To address this issue and prune the search space efficiently, two tight lower bounds on the average cost of patterns are proposed, named ASC and AMSC.
- Moreover, to reduce memory usage and speed up the algorithms, a technique called Projected Database Buffer (PDB) is integrated into the designed algorithms.
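The claim in the second contribution that the average cost is neither anti-monotonic nor monotonic can be checked on a toy event log. The sketch below assumes that the cost of a pattern in a sequence is the sum of the costs of the events matched by its leftmost occurrence, and that the average cost is the mean of these costs over the sequences containing the pattern. Both definitions, the function names and the data are simplifying assumptions made for illustration; the ASC and AMSC lower bounds are not reproduced here.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, float]  # (event name, cost of the event)
Log = List[List[Event]]


def pattern_cost(sequence: List[Event], pattern: List[str]) -> Optional[float]:
    """Cost of the leftmost occurrence of `pattern`, or None if absent."""
    total = 0.0
    pos = 0
    for name, cost in sequence:
        if pos < len(pattern) and name == pattern[pos]:
            total += cost
            pos += 1
    return total if pos == len(pattern) else None


def average_cost(log: Log, pattern: List[str]) -> Optional[float]:
    """Mean pattern cost over the sequences that contain the pattern."""
    costs = [c for c in (pattern_cost(s, pattern) for s in log) if c is not None]
    return sum(costs) / len(costs) if costs else None


# Toy event log (assumed data): each event is annotated with a cost.
log: Log = [
    [("a", 10.0), ("b", 1.0), ("c", 200.0)],
    [("a", 100.0)],
    [("a", 100.0)],
]

cost_a = average_cost(log, ["a"])        # averaged over three sequences
cost_ab = average_cost(log, ["a", "b"])  # only the first sequence matches
cost_ac = average_cost(log, ["a", "c"])  # only the first sequence matches
```

Extending the pattern ⟨a⟩ with b lowers the average cost, because the expensive sequences no longer match, while extending it with c raises it. A threshold on the average cost of ⟨a⟩ therefore says nothing about its extensions, which is why a separate bound is needed for pruning.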
Several experiments have been performed to assess the performance of the proposed algorithms on various benchmark datasets. Results indicate that the proposed optimizations and the pruning strategy based on the lower bounds decrease the runtimes of the proposed algorithms by up to 10 times. Moreover, an analysis of patterns found in real-life e-learning data shows that insightful patterns are found using the proposed algorithms.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 defines the proposed problems of discovering cost-effective patterns. Section 4 describes the proposed algorithms. Section 5 presents the experimental evaluation. Finally, Section 6 draws the conclusion.
Related work
This section is divided into five subsections. The first subsection introduces pattern mining and relevant concepts. The second and third subsections discuss frequent and high utility sequential pattern mining, respectively. Finally, the fourth and fifth subsections discuss early work toward the consideration of cost in pattern mining, and related work on emerging pattern mining.
Problem definition
This section proposes the novel problem of discovering Cost-Effective Patterns (CEP). Three cases (variations of this problem) are presented to address the needs of different applications. The type of database that is considered is Sequential Event Logs (SEL) where events are annotated with cost values and each sequence has utility information either encoded as a binary or a numeric value. This section first introduces the type of data and important concepts. Then the proposed problem is
Proposed algorithms
This section presents algorithms to efficiently discover Cost-Effective Patterns (CEP) for the three cases introduced in the previous section. The algorithms rely on the basic search procedure of the PrefixSpan algorithm [8] to explore the search space of sequences in a database. This procedure starts by considering patterns containing single events and then recursively grows these patterns by appending items one at a time. To reduce the cost of scanning the database to calculate the measures,
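The search procedure described above, which starts from single events and recursively appends one event at a time, can be sketched as follows. This is a simplified pattern-growth enumeration in the spirit of PrefixSpan, written under strong assumptions: each position of a sequence holds a single event, only the support is computed, and no cost measure, utility measure, lower bound or projected database buffer is included. The function name and the toy database are hypothetical.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]


def pattern_growth(database: List[List[str]], minsup: int) -> Dict[Pattern, int]:
    """Enumerate all sequential patterns with support >= minsup."""
    frequent: Dict[Pattern, int] = {}

    def grow(prefix: Pattern, projected: List[Tuple[int, int]]) -> None:
        # `projected` holds (sequence id, start position) pairs: the suffix
        # of each sequence that follows the leftmost occurrence of `prefix`.
        counts: Dict[str, int] = {}
        for sid, start in projected:
            # Count each event at most once per projected sequence.
            for event in set(database[sid][start:]):
                counts[event] = counts.get(event, 0) + 1

        for event, sup in sorted(counts.items()):
            if sup < minsup:
                continue  # support is anti-monotonic, so pruning is safe
            pattern = prefix + (event,)
            frequent[pattern] = sup

            # Build the projected database of the extended pattern.
            new_projected: List[Tuple[int, int]] = []
            for sid, start in projected:
                try:
                    idx = database[sid].index(event, start)
                except ValueError:
                    continue  # this suffix does not contain the event
                new_projected.append((sid, idx + 1))
            grow(pattern, new_projected)

    grow((), [(sid, 0) for sid in range(len(database))])
    return frequent


# Toy database of event sequences (assumed data).
database = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a", "c"],
    ["a", "b"],
]
patterns = pattern_growth(database, minsup=2)
```

Storing projected sequences as (sequence id, position) pairs avoids copying suffixes, which is a common way of implementing database projection and not necessarily the representation used by the proposed algorithms.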
Experimental evaluation
This section first reports results of experiments to assess the performance of the proposed CEPB, corCEPB and CEPN algorithms. Then, a case study is presented, where the proposed algorithms have been applied to e-learning data to identify interesting cost-effective patterns. All algorithms were implemented in Java and experiments were carried out on a computer having a bit Xeon E3-1270 3.6 GHz CPU, running the Windows operating system and equipped with GB of RAM.
Conclusion
This article presented a novel problem of discovering cost-effective patterns in event sequences by considering both a utility and a cost model. Three versions of the problem have been defined, to be applied in different real-life scenarios. A performance evaluation performed on four real-life datasets has shown that search space pruning using the average cost measure is effective and that the projected database buffer technique improves performance. A case study on data from an e-learning
References (55)
- et al. An up-to-date comparison of state-of-the-art classification algorithms. Expert Syst. Appl. (2017)
- et al. Mining local and peak high utility itemsets. Inform. Sci. (2019)
- et al. Efficient algorithms to identify periodic patterns in multiple sequences. Inform. Sci. (2019)
- et al. Applying the maximum utility measure in high utility sequential pattern mining. Expert Syst. Appl. (2014)
- et al. High utility pattern mining over data streams with sliding window technique. Expert Syst. Appl. (2016)
- et al. An efficient algorithm for mining high utility patterns from incremental databases with one database scan. Knowl.-Based Syst. (2017)
- et al. Process mining: a research agenda. Comput. Ind. (2004)
- et al. Twincle: A constrained sequential rule mining algorithm for event logs
- et al. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Min. Knowl. Discov. (2004)
- et al. A survey of itemset mining. WIREs Data Min. Knowl. Discov. (2017)
- A survey of sequential pattern mining. Data Sci. Pattern Recognit.
- Mining sequential patterns
- Mining sequential patterns: Generalizations and performance improvements
- Spade: An efficient algorithm for mining frequent sequences. Mach. Learn.
- Freespan: frequent pattern-projected sequential pattern mining
- Mining sequential patterns by pattern-growth: The prefixspan approach. IEEE Trans. Knowl. Data Eng.
- Fclosm, fgensm: two efficient algorithms for mining frequent closed and generator sequences using the local pruning strategy. Knowl. Inf. Syst.
- Uspan: an efficient algorithm for mining high utility sequential patterns
- Mining significant high utility gene regulation sequential patterns. BMC Syst. Biol.
- A survey of high utility sequential pattern mining
- Data Mining: The Textbook
- Introduction to knowledge discovery and data mining
- Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. J. Mach. Learn. Res.
☆ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.105241.