HANP-Miner: High average utility nonoverlapping sequential pattern mining
Introduction
Sequential pattern mining (SPM) [1], an important research topic, aims to mine subsequences (patterns) in sequence datasets. It has been widely used for analyzing biological sequences [2], customer purchase behavior [3], time series [4], [5], inspection reports [6], [7], etc. To meet different requirements, many types of patterns have been proposed, such as high utility patterns [8], [9], high average utility patterns [10], and closed patterns [11]. To mine these types of patterns, various mining methods have been developed, such as high utility pattern mining [12], [13], [14], high average utility pattern mining [15], maximal frequent pattern mining [16], [17], tri-partition pattern mining [18], negative sequential pattern mining [19], [20], co-location pattern mining [21], outlying pattern mining [22], and closed pattern mining [23], [24]. Traditional SPM describes a sequence such as c(abd)(bd)(cd) as an ordered list of sets of items (i.e., a list of sets of symbols or characters). For instance, this sequence indicates that item c is followed by a, b, and d, followed by b and d, and then followed by c and d. Traditional methods only consider whether or not a pattern occurs in a sequence, but ignore how many times the pattern occurs in it. To solve this problem, a sequence can instead be described as a list of single items such as cabdbdcd, and all occurrences of patterns can be counted.
Furthermore, to avoid mining useless patterns, SPM with gap constraints [25], [26] was proposed to make the patterns more flexible. A pattern with gap constraints is expressed as $p = p_1[a,b]p_2 \cdots [a,b]p_m$, where $a$ and $b$ mean that at least $a$ and at most $b$ characters occur between $p_j$ and $p_{j+1}$, respectively [27]. Compared with traditional SPM, SPM with gap constraints is more difficult to solve, and has three forms: no condition [5], [28], one-off condition [29], [30], [31], and nonoverlapping condition [27], [32], [33]. Previous research has shown that, unlike the no condition case, the nonoverlapping condition avoids producing many redundant patterns, and unlike the one-off condition case, it also avoids overlooking valuable patterns [34]. In addition, nonoverlapping SPM (or SPM under the nonoverlapping condition) is a complete pattern mining method that satisfies the Apriori property.
Unfortunately, current studies based on nonoverlapping SPM only take the occurrence frequencies of patterns into account [34], and do not consider other factors that can help evaluate the importance of the patterns, such as purchase quantity, unit profit of an item [35], and the interest and weight of each item. As a result, the information extracted by traditional nonoverlapping SPM algorithms is insufficient for many applications. For example, in biological sequences, frequency alone may not be enough to discover a gene sequence related to a certain disease: a gene may not occur frequently, yet its high expression may mean that it is very significant. Conversely, an inhibitory gene may occur frequently, but not have a strong effect. Researchers have therefore proposed a more general problem called high utility SPM [36] that incorporates the external utility values of items into traditional SPM, so that both the frequencies of pattern occurrences and the importance of each item are taken into consideration. The goal is to find all sequences having a utility (importance) that is no less than a minimum utility threshold. An illustrative example is as follows.
Example 1 Suppose we have a sequence $s$ = CTCTTG and a pattern $p$ = C[0,2]T[0,2]T, where the utilities of C, G, and T are 8, 8, and 2, respectively, and the minimum utility threshold is set to 20. Fig. 1 shows all occurrences of pattern p in sequence s. As shown in Fig. 1, there are four occurrences of pattern p in sequence s under the no condition: $\langle 1,2,4 \rangle$, $\langle 1,2,5 \rangle$, $\langle 1,4,5 \rangle$, and $\langle 3,4,5 \rangle$, where each number indicates a position in sequence s. The nonoverlapping condition means that any character in the sequence can be rematched, but not at the same position of the pattern. In this example, $\langle 1,2,4 \rangle$ and $\langle 3,4,5 \rangle$ are two nonoverlapping occurrences, since $s_4$ matches $p_3$ and $p_2$, respectively. Hence, pattern p occurs twice under the nonoverlapping condition. According to the utility values, the utility of p is (8 + 2 + 2) × 2 = 24, which is greater than the minimum utility threshold of 20. Hence, pattern p is a high utility pattern.
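The counts in Example 1 can be checked with a small brute-force sketch in Python. It is only an illustration for very short sequences and is not the DFOM algorithm proposed in this paper. The function names (`occurrences`, `is_nonoverlapping`, `max_nonoverlapping`) are hypothetical, and the utility is taken as the sum of the item utilities multiplied by the support, which reproduces the value 24 in the example.

```python
from itertools import combinations


def occurrences(seq, pattern, gap):
    """Enumerate all occurrences (1-based positions) of pattern in seq,
    where gap = (a, b) bounds the number of characters between items."""
    a, b = gap
    results = []

    def extend(prefix):
        j = len(prefix)
        if j == len(pattern):
            results.append(tuple(prefix))
            return
        if j == 0:
            candidates = range(1, len(seq) + 1)
        else:
            last = prefix[-1]
            # next position leaves between a and b characters in between
            candidates = range(last + a + 1, min(last + b + 1, len(seq)) + 1)
        for pos in candidates:
            if seq[pos - 1] == pattern[j]:
                extend(prefix + [pos])

    extend([])
    return results


def is_nonoverlapping(o1, o2):
    """Two occurrences are nonoverlapping if they differ at every pattern index."""
    return all(x != y for x, y in zip(o1, o2))


def max_nonoverlapping(occs):
    """Brute force: largest subset of pairwise nonoverlapping occurrences."""
    for k in range(len(occs), 0, -1):
        for subset in combinations(occs, k):
            if all(is_nonoverlapping(x, y) for x, y in combinations(subset, 2)):
                return list(subset)
    return []


seq, pattern, gap = "CTCTTG", "CTT", (0, 2)
utilities = {"C": 8, "G": 8, "T": 2}

all_occs = occurrences(seq, pattern, gap)
chosen = max_nonoverlapping(all_occs)
support = len(chosen)
utility = sum(utilities[item] for item in pattern) * support
print(all_occs, chosen, utility)
```

The exhaustive subset search is exponential in the number of occurrences, so it is suitable only for checking toy examples like this one.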
From the above example, it can be seen that the utility of a pattern is the product of the sum of the utilities of its items and its support (the number of occurrences). However, a serious flaw of this measure is that it does not take the length of the pattern into account, meaning that it is easy to mine long but valueless patterns. For example, the pattern C[0,2]T[0,2]C[0,2]T[0,2]T[0,2]G occurs only once in the sequence, but its utility is 30, which is greater than 20. Thus, it is also a high utility pattern. From this example, we can see that it is unfair to measure the importance of patterns with different lengths using the same minimum utility threshold, since longer patterns tend to have higher utilities. To address this problem, inspired by the concept of high average utility in SPM [37], we propose a novel method called high average utility nonoverlapping sequential pattern (HANP) mining. There are three main differences between the previous work [34] and our problem: the mined patterns, the support calculation strategies, and the candidate pattern reduction strategies. More details are given in the Related Work section. The main contributions are as follows.
- (1) We address the problem of HANP mining, and propose an efficient mining algorithm called HANP-Miner which has two essential steps: support calculation and candidate pattern reduction.
- (2) To efficiently calculate the support, we propose a depth-first online matching (DFOM) algorithm, which adopts a depth-first online matching strategy and employs a simplified Nettree data structure [38].
- (3) We derive a strategy based on an upper bound on the average utility and combine it with the pattern join strategy to generate candidate patterns. These strategies can efficiently reduce the number of candidate patterns.
- (4) Experimental results on the DNA, VIRUS, and sales datasets verify that not only does HANP-Miner outperform other competitive algorithms, but also the HANPs mined in this way are significant.
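The length bias that motivates HANP mining can be made concrete with the two patterns discussed in the introduction. The sketch below assumes that the average utility is the utility divided by the pattern length, which is the usual definition in high average utility mining; the function names are hypothetical, and the supports (2 and 1) are taken from the text rather than computed.

```python
def utility(pattern, utilities, support):
    """Sum of the item utilities multiplied by the support."""
    return sum(utilities[item] for item in pattern) * support


def average_utility(pattern, utilities, support):
    """Assumed definition: utility divided by the pattern length."""
    return utility(pattern, utilities, support) / len(pattern)


utilities = {"C": 8, "G": 8, "T": 2}

# (items of the pattern, nonoverlapping support in s = CTCTTG)
short_pattern, short_support = "CTT", 2
long_pattern, long_support = "CTCTTG", 1

for items, sup in [(short_pattern, short_support), (long_pattern, long_support)]:
    print(items, utility(items, utilities, sup), average_utility(items, utilities, sup))
```

Under this assumed definition the long pattern has the higher utility (30 versus 24) but the lower average utility (5 versus 8), so normalizing by length reverses the ranking of the two patterns.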
The structure of this paper is as follows. Section 2 introduces related work. Section 3 defines the problem. Section 4 proposes the HANP-Miner algorithm, which employs a depth-first online matching strategy to calculate the support and an upper bound on the average utility to reduce the number of candidate patterns, and presents a complexity analysis. Section 5 reports the results of comparative experiments on DNA, VIRUS and sales datasets, and analyzes these results. Section 6 concludes this paper.
Related work
SPM has been widely used in many fields, and various mining methods have been proposed. For example, web log mining [39] and transaction flow mining [40] were developed for different types of datasets. Van et al. [41], [42], [43] considered the problem of mining sequential patterns with itemset constraints to mine user access behavior on the web. A pseudo-IDList data structure, which is more suitable for mining clickstream patterns, was proposed in [44]. Based on this structure, a vertical
Problem definition
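A sequence s with length n is denoted as $s = s_1 s_2 \cdots s_n$, where $s_i \in \Sigma$ $(1 \le i \le n)$, $\Sigma$ represents the set of items in sequence s, and the size of $\Sigma$ can be expressed as $|\Sigma|$. For example, in a DNA sequence, $\Sigma$ is {A,T,C,G} and $|\Sigma|$ = 4.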
Definition 1 Pattern A pattern p with length m is denoted as $p = p_1[a,b]p_2 \cdots [a,b]p_m$ (or abbreviated as $p = p_1 p_2 \cdots p_m$ with gap = [a, b]), where a and b ($0 \le a \le b$) are integers indicating the minimum and maximum numbers of wildcards between $p_j$ and $p_{j+1}$, respectively.
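As an illustration of this notation, the following Python sketch parses a pattern written in the bracketed form used in Example 1 (e.g., C[0,2]T[0,2]T) and checks whether a list of positions respects the gap constraints. The helper names `parse_pattern` and `satisfies_gaps` are hypothetical, and single-character items are assumed.

```python
import re


def parse_pattern(text):
    """Split 'C[0,2]T[0,2]T' into items ['C', 'T', 'T'] and gaps [(0, 2), (0, 2)]."""
    # Non-capturing split so that only the items are returned
    items = re.split(r"\[\d+,\s*\d+\]", text)
    gaps = [(int(a), int(b)) for a, b in re.findall(r"\[(\d+),\s*(\d+)\]", text)]
    if any(len(item) != 1 for item in items):
        raise ValueError("each item is assumed to be a single character")
    if any(a > b for a, b in gaps):
        raise ValueError("each gap [a, b] must satisfy a <= b")
    return items, gaps


def satisfies_gaps(positions, gaps):
    """True if the number of characters between consecutive positions lies in [a, b]."""
    pairs = zip(positions, positions[1:])
    return all(a <= nxt - cur - 1 <= b for (cur, nxt), (a, b) in zip(pairs, gaps))


items, gaps = parse_pattern("C[0,2]T[0,2]T")
print(items, gaps)
print(satisfies_gaps((1, 2, 4), gaps), satisfies_gaps((1, 5, 6), gaps))
```

The second call returns a negative result because three characters lie between positions 1 and 5, which exceeds the maximum gap of 2.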
Definition 2 Occurrence and Nonoverlapping Occurrence Suppose we have a sequence and pattern
Proposed algorithm
The main factors affecting the performance of HANP mining are the calculation of the average utility and the generation of candidate patterns. The core difficulty in calculating the average utility lies in calculating the support. In Section 4.1, we therefore present the DFOM algorithm, which employs depth-first search and backtracking strategies to reduce the time and space complexities. To effectively generate candidate patterns, we apply a pattern join strategy that involves an upper bound
Experimental analysis
Section 5.1 describes the experimental datasets used and the algorithms used for comparison. In Section 5.2, we analyze the mining efficiency of different candidate pattern generation and support calculation strategies, and in Section 5.3, we compare and analyze the mining ability of the proposed algorithm and competitive algorithms. Section 5.4 presents a comparison of the mining results of NOSEP, GSgrow, and HANP-Miner, and an analysis of the superiority of adding the high average utility to
Conclusion and future work
To discover high average utility patterns in a sequence database (SDB), we focus on HANP mining, which takes into account not only the frequency of the pattern and the utility value of each item, but also the influence of the pattern length. To mine HANPs, we propose HANP-Miner, which is based on two main steps: pattern support calculation and candidate pattern reduction. To calculate the pattern support, we propose the DFOM algorithm, which applies a depth-first search online matching strategy.
CRediT authorship contribution statement
Youxi Wu: Conceptualization, Writing – original draft, Supervision, Funding acquisition. Meng Geng: Software, Writing – original draft, Validation, Investigation, Data curation. Yan Li: Conceptualization, Methodology, Formal analysis. Lei Guo: Validation, Resources. Zhao Li: Validation, Resources. Philippe Fournier-Viger: Writing – review & editing. Xingquan Zhu: Investigation, Writing – review & editing. Xindong Wu: Supervision, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was partly supported by the National Natural Science Foundation of China (61976240, 52077056, 917446209), the National Key Research and Development Program of China (2016YFB1000901), and the Natural Science Foundation of Hebei Province, China (Nos. F2020202013, E2020202033).
References (74)
- et al. A predictive GA-based model for closed high-utility itemset mining. Appl. Soft Comput. (2021)
- et al. Efficient closed high-utility fusion pattern model in large-scale databases. Inf. Fusion (2021)
- et al. Verttirp: Robust and efficient vertical frequent time interval-related pattern mining. Expert Syst. Appl. (2021)
- et al. Efficient sequential pattern mining with wildcards for keyphrase extraction. Knowl.-Based Syst. (2017)
- et al. NetNCSP: Nonoverlapping closed sequential pattern mining. Knowl.-Based Syst. (2020)
- et al. Efficient algorithms for mining frequent high utility sequences with constraints. Inform. Sci. (2021)
- et al. Damped window based high average utility pattern mining over data streams. Knowl.-Based Syst. (2018)
- et al. Efficient transaction deleting approach of pre-large based high utility pattern mining in dynamic databases. Future Gener. Comput. Syst. (2020)
- et al. Efficient algorithms for mining clickstream patterns using pseudo-idlists. Future Gener. Comput. Syst. (2020)
- et al. Efficient methods for mining weighted clickstream patterns. Expert Syst. Appl. (2020)
- Combination of dynamic bit vectors and transaction information for mining frequent closed sequences efficiently. Eng. Appl. Artif. Intell.
- Efficient high average-utility itemset mining using novel vertical weak upper-bounds. Knowl.-Based Syst.
- EHAUSM: An efficient algorithm for high average utility sequence mining. Inform. Sci.
- Advanced approach of sliding window based erasable pattern mining with list structure of industrial fields. Inform. Sci.
- Mining cost-effective patterns in event logs. Knowl.-Based Syst.
- Efficient approach of recent high utility stream pattern mining with indexed list structure and pruning strategy considering arrival times of transactions. Inform. Sci.
- One scan based high average-utility pattern mining in static and dynamic databases. Future Gener. Comput. Syst.
- PMBC: Pattern mining from biological sequences with wildcard constraints. Comput. Biol. Med.
- SPMF: A java open-source pattern mining library. J. Mach. Learn. Res.
- Mining distinguishing subsequence patterns with nonoverlapping condition. Cluster Comput.
- Conversion prediction from clickstream: Modeling market prediction and customer predictability. IEEE Trans. Knowl. Data Eng.
- Sequential pattern mining in databases with temporal uncertainty. Knowl. Inf. Syst.
- NetDAP: (delta, gamma) approximate pattern matching with length constraints. Appl. Intell.
- Fuzzy clustering of crowdsourced test reports for apps. ACM Trans. Internet Technol.
- Toward better summarizing bug reports with crowdsourcing elicited attributes. IEEE Trans. Reliab.
- A pre-large weighted-fusion system of sensed high-utility patterns. IEEE Sens. J.
- A survey of utility-oriented pattern mining. IEEE Trans. Knowl. Data Eng.
- Incrementally updating the high average-utility patterns with pre-large concept. Appl. Intell.
- Efficient algorithms for mining frequent high utility sequences with constraints. Inform. Sci.
- HUOPM: High-utility occupancy pattern mining. IEEE Trans. Cybern.
- Proum: Projection-based utility mining on sequence data. Inform. Sci.
- High average-utility sequential pattern mining based on uncertain databases. Knowl. Inf. Syst.
- Performance and characteristic analysis of maximal frequent pattern mining methods using additional factors. Soft Comput.
- Discovering long maximal frequent pattern.
- Frequent pattern discovery with tri-partition alphabets. Inform. Sci.
- Mining top-k useful negative sequential patterns via learning. IEEE Trans. Neural Netw. Learn. Syst.
- NegPSpan: Efficient extraction of negative sequential patterns with embedding constraints. Data Min. Knowl. Discov.
1 Both authors contributed equally to this research.