HAOP-Miner: Self-adaptive high-average utility one-off sequential pattern mining
Introduction
Sequential pattern mining (SPM) (He et al., 2019, Wu, Zhu et al., 2020) is a type of data mining method in which sub-sequences (also known as patterns) are discovered from sequences (Fournier-Viger, Li, Lin, Kiran, & Fujita, 2019). This approach has been widely applied in many fields, such as big data mining (Wu, Zhu, Wu, & Ding, 2013), big data intelligence (Liu et al., 2019, Wu and Wu, 2019), inspection reports (Jiang et al., 2018, Jiang et al., 2019), e-commerce shopping analysis (Le et al., 2015, Nam et al., 2020), and biological sequence analysis (Wang, Hou, & Wang, 2018). Many SPM methods have been proposed to meet a range of requirements, such as contrast SPM (Wu et al., 2021, Ghosh et al., 2017), negative SPM (Dong et al., 2020, Dong et al., 2019, Chen et al., 2019), tri-partition pattern mining (Min, Zhang, Zhai, & Shen, 2020), high utility pattern mining (Choi and Park, 2019, Gan, Lin, Fournier-Viger et al., 2020, Kim et al., 2021) and gap constraint SPM (Wu et al., 2018, Zhang et al., 2007). One of the disadvantages of traditional SPM is that it neglects repetition within a sequence. For example, pattern “AC” occurs in sequence “ABC”. Thus, the support (number of occurrences) of pattern “AC” in sequence “ABC” is 1 according to traditional SPM. However, pattern “AC” occurs more than once in sequence “AACCACC”. If the support of pattern “AC” in this sequence is also counted as 1, the repetition is neglected. To solve this issue and avoid mining some meaningless patterns, gap constraint SPM was proposed. A gap constraint pattern places a gap [a, b] between consecutive items, where a and b represent the minimum and maximum numbers of wildcards, respectively. For example, pattern “C[1,2]G” means that there are one or two wildcards between “C” and “G”. Thus, pattern “C[1,2]G” occurs in sequence “CAG”, but does not occur in sequence “CG”.
Since patterns can be flexibly matched, gap constraint SPM has been used in many applications such as time series (Huang et al., 2019, Miao et al., 2016, Sumalatha and Subramanyam, 2020), biological sequence retrieval (Ghosh et al., 2017) and feature selection (Wei, Xing, Shi, Ji, & Zou, 2017).
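The matching rule for gap constraints can be sketched in a few lines of Python. This is only an illustrative enumeration of occurrences, not an algorithm from the paper; the function name `gap_occurrences` and its parameters are hypothetical, and a single gap [min_gap, max_gap] is assumed between every pair of consecutive pattern items.

```python
def gap_occurrences(pattern, seq, min_gap, max_gap):
    """Enumerate occurrences of `pattern` in `seq` as tuples of 0-based positions.

    Assumption: the same gap [min_gap, max_gap] (number of wildcards) applies
    between every pair of consecutive pattern items.
    """
    results = []

    def extend(prefix):
        j = len(prefix)  # index of the next pattern item to place
        if j == len(pattern):
            results.append(tuple(prefix))
            return
        if j == 0:
            candidates = range(len(seq))
        else:
            last = prefix[-1]
            lo = last + 1 + min_gap                       # skip at least min_gap characters
            hi = min(last + 1 + max_gap, len(seq) - 1)    # skip at most max_gap characters
            candidates = range(lo, hi + 1)
        for pos in candidates:
            if seq[pos] == pattern[j]:
                extend(prefix + [pos])

    if pattern:
        extend([])
    return results


# "C[1,2]G" occurs in "CAG" but not in "CG".
print(gap_occurrences("CG", "CAG", 1, 2))
print(gap_occurrences("CG", "CG", 1, 2))
# With a loose gap, "AC" has several occurrences in "AACCAC".
print(len(gap_occurrences("AC", "AACCAC", 0, 5)))
```

Enumerating every occurrence grows quickly with sequence length, so the sketch is meant for checking small examples by hand, not for mining.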
One-off SPM (Huang et al., 2009, Wu, Xie et al., 2013) is a branch of gap constraint SPM that mines frequent patterns from sequences under the one-off condition, which requires that each letter in the sequence be used at most once. An illustrative example is given below. Example 1 Suppose we have sequence , pattern . From Fig. 1, it can be seen that there are five occurrences of pattern in sequence s. The sub-sequence is an occurrence of pattern that can be written as . Similarly, the other four occurrences are , , and . Occurrences and do not satisfy the one-off condition since is used twice. However, occurrences and satisfy the one-off condition since they share no character in the sequence. There are three occurrences , and under the one-off condition. Hence, the support of pattern in sequence s is 3.
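The one-off condition can be illustrated with a small greedy sketch. It is a simple leftmost-first heuristic written for this illustration, not the support-calculation algorithm proposed in the paper, and it may undercount in some cases. The function name, the sequence “AACCAC” and the gap [0, 3] are hypothetical and are not the data of Example 1.

```python
def one_off_occurrences_greedy(pattern, seq, min_gap, max_gap):
    """Greedily collect occurrences that share no sequence position.

    Heuristic sketch: repeatedly take the leftmost occurrence that uses only
    unused positions, then mark its positions as used.
    """
    if not pattern:
        return []
    used = [False] * len(seq)

    def find(j, start, stop):
        # Place pattern[j] somewhere in seq[start:stop], then match the rest.
        for pos in range(start, min(stop, len(seq))):
            if used[pos] or seq[pos] != pattern[j]:
                continue
            if j == len(pattern) - 1:
                return [pos]
            rest = find(j + 1, pos + 1 + min_gap, pos + 2 + max_gap)
            if rest is not None:
                return [pos] + rest
        return None

    occurrences = []
    while True:
        occ = find(0, 0, len(seq))
        if occ is None:
            break
        for pos in occ:
            used[pos] = True  # each character may be used at most once
        occurrences.append(tuple(occ))
    return occurrences


occs = one_off_occurrences_greedy("AC", "AACCAC", 0, 3)
print(occs, "support =", len(occs))
```

In this toy input the three selected occurrences use six distinct positions, so the one-off support found by the heuristic is 3.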
Current research on one-off SPM ignores the utility (such as price or profit) of items. For example, although a gene may not appear frequently, its behaviour may be extremely remarkable. If we only consider the number of occurrences of patterns, these highly expressed genes will not be mined (Morteza, Heidar, & Aijun, 2017). We also employ Example 1 to illustrate that high utility pattern mining is more meaningful. The utility of each item is shown in Table 1. The utility of pattern = A[0, 3]G in is since the utilities of “A” and “G” are 1 and 2, respectively. Similarly, the one-off support of pattern = A[0, 3]C[0, 3]G in sequence is 2. The utility of pattern is . This example shows that the frequency of pattern is greater than that of , while the utility of pattern is less than that of . Therefore, high utility one-off SPM is worth investigating, and the following two issues should be considered.
- (1) Gap setting. It is very difficult to set suitable gap constraints without prior knowledge, which makes it challenging to discover valuable patterns (Wang, Duan et al., 2016). In Example 1, there is no occurrence for pattern , which means that improper gap constraints will lead to mining failure.
- (2) Length of the pattern. As the length of the pattern increases, its utility also increases (Lin, Li, Pirouz, Zhang, & Fournier-Viger, 2020). For instance, in Example 1, although G appears only once in sequence is also a high utility pattern since . Obviously, this is not reasonable.
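The length effect in issue (2) can be made concrete with a small numerical sketch. It assumes, for illustration only and not as the paper's formal definition, that the utility of a pattern is its one-off support multiplied by the sum of its item utilities, and that an average utility divides this value by the pattern length. The utilities of “A” and “G” (1 and 2) follow the text above; the utility of “C” and the supports passed in are hypothetical inputs chosen for the sketch.

```python
def pattern_utility(items, support, utility):
    """Assumed definition: support times the sum of the item utilities."""
    return support * sum(utility[x] for x in items)


def average_utility(items, support, utility):
    """Assumed definition: pattern utility divided by the number of items."""
    return pattern_utility(items, support, utility) / len(items)


# Utilities of A and G come from the text; the value for C is hypothetical.
utility = {"A": 1, "G": 2, "C": 3}

short = ("AG", 3)   # shorter pattern, higher support
long_ = ("ACG", 2)  # longer pattern, lower support

for items, support in (short, long_):
    print(items,
          "utility =", pattern_utility(items, support, utility),
          "average utility =", average_utility(items, support, utility))
```

Under these assumed values the longer pattern has the larger total utility, while dividing by the length reverses the ranking, which is the kind of length bias an average-utility measure is meant to correct.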
To solve these issues, this paper proposes self-adaptive High-Average utility One-off sequential Pattern (HAOP) mining with the following characteristics: (1) any two occurrences cannot share any letter in the sequence; (2) the support, utility and length of the pattern are considered simultaneously; (3) the method discovers patterns with a self-adaptive gap, which means that users do not need to set the gap constraints. The main contributions of this paper are as follows.
- (1) This paper addresses self-adaptive HAOP mining to discover HAOPs and proposes an effective algorithm called HAOP-Miner that contains two key steps: support calculation and candidate pattern generation.
- (2) For the support calculation, we propose a Reverse filling (Rf) strategy, which calculates the support effectively since it avoids creating redundant nodes and does not need to prune redundant and useless nodes after finding an occurrence.
- (3) HAOP mining does not satisfy the Apriori property, and thus a support lower bound method combined with a pattern growth strategy is proposed to prune the candidate patterns effectively.
- (4) The experimental results validate the effectiveness of HAOP-Miner, and demonstrate that HAOP-Miner has better performance than other state-of-the-art algorithms. More importantly, HAOP-Miner makes it easier to mine valuable patterns.
The remainder of this paper is organised as follows. Section 2 introduces related work. Section 3 defines the problem considered here. Section 4 designs the Rf strategy to calculate the support, describes the high lower bound pattern, which satisfies the Apriori property and is used to prune the candidate patterns effectively, and then proposes the HAOP-Miner algorithm. Section 5 reports the results of experiments on biological sequences and a sales dataset. Finally, Section 6 presents the conclusion of this paper.
Section snippets
Related work
SPM has been widely applied in various fields, such as event logs (Dalmas et al., 2017, Fournier-Viger et al., 2020), data streams (Chen, Xiao, Xin, Lin, & Lin, 2018), transaction databases (Karim, Cochez, Beyan, Ahmed, & Decker, 2018) and biological sequences (Wu, Zhu, He, & Arslan, 2013). Frequent pattern mining considers the support (number of occurrences) of patterns, but ignores the effect of utility (such as price or profit) on patterns (Yun et al., 2018, Gan, Lin, Zhang et al., 2020),
Problem definition
In this section, we formally define HAOP mining. Definition 1 A sequence is described by , where represents the set of items in sequence , and the size of can be expressed as . Since this paper uses a self-adaptive gap, we define pattern as , where is the traditional wildcard, meaning that any letter can appear between the letters in the pattern. Definition 2 is an occurrence of pattern in sequence , if and only if , where .
Proposed algorithm
Fig. 2 shows the overall workflow of HAOP-Miner. HAOP-Miner has two key steps: support calculation and candidate pattern generation. Section 4.1 shows that the support calculation is an NP-hard problem. To calculate the pattern support, Section 4.2 explores a heuristic algorithm named the Positive filling (Pf) algorithm, which is easier to understand at first. To overcome the shortcomings of Pf, Section 4.3 proposes a more effective heuristic method named the Rf algorithm. Section 4.4 employs a
Experimental results and analysis
We introduce the benchmark datasets in Section 5.1. To evaluate the performance of our approach, we also propose some competitive algorithms, whose principles are introduced in Section 5.2. Section 5.3 shows the efficiency of HAOP-Miner. Section 5.4 validates the running performance of HAOP-Miner. Mining performance is evaluated in Section 5.5. Section 5.6 further reports the application to the BABYSALE dataset.
Conclusion
In order to consider the utility of items and to solve the problem of setting gaps without prior knowledge, this paper explores self-adaptive HAOP mining, which can discover some extremely important but low-frequency patterns. Self-adaptive HAOP mining, as a kind of repetitive SPM (or sequence pattern mining), considers the support (number of occurrences), the utility and the pattern length simultaneously. This paper proposes an efficient algorithm named HAOP-Miner, which has two key steps: support
CRediT authorship contribution statement
Youxi Wu: Conceptualization, Methodology, Formal analysis, Supervision, Funding acquisition. Rong Lei: Software, Writing - original draft, Validation, Investigation, Data curation. Yan Li: Investigation, Writing - review & editing. Lei Guo: Validation, Resources. Xindong Wu: Supervision, Funding acquisition.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was partly supported by the National Natural Science Foundation of China (61976240, 52077056, 917446209), the National Key Research and Development Program of China (2016YFB1000901), and the Natural Science Foundation of Hebei Province, China (Nos. F2020202013, E2020202033).
References (61)
- et al. Emerging topic detection in twitter stream based on high utility pattern mining. Expert Systems with Applications (2019)
- et al. TWINCLE: A constrained sequential rule mining algorithm for event logs. Procedia Computer Science (2017)
- et al. Efficient algorithms to identify periodic patterns in multiple sequences. Information Sciences (2019)
- et al. Septic shock prediction for ICU patients via coupled HMM walking on sequential contrast patterns. Journal of Biomedical Informatics (2017)
- et al. Significance-based discriminative sequential pattern mining. Expert Systems with Applications (2019)
- et al. Mining maximal frequent patterns in transactional databases and dynamic data streams: a spark-based approach. Information Sciences (2018)
- et al. Efficient list based mining of high average utility patterns with maximum average pruning strategies. Information Sciences (2021)
- et al. Efficient mining of extraordinary patterns by pruning and predicting. Expert Systems with Applications (2019)
- et al. Predefined pattern detection in large time series. Information Sciences (2016)
- et al. Frequent pattern discovery with tri-partition alphabets. Information Sciences (2020)
- Distributed mining of high utility time interval sequential patterns using mapreduce approach. Expert Systems with Applications
- On the complexity of iterated shuffle. Journal of Computer and System Sciences
- PMBC: Pattern mining from biological sequences with wildcard constraints. Computers in Biology and Medicine
- Efficient sequential pattern mining with wildcards for keyphrase extraction. Knowledge-Based Systems
- Damped window based high average utility pattern mining over data streams. Knowledge-Based Systems
- An efficient algorithm for mining high utility patterns from incremental databases with one database scan. Knowledge-Based Systems
- A novel approach for mining high-utility sequential patterns in sequence databases. ETRI Journal
- Sentiment classification using negative and intensive sentiment supplement information. Data Science and Engineering
- Efficient mining of closed repetitive gapped subsequences from a sequence database
- e-RNSP: An efficient method for mining repetition negative sequential patterns. IEEE Transactions on Cybernetics
- Mining cost-effective patterns in event logs. Knowledge-Based Systems
- HUOPM: High-utility occupancy pattern mining. IEEE Transactions on Cybernetics
- Utility mining across multi-sequences with individualized thresholds. ACM/IMS Transactions on Data Science
- Word cloud explorer: Text analytics based on word clouds
- Mining frequent and top-K high utility time interval-based events with duration patterns. Knowledge and Information Systems
- Mining frequent patterns with gaps and one-off condition
- An efficient tree-based algorithm for mining high average-utility itemset. IEEE Access
- Fuzzy clustering of crowdsourced test reports for apps. ACM Transactions on Internet Technology (TOIT)