NWP-Miner: Nonoverlapping weak-gap sequential pattern mining
Introduction
Sequential pattern mining (SPM) can be used to discover valuable subsequences (called patterns) from a large amount of data [7], [45], and has been widely used in many fields, such as outlying sequence data analysis [35], communication networking [2], big data intelligence [15], [37], co-location mining [34], and e-commerce shopping analysis [17]. Classical SPM methods take ordered lists of sets of items (characters) as input [27]. For example, the transaction sequence (a)(ac)(d)(acd) means that a customer purchased item “a” on the first visit, items “a” and “c” on the second, item “d” on the third, and items “a”, “c”, and “d” on the last. In this example, each element of the sequence is a set of items, such as “(acd)”. Moreover, classical SPM methods generally ignore the fact that a pattern such as “ac” may appear multiple times in the sequence.
However, in many cases, such as DNA, protein, virus, and time series data, each element consists of a single item, which means that a sequence is a list of single items [20]. For example, “attaaaggttt” is a segment of SARS-CoV-2. More importantly, it is worthwhile to consider that a pattern may appear multiple times in the sequence. If we mine only continuous subsequences, we obtain very little information; if we mine discontinuous subsequences, we obtain an excessive number of patterns. To solve this problem, gap constraint SPM [50] was proposed, whose goal is to identify all subsequences that appear frequently by counting only the occurrences that satisfy user-defined gap constraints. A pattern with gap constraints can be represented as $p = p_1[a,b]p_2 \cdots [a,b]p_m$, where $0 \le a \le b$, and a and b are the minimum and maximum numbers of wildcards between characters $p_j$ and $p_{j+1}$, respectively [29]. Gap constraints allow users to set a particular gap based on their specific needs, which allows for a more flexible and targeted search. Thus, gap constraint SPM has been used in many different fields, such as biological sequence analysis [38], text corpus mining [26], and feature extraction [47].
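To make the gap constraint concrete, the following minimal sketch enumerates every occurrence of a pattern whose consecutive characters are separated by between a and b wildcards. The function name `gap_occurrences` and the 0-based position tuples are illustrative assumptions, not notation from the paper, and the sketch assumes a non-empty pattern.

```python
def gap_occurrences(seq, pattern, a, b):
    """Return every occurrence of `pattern` in `seq` as a tuple of 0-based positions.

    Between two consecutive pattern characters, at least `a` and at most `b`
    sequence characters (wildcards) may be skipped.
    """
    # Start with every position that matches the first pattern character.
    partial = [(i,) for i, ch in enumerate(seq) if ch == pattern[0]]
    for want in pattern[1:]:
        extended = []
        for occ in partial:
            last = occ[-1]
            # Candidate positions leave between a and b characters in the gap.
            first_pos = last + 1 + a
            last_pos = min(last + 1 + b, len(seq) - 1)
            for pos in range(first_pos, last_pos + 1):
                if seq[pos] == want:
                    extended.append(occ + (pos,))
        partial = extended
    return partial


if __name__ == "__main__":
    # Pattern "a[0,1]g" on the SARS-CoV-2 segment quoted in the text.
    print(gap_occurrences("attaaaggttt", "ag", 0, 1))
```

On the segment “attaaaggttt”, the pattern a[0,1]g has three occurrences under this definition, which illustrates that a single pattern can occur several times in one sequence.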
Nonoverlapping SPM is a type of SPM with gap constraints that can effectively mine valuable information in sequences of characters [47], and NOSEP was proposed to mine such frequent patterns [45]. A gap constraint [a, b] can be matched by any a to b characters, without any restriction [23]. If the events that concern users are matched by the gap constraint, the trend of a pattern may differ significantly from the trends of its occurrences. To clarify this issue, we use time series data as an example. If we use NOSEP to mine frequent patterns with gap constraint [0,3] on the closing index of West Texas Intermediate (WTI) from January 1, 2019 to October 2, 2019, we obtain the frequent pattern “A[0,3]f[0,3]O”, whose corresponding trend is shown in Fig. 1(a). We also select two occurrences of the pattern, shown in Fig. 1(b) and (c), which are fragments of the time series from April 26, 2019 to May 6, 2019, and from August 9, 2019 to August 20, 2019, respectively. To observe the trend of a time series, it is symbolized according to its gradients [19], [39]: each gradient is calculated and then symbolized as a character according to certain rules. For example, a gradient in the range [1%, 2%) is symbolized as the character ‘A’, a gradient in the range [2%, 4%) as ‘B’, and a gradient in the range [32%, inf) as ‘F’. A gradient in the corresponding negative range is symbolized as the lowercase character. Therefore, the two fragments are symbolized as “AOafAO” and “AGfcAEO”, respectively. According to nonoverlapping SPM, the trend in Fig. 1(a) should be similar to those in both Fig. 1(b) and (c), since “AOafAO” and “AGfcAEO” are both occurrences of the pattern. However, the trend of the time series in (a) is similar to that in (b) but significantly different from that in (c). The reason is as follows.
The fluctuation of the time series in (a) is a small rise, followed by a large drop, and then a small rise. The fluctuation of the time series in (b) is similar to that in (a). However, the fluctuation of the time series in (c) is significantly different from that in (a), since it is a small rise, followed by a large rise, a large drop, a small drop, a small rise, a large rise, and finally a small rise. This example shows that whether two curves have the same trend is determined by strong characters rather than weak characters, where a strong character corresponds to a large rise or drop in the time series, while a weak character corresponds to a small rise or drop. Hence, strong characters cannot be ignored in gap constraints.
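The observation above can be checked mechanically. The sketch below tests whether every character skipped inside the gaps of a given occurrence is weak. The helper name `gaps_are_weak` and the weak set `WEAK` are assumptions made for illustration only: `WEAK` is chosen to agree with how the text describes the two fragments (small rises and drops), not taken from the paper's actual partition of characters.

```python
# Assumed weak characters, consistent with the "small rise/drop" description above.
WEAK = set("OAac")


def gaps_are_weak(seq, occurrence, weak=WEAK):
    """Return True if all characters strictly between matched positions are weak.

    `occurrence` is a tuple of increasing 0-based positions in `seq`.
    """
    for left, right in zip(occurrence, occurrence[1:]):
        skipped = seq[left + 1:right]  # characters consumed by the gap
        if any(ch not in weak for ch in skipped):
            return False
    return True


if __name__ == "__main__":
    # Occurrences of "A[0,3]f[0,3]O" in the two fragments from the text.
    print(gaps_are_weak("AOafAO", (0, 3, 5)))   # gaps skip "Oa" and "A"
    print(gaps_are_weak("AGfcAEO", (0, 2, 6)))  # gaps skip "G" and "cAE"
```

Under this assumed weak set, the occurrence in “AOafAO” is accepted, while the occurrence in “AGfcAEO” is rejected because its gaps skip the strong characters ‘G’ and ‘E’.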
To prevent a pattern with gap constraints from differing significantly from its occurrences, this paper proposes a novel task of nonoverlapping weak-gap sequential pattern (NWP) mining, which requires that the gap constraints can only be matched by weak characters. Since this requirement keeps NWPs from differing significantly from their occurrences, the NWPs mined by our method are more accurate than the patterns mined by existing methods [3], [45]. The main contributions of this paper are as follows.
- (1) To avoid noise patterns, we address NWP mining, which restricts the gap constraints so that they match only weak characters, and propose a complete and efficient algorithm called NWP-Miner, which involves two key steps: support calculation and candidate pattern generation.
- (2) To calculate the support of patterns efficiently, NWP-Miner applies depth-first search and backtracking strategies based on a simplified Nettree structure, which reduces the time and space complexities. To generate candidate patterns, a pattern join strategy is applied, which effectively reduces the number of candidate patterns.
- (3) Experimental results on real-life time series datasets demonstrate that NWP-Miner not only gives better performance than other competitive algorithms, but can also effectively filter out noise patterns and discover more meaningful patterns.
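As an illustration of candidate generation by pattern join, the sketch below joins two frequent patterns of length m into one candidate of length m+1 whenever the suffix of the first equals the prefix of the second. The text above only names the strategy; this suffix/prefix rule is the form commonly used in related work and is an assumption here, as is the function name `pattern_join`.

```python
from collections import defaultdict


def pattern_join(frequent):
    """Generate length-(m+1) candidates from frequent patterns of length m.

    Patterns p and q are joined into p + q[-1] when p without its first
    character equals q without its last character.
    """
    # Index patterns by their prefix (all but the last character).
    by_prefix = defaultdict(list)
    for q in frequent:
        by_prefix[q[:-1]].append(q)

    candidates = []
    for p in frequent:
        suffix = p[1:]  # all but the first character
        for q in by_prefix.get(suffix, []):
            candidates.append(p + q[-1])
    return sorted(candidates)


if __name__ == "__main__":
    print(pattern_join(["Af", "fO", "fA"]))
```

Compared with appending every character of the alphabet to every frequent pattern, this join only produces candidates whose length-m prefix and suffix are both frequent, which is why it yields fewer candidates.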
The remainder of the paper is organized as follows. Related work is summarized in Section 2, and problem definition is given in Section 3. Section 4 describes the proposed NWP-Miner scheme and illustrates the principle of the algorithm through some examples. Section 5 reports the performance of NWP-Miner, and Section 6 concludes the paper.
Section snippets
Related work
SPM has been widely applied in many fields [9]. A variety of mining methods have been proposed to meet a range of different needs, such as negative SPM [4], closed SPM [5], [48], contrast SPM [12], periodic pattern mining [8], and high utility SPM [28], [32]. For example, Qiu et al. [25] presented an efficient method for modeling nonoccurring behaviors by negative SPM. Bai et al. [1] explored an efficient incremental algorithm to discover historical moments in sequence data. Fournier-Viger et
Problem Definition
This section defines key concepts and formally introduces the problem of NWP mining.
Definition 1 (Sequence). A sequence with length n is described as $s = s_1 s_2 \cdots s_n$ $(s_i \in \Sigma,\ 1 \le i \le n)$, where $\Sigma$ represents the set of items, and the size of $\Sigma$ is expressed as $|\Sigma|$.
It is worth noting that the definition of a sequence in this paper is different from that in classical SPM. In classical SPM, a sequence is defined as a series of sets of items. However, a sequence in this paper is a series of single items, and can be regarded as a special case of
Algorithm design
There are two essential tasks in NWP mining: support calculation and candidate pattern generation. Section 4.1 briefly reviews NETGAP as a method of calculating the support [45]. However, NETGAP is relatively inefficient, and Section 4.2 therefore presents our NetWeak algorithm, which employs depth-first and backtracking strategies to calculate the support more efficiently. Section 4.3 discusses the use of a pattern join strategy to generate candidate patterns, and Section 4.4 proposes the
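To illustrate support calculation with depth-first search and backtracking, the following simplified sketch counts occurrences greedily from left to right. It assumes the usual nonoverlapping condition (a sequence position may not be matched twice by the same pattern position) and requires gap characters to be weak. It does not build a Nettree and is not claimed to reproduce NETGAP or NetWeak; the function name and the `weak` parameter are hypothetical.

```python
def nonoverlapping_support(seq, pattern, a, b, weak):
    """Greedy count of nonoverlapping weak-gap occurrences of `pattern` in `seq`."""
    m = len(pattern)
    # used[j] holds sequence positions already matched to pattern[j].
    used = [set() for _ in range(m)]

    def extend(j, prev):
        """Match pattern[j:] after position prev; return positions or None."""
        if j == m:
            return []
        last_pos = min(prev + 1 + b, len(seq) - 1)
        for pos in range(prev + 1 + a, last_pos + 1):
            # A strong character in the gap also blocks every farther position.
            if any(ch not in weak for ch in seq[prev + 1:pos]):
                break
            if seq[pos] != pattern[j] or pos in used[j]:
                continue
            rest = extend(j + 1, pos)  # depth-first step
            if rest is not None:
                return [pos] + rest
            # Otherwise backtrack and try the next position.
        return None

    count = 0
    for start, ch in enumerate(seq):
        if ch != pattern[0]:
            continue
        rest = extend(1, start)
        if rest is not None:
            for j, pos in enumerate([start] + rest):
                used[j].add(pos)
            count += 1
    return count


if __name__ == "__main__":
    seq = "AOafAO" + "AGfcAEO"  # the two fragments from the introduction, concatenated
    print(nonoverlapping_support(seq, "AfO", 0, 3, weak=set("OAac")))
    print(nonoverlapping_support(seq, "AfO", 0, 3, weak=set(seq)))
```

With the assumed weak set {O, A, a, c}, only the occurrence inside “AOafAO” is counted; when every character is allowed in a gap, a second occurrence that skips the strong characters ‘G’ and ‘E’ is counted as well.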
Experimental analysis
Section 5.1 presents a pre-processing method of converting time series data into character sequences and benchmark datasets. Section 5.2 introduces the competitive algorithms considered here. Since NWP-Miner involves two key steps, support calculation and candidate pattern generation, Section 5.3 validates the efficiency of the two key steps. Section 5.4 employs some state-of-the-art methods to further evaluate the mining performance. Section 5.5 presents a case study based on data from the Dow
Conclusion
Gap constraint SPM is more flexible than other approaches, since it allows users to set a gap according to their specific needs. However, the patterns mined in this way may contain some noise, since the gap can match any character. To tackle this issue, we propose a method of NWP mining that uses a weak gap, which matches only weak characters and can therefore mine the required patterns more accurately. To mine these NWPs, we propose the NWP-Miner algorithm, which adopts the NetWeak algorithm to
CRediT authorship contribution statement
Youxi Wu: Conceptualization, Methodology, Formal analysis, Funding acquisition. Zhu Yuan: Software, Writing - original draft, Validation, Data curation. Yan Li: Investigation, Supervision, Writing - review & editing. Lei Guo: Validation, Resources. Philippe Fournier-Viger: Validation, Writing - review & editing. Xindong Wu: Supervision, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was partly supported by the National Natural Science Foundation of China (61976240, 52077056, 917446209), the National Key Research and Development Program of China (2016YFB1000901), and the Natural Science Foundation of Hebei Province, China (Nos. F2020202013, E2020202033).
References (50)
- et al. Mining closed partially ordered patterns, a new optimized algorithm. Knowledge-Based Systems (2015)
- et al. Mining significant trend sequences in dynamic attributed graphs. Knowledge-Based Systems (2019)
- et al. Mining local periodic patterns in a discrete sequence. Information Sciences (2021)
- et al. ProUM: Projection-based utility mining on sequence data. Information Sciences (2020)
- et al. Mining conditional discriminative sequential patterns. Information Sciences (2019)
- et al. Frequent pattern discovery with tri-partition alphabets. Information Sciences (2020)
- et al. SAX-ARM: Deviant event pattern discovery from multivariate time series using symbolic aggregate approximation and association rule mining. Expert Systems with Applications (2020)
- et al. Distributed mining of high utility time interval sequential patterns using mapreduce approach. Expert Systems with Applications (2020)
- et al. Efficient high average-utility itemset mining using novel vertical weak upper-bounds. Knowledge-Based Systems (2019)
- et al. On the complexity of iterated shuffle. Journal of Computer and System Sciences (1984)
- PMBC: Pattern mining from biological sequences with wildcard constraints. Computers in Biology and Medicine
- HANP-Miner: High average utility nonoverlapping sequential pattern mining. Knowledge-Based Systems
- HAOP-Miner: Self-adaptive high-average utility one-off sequential pattern mining. Expert Systems with Applications
- NetNCSP: Nonoverlapping closed sequential pattern mining. Knowledge-Based Systems
- Mining maximal frequent patterns by considering weight conditions over data streams. Knowledge-Based Systems
- Historic moments discovery in sequence data. ACM Transactions on Database Systems
- Efficient mining of frequent patterns on uncertain graphs. IEEE Transactions on Knowledge and Data Engineering
- Efficient mining of closed repetitive gapped subsequences from a sequence database, in
- e-RNSP: An efficient method for mining repetition negative sequential patterns. IEEE Transactions on Cybernetics
- SPMF: A java open-source pattern mining library. Journal of Machine Learning Research
- A survey of parallel sequential pattern mining. ACM Transactions on Knowledge Discovery from Data
- Pattern matching with wildcards and gap-length constraints based on a centrality-degree graph. Applied Intelligence
- Word Cloud Explorer: Text analytics based on word clouds
- Mining frequent patterns with gaps and one-off condition
- Toward better summarizing bug reports with crowdsourcing elicited attribute. IEEE Transactions on Reliability