Elsevier

Information Sciences

Volume 588, April 2022, Pages 124-141

NWP-Miner: Nonoverlapping weak-gap sequential pattern mining

https://doi.org/10.1016/j.ins.2021.12.064

Highlights

  • We restrict gap constraints to match only weak characters, which avoids noise patterns.

  • We propose a complete and efficient algorithm NWP-Miner.

  • NWP-Miner employs depth-first search and backtracking strategies to calculate support.

  • NWP-Miner has better performance than other competitive algorithms.

  • NWP-Miner discovers more meaningful patterns and filters out noise patterns.

Abstract

Nonoverlapping sequential pattern mining (SPM) is a type of SPM with gap constraints that can mine valuable information in sequences. One of the disadvantages of nonoverlapping SPM is that gap constraints can match any character. Hence, there can be a significant difference between the trend of a pattern and those of its occurrences. To tackle this issue, we propose nonoverlapping weak-gap sequential pattern (NWP) mining, where characters are divided into two types: weak and strong. This allows discovering frequent patterns more accurately by limiting the gap constraints to match only weak characters. To discover NWPs, we propose NWP-Miner, which involves two key steps: support calculation and candidate pattern generation. To efficiently calculate the support of candidate patterns, depth-first search and backtracking strategies based on a simplified Nettree structure are adopted, which effectively reduce the time and space complexities of the algorithm. Moreover, a pattern join approach is applied to effectively reduce the number of candidate patterns. The experimental results show that NWP-Miner is more efficient than other competitive algorithms. More importantly, the case study of time series shows that NWP-Miner can effectively filter out noise patterns and discover more meaningful patterns. Algorithms and datasets can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/NWP-Miner.

Introduction

Sequential pattern mining (SPM) can be used to discover valuable subsequences (called patterns) from a large amount of data [7], [45], and has been widely used in many fields, such as outlying sequence data analysis [35], communication networking [2], big data intelligence [15], [37], co-location mining [34], and e-commerce shopping analysis [17]. Classical SPM methods take an ordered list of sets of items (characters) as input [27]. For example, the transaction sequence S=<a(ac)d(acd)> means that a customer purchased item “a” the first time, items “a” and “c” the second time, item “d” the third time, and items “a”, “c”, and “d” the last time. In this example, each element of the sequence is a set of items, such as “(acd)”. Moreover, classical SPM methods generally ignore the fact that a pattern such as “ac” may appear multiple times in the sequence. However, in many cases, such as DNA, protein, virus, and time series data, each element consists of only one item, which means that a sequence is a list of single items [20]. For example, “attaaaggttt” is a segment of SARS-CoV-2. More importantly, it is worth considering that a pattern may appear multiple times in the sequence. Nevertheless, if we only mine continuous subsequences, we will get very little information, whereas if we mine discontinuous subsequences, we will get an excessive number of patterns. To solve this problem, gap constraint SPM [50] was proposed, whose goal is to identify all subsequences that appear frequently by only counting occurrences that satisfy some user-defined gap constraints. A pattern p with gap constraints can be represented as $p = p_1[a,b]p_2 \cdots [a,b]p_j \cdots [a,b]p_m$, where $0 \le a \le b$, and a and b are the minimum and maximum number of wildcards between characters $p_{j-1}$ and $p_j$, respectively [29]. Gap constraints allow users to set a particular gap based on their specific needs, which allows for a more flexible and targeted search. Thus, gap constraint SPM has been used in many different fields, such as biological sequence analysis [38], text corpus mining [26], and feature extraction [47].
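The gap constraint notation above can be made concrete with a minimal sketch. It enumerates every occurrence of a pattern in which the number of characters skipped between consecutive pattern characters lies in [a, b]. The function name `occurrences` and the use of 0-based positions are assumptions made for illustration; this is a brute-force enumeration, not any of the mining algorithms discussed in the paper.

```python
def occurrences(seq, pattern, a, b):
    """Return all position tuples (0-based) where `pattern` occurs in `seq`
    with between `a` and `b` skipped characters between consecutive matches."""
    results = []

    def extend(j, prefix):
        # j is the index of the next pattern character to place.
        if j == len(pattern):
            results.append(tuple(prefix))
            return
        if j == 0:
            candidates = range(len(seq))
        else:
            last = prefix[-1]
            # A gap of g characters puts the next match at last + 1 + g.
            first = last + 1 + a
            final = min(last + 1 + b, len(seq) - 1)
            candidates = range(first, final + 1)
        for pos in candidates:
            if seq[pos] == pattern[j]:
                extend(j + 1, prefix + [pos])

    extend(0, [])
    return results


# The SARS-CoV-2 segment quoted in the text, searched for "a[0,1]t" and "a[0,3]t".
print(occurrences("attaaaggttt", "at", 0, 1))
print(occurrences("attaaaggttt", "at", 0, 3))
```

Widening the gap from [0,1] to [0,3] raises the number of occurrences from two to five on this short segment, which illustrates how the gap bounds control the trade-off between too little and too much information.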

Nonoverlapping SPM is a type of SPM with gap constraints that can effectively mine valuable information in sequences of characters [47], and NOSEP was proposed to mine frequent patterns [45]. A gap constraint [a,b] can match any run of at least a and at most b characters without any restrictions [23]. If the events that users care about are matched by the gap constraint, there may be a significant difference between the trend of a pattern and those of its occurrences. We use time series data as an example to illustrate this issue. If we use NOSEP to mine frequent patterns with gap constraint [0,3] on the closing index of West Texas Intermediate (WTI) from January 1, 2019 to October 2, 2019, we get a frequent pattern “A[0,3]f[0,3]O”, whose corresponding trend is shown in Fig. 1(a). We also select two occurrences of the pattern, shown in Fig. 1(b) and (c), which are two fragments of the time series from April 26, 2019 to May 6, 2019, and from August 9, 2019 to August 20, 2019, respectively. To observe the trend of a time series, it is symbolized according to its gradients [19], [39]. The gradient is calculated as $f_i = (t_{i+1} - t_i)/t_i$, and each $f_i$ is symbolized as a character according to certain rules. For example, if $f_i$ is in the range [1%, 2%), it is symbolized as character ‘A’; in the range [2%, 4%), as character ‘B’; and in the range [32%, ∞), as character ‘F’. The corresponding negative ranges are symbolized as lowercase characters. The two fragments of the time series are thus symbolized as “AOafAO” and “AGfcAEO”, respectively. According to nonoverlapping SPM, the trend in Fig. 1(a) should be similar to those in both Fig. 1(b) and (c), since “AOafAO” and “AGfcAEO” are two occurrences of the pattern. However, the trend of the time series in (a) is similar to that in (b), while it is significantly different from that in (c). The reason is as follows. The fluctuation of the time series in (a) is a small rise, followed by a large drop, followed by a small rise. The fluctuation of the time series in (b) is similar to that in (a). However, the fluctuation of the time series in (c) is significantly different from that in (a), since it is a small rise, followed by a large rise, a large drop, a small drop, a small rise, a large rise, and finally a small rise. This example shows that whether the curves have the same trend is determined by strong characters rather than weak characters, where a strong character corresponds to a large rise or drop in the time series, while a weak character corresponds to a small rise or drop. Hence, strong characters cannot be ignored in gap constraints.
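The gradient-based symbolization described above can be sketched as follows. Only the three bins that the text states explicitly are encoded ([1%, 2%) as ‘A’, [2%, 4%) as ‘B’, and [32%, ∞) as ‘F’); the remaining bins are not given in this excerpt, so any gradient outside those bins is mapped to a hypothetical placeholder "?". The names `gradients`, `symbolize`, and `DOCUMENTED_BINS`, and the sample prices, are assumptions made for illustration.

```python
# Bins stated in the text: (lower bound, upper bound, symbol) on |f_i|.
# Other bins are not given in this excerpt and are deliberately left out.
DOCUMENTED_BINS = [
    (0.01, 0.02, "A"),
    (0.02, 0.04, "B"),
    (0.32, float("inf"), "F"),
]


def gradients(series):
    """Relative change f_i = (t_{i+1} - t_i) / t_i for consecutive values."""
    return [(nxt - cur) / cur for cur, nxt in zip(series, series[1:])]


def symbolize(series, bins=DOCUMENTED_BINS, unknown="?"):
    """Map each gradient to a character; negative gradients become lowercase."""
    symbols = []
    for f in gradients(series):
        magnitude = abs(f)
        symbol = unknown  # hypothetical placeholder for undocumented ranges
        for low, high, char in bins:
            if low <= magnitude < high:
                symbol = char
                break
        if f < 0 and symbol != unknown:
            symbol = symbol.lower()
        symbols.append(symbol)
    return "".join(symbols)


# Invented prices, chosen so that each gradient falls clearly inside one bin.
print(symbolize([100.0, 101.5, 104.5, 102.5, 150.0, 150.5]))
```

Passing a complete bin table through the `bins` parameter would remove the placeholder; the table is kept as a parameter because the full rule set is not part of this excerpt.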

To prevent a pattern with gap constraints from differing significantly from its occurrences, this paper proposes a novel task of nonoverlapping weak-gap sequential pattern (NWP) mining, which requires that the gap constraints can only be matched by weak characters. Since this requirement keeps NWPs from differing significantly from their occurrences, the NWPs mined by our method are more accurate than the patterns mined by existing methods [3], [45]. The main contributions of this paper are as follows.

  • (1) To avoid noise patterns, we address NWP mining, which limits the gap constraints to match only weak characters, and propose a complete and efficient algorithm called NWP-Miner, which involves two key steps: support calculation and candidate pattern generation.

  • (2) To efficiently calculate the support of patterns, NWP-Miner performs depth-first search and backtracking strategies based on a simplified Nettree structure, which reduce the time and space complexities. To generate candidate patterns, a pattern join strategy is applied, which effectively reduces the number of candidate patterns.

  • (3) Experimental results on real-life time series datasets demonstrate that NWP-Miner not only gives better performance than other competitive algorithms, but can also effectively filter out noise patterns and discover more meaningful patterns.
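The weak-gap requirement and the depth-first search with backtracking named in the contributions can be illustrated with a small sketch. This is not the authors' NetWeak algorithm and does not build a Nettree; it is a plain recursive search under three assumptions: positions are 0-based, two occurrences are treated as nonoverlapping when they never use the same sequence position for the same pattern index, and the weak character set {A, a, O, c} is inferred from the description of the figure rather than stated in the text. The count is obtained greedily from left to right and is not claimed to equal the support value computed in the paper. All function names are hypothetical.

```python
def find_occurrence(seq, pattern, a, b, weak, used, start):
    """Depth-first search with backtracking for one occurrence beginning at
    `start`, where every character skipped inside a gap must be weak."""
    path = [start]

    def dfs(j):
        if j == len(pattern):
            return True
        last = path[-1]
        final = min(last + 1 + b, len(seq) - 1)
        for pos in range(last + 1 + a, final + 1):
            gap = seq[last + 1:pos]
            if any(ch not in weak for ch in gap):
                break  # a strong character blocks this gap and all longer ones
            if seq[pos] == pattern[j] and (j, pos) not in used:
                path.append(pos)
                if dfs(j + 1):
                    return True
                path.pop()  # backtrack and try a later position
        return False

    return tuple(path) if dfs(1) else None


def weak_gap_support(seq, pattern, a, b, weak):
    """Greedy left-to-right count of occurrences that do not reuse a
    (pattern index, position) pair."""
    used = set()
    count = 0
    for start, ch in enumerate(seq):
        if ch != pattern[0]:
            continue
        occ = find_occurrence(seq, pattern, a, b, weak, used, start)
        if occ is not None:
            used.update(enumerate(occ))
            count += 1
    return count


WEAK = set("AaOc")  # assumption: small rises and drops from the figure text
print(weak_gap_support("AOafAO", "AfO", 0, 3, WEAK))
print(weak_gap_support("AGfcAEO", "AfO", 0, 3, WEAK))
print(weak_gap_support("AGfcAEO", "AfO", 0, 3, set("AGfcAEO")))
```

Under these assumptions, the fragment “AOafAO” still contains an occurrence of “A[0,3]f[0,3]O”, while “AGfcAEO” does not, because the strong characters ‘G’ and ‘E’ fall inside the gaps. Treating every character as weak, as in the last call, recovers the unrestricted gap behaviour and accepts the second fragment again.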

The remainder of the paper is organized as follows. Related work is summarized in Section 2, and problem definition is given in Section 3. Section 4 describes the proposed NWP-Miner scheme and illustrates the principle of the algorithm through some examples. Section 5 reports the performance of NWP-Miner, and Section 6 concludes the paper.

Section snippets

Related work

SPM has been widely applied in many fields [9]. A variety of mining methods have been proposed to meet a range of different needs, such as negative SPM [4], closed SPM [5], [48], contrast SPM [12], periodic pattern mining [8], and high utility SPM [28], [32]. For example, Qiu et al. [25] presented an efficient method for modeling nonoccurring behaviors by negative SPM. Bai et al. [1] explored an efficient incremental algorithm to discover historical moments in sequence data. Fournier-Viger et

Problem Definition

This section defines key concepts and formally introduces the problem of NWP mining.

Definition 1 Sequence

A sequence with length n is described as $s = s_1 \cdots s_i \cdots s_n$, where $s_i \in \Sigma$ $(1 \le i \le n)$, $\Sigma$ represents the set of items, and the size of $\Sigma$ is expressed as $|\Sigma|$.

It is worth noting that the definition of sequence in this paper is different from that in classical SPM. In classical SPM, the sequence is defined as a series of sets of items. However, the sequence in this paper is a set of items, and can be regarded as a special case of

Algorithm design

There are two essential tasks in NWP mining: support calculation and candidate pattern generation. Section 4.1 briefly reviews NETGAP as a method of calculating the support [45]. However, NETGAP is relatively inefficient, and Section 4.2 therefore presents our NetWeak algorithm, which employs depth-first and backtracking strategies to calculate the support more efficiently. Section 4.3 discusses the use of a pattern join strategy to generate candidate patterns, and Section 4.4 proposes the

Experimental analysis

Section 5.1 presents the benchmark datasets and a pre-processing method for converting time series data into character sequences. Section 5.2 introduces the competitive algorithms considered here. Since NWP-Miner involves two key steps, support calculation and candidate pattern generation, Section 5.3 validates the efficiency of the two key steps. Section 5.4 employs some state-of-the-art methods to further evaluate the mining performance. Section 5.5 presents a case study based on data from the Dow

Conclusion

Gap constraint SPM is more flexible than other approaches, since it allows users to set a gap according to their specific needs. The patterns mined in this way may contain some noise, since the gap can match any character. To tackle this issue, we propose a method of NWP mining that uses a weak-gap to make it only match weak characters, which can therefore mine the required patterns more accurately. To mine these NWPs, we propose the NWP-Miner algorithm, which adopts the Netweak algorithm to

CRediT authorship contribution statement

Youxi Wu: Conceptualization, Methodology, Formal analysis, Funding acquisition. Zhu Yuan: Software, Writing - original draft, Validation, Data curation. Yan Li: Investigation, Supervision, Writing - review & editing. Lei Guo: Validation, Resources. Philippe Fournier-Viger: Validation, Writing - review & editing. Xindong Wu: Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was partly supported by National Natural Science Foundation of China (61976240, 52077056, 917446209), National Key Research and Development Program of China (2016YFB1000901), and Natural Science Foundation of Hebei Province, China (Nos. F2020202013, E2020202033).

References (50)

  • X. Wu et al. PMBC: Pattern mining from biological sequences with wildcard constraints. Computers in Biology and Medicine (2013)
  • Y. Wu et al. HANP-Miner: High average utility nonoverlapping sequential pattern mining. Knowledge-Based Systems (2021)
  • Y. Wu et al. HAOP-Miner: Self-adaptive high-average utility one-off sequential pattern mining. Expert Systems With Applications (2021)
  • Y. Wu et al. NetNCSP: Nonoverlapping closed sequential pattern mining. Knowledge-Based Systems (2020)
  • U. Yun et al. Mining maximal frequent patterns by considering weight conditions over data streams. Knowledge-Based Systems (2014)
  • R. Bai et al. Historic moments discovery in sequence data. ACM Transactions on Database Systems (2019)
  • Y. Chen et al. Efficient mining of frequent patterns on uncertain graphs. IEEE Transactions on Knowledge and Data Engineering (2019)
  • B. Ding et al. Efficient mining of closed repetitive gapped subsequences from a sequence database, in
  • X. Dong et al. e-RNSP: An efficient method for mining repetition negative sequential patterns. IEEE Transactions on Cybernetics (2020)
  • P. Fournier-Viger et al. SPMF: A java open-source pattern mining library. Journal of Machine Learning Research (2014)
  • W. Gan et al. A survey of parallel sequential pattern mining. ACM Transactions on Knowledge Discovery from Data (2019)
  • D. Guo et al. Pattern matching with wildcards and gap-length constraints based on a centrality-degree graph. Applied Intelligence (2013)
  • F. Heimerl et al. Word Cloud Explorer: Text analytics based on word clouds
  • Y. Huang et al. Mining frequent patterns with gaps and one-off condition
  • H. Jiang et al. Toward better summarizing bug reports with crowdsourcing elicited attribute. IEEE Transactions on Reliability (2019)