HAOP-Miner: Self-adaptive high-average utility one-off sequential pattern mining

https://doi.org/10.1016/j.eswa.2021.115449

Highlights

  • We address self-adaptive HAOP mining, which can discover extremely important patterns.

  • We propose the HAOP-Miner algorithm that contains two key steps.

  • HAOP-Miner employs an online Reverse filling strategy to calculate the support.

  • HAOP-Miner adopts an Apriori-like strategy to prune the candidate patterns.

  • HAOP-Miner is highly efficient and makes it easier to find valuable patterns.

Abstract

One-off sequential pattern mining (SPM) (or SPM under the one-off condition) is a kind of repetitive SPM with gap constraints, and has been widely applied in many fields. However, current research on one-off SPM ignores the utility (which can be price or profit) of items, resulting in some low-frequency but extremely important patterns being ignored. To solve this issue, this paper addresses self-adaptive High-Average utility One-off sequential Pattern (HAOP) mining, which has the following three characteristics. First, any two occurrences cannot share any letter in the sequence. Second, the support (number of occurrences), utility and length of the pattern are considered simultaneously. Third, HAOP mining discovers patterns with a self-adaptive gap, which means that users do not need to set the gap constraints. We propose an effective algorithm called HAOP-Miner that involves two key steps: support calculation and candidate pattern generation. For the support calculation, we propose a heuristic algorithm named the Reverse filling (Rf) algorithm, which calculates the support effectively because it avoids creating redundant nodes and does not need to prune redundant and useless nodes after finding an occurrence. Since HAOP mining does not satisfy the Apriori property, a support lower bound method combined with the pattern growth strategy is adopted to generate the candidate patterns. The experimental results first validate the effectiveness of HAOP-Miner, and then demonstrate that HAOP-Miner has better performance than other state-of-the-art algorithms. More importantly, HAOP-Miner makes it easier to mine valuable patterns. The algorithms and datasets are available at https://github.com/wuc567/Pattern-Mining/tree/master/HAOP-Miner.

Introduction

Sequential pattern mining (SPM) (He et al., 2019, Wu, Zhu et al., 2020) is a type of data mining method in which sub-sequences (also known as patterns) are discovered from sequences (Fournier-Viger, Li, Lin, Kiran, & Fujita, 2019). This approach has been widely applied in many fields, such as big data mining (Wu, Zhu, Wu, & Ding, 2013), big data intelligence (Liu et al., 2019, Wu and Wu, 2019), inspection reports (Jiang et al., 2018, Jiang et al., 2019), e-commerce shopping analysis (Le et al., 2015, Nam et al., 2020), and biological sequence analysis (Wang, Hou, & Wang, 2018). Many SPM methods have been proposed to meet a range of requirements, such as contrast SPM (Wu et al., 2021, Ghosh et al., 2017), negative SPM (Dong et al., 2020, Dong et al., 2019, Chen et al., 2019), tri-partition pattern mining (Min, Zhang, Zhai, & Shen, 2020), high utility pattern mining (Choi and Park, 2019, Gan, Lin, Fournier-Viger et al., 2020, Kim et al., 2021) and gap constraint SPM (Wu et al., 2018, Zhang et al., 2007). One of the disadvantages of traditional SPM is that it neglects repetition within a sequence. For example, pattern “AC” occurs in sequence “ABC”, so the support (number of occurrences) of pattern “AC” in sequence “ABC” is 1 according to traditional SPM. However, pattern “AC” occurs more than once in sequence “AACCACC”. If the support of pattern “AC” in this sequence is also counted as 1, the repetition is neglected. To solve this issue and avoid mining some meaningless patterns, gap constraint SPM was proposed. A gap constraint pattern can be expressed as $p = p_1[a,b]p_2 \cdots p_{m-1}[a,b]p_m$, where $a$ and $b$ ($0 \le a \le b$) represent the minimum and maximum numbers of wildcards, respectively. For example, pattern “C[1,2]G” means that there are one or two wildcards between “C” and “G”. Thus, pattern “C[1,2]G” occurs in sequence “CAG” but does not occur in sequence “CG”. Since patterns can be flexibly matched, gap constraint SPM has been used in many applications such as time series (Huang et al., 2019, Miao et al., 2016, Sumalatha and Subramanyam, 2020), biological sequence retrieval (Ghosh et al., 2017) and feature selection (Wei, Xing, Shi, Ji, & Zou, 2017).
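The matching rule for a gap constraint pattern can be illustrated with a short sketch. The function name `occurs` and its argument layout are hypothetical and chosen only for illustration; the sketch assumes a non-empty pattern in which every gap uses the same bounds [a,b], as in the expression above.

```python
from functools import lru_cache


def occurs(letters: str, a: int, b: int, seq: str) -> bool:
    """Check whether letters[0][a,b]letters[1]...[a,b]letters[-1] occurs in seq.

    Illustrative helper (not from the paper): every gap must contain
    between a and b wildcard characters, inclusive.
    """
    m, n = len(letters), len(seq)

    @lru_cache(maxsize=None)
    def match_from(j: int, pos: int) -> bool:
        # letters[j] is already matched at seq[pos] (0-based index).
        if j == m - 1:
            return True
        # The next letter must sit a..b characters further on.
        first = pos + a + 1
        last = min(pos + b + 1, n - 1)
        for nxt in range(first, last + 1):
            if seq[nxt] == letters[j + 1] and match_from(j + 1, nxt):
                return True
        return False

    return any(seq[i] == letters[0] and match_from(0, i) for i in range(n))


print(occurs("CG", 1, 2, "CAG"))  # one wildcard between C and G
print(occurs("CG", 1, 2, "CG"))   # no wildcard available, so no match
```

The two calls reproduce the “C[1,2]G” example: the pattern matches “CAG” but not “CG”.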

One-off SPM (Huang et al., 2009, Wu, Xie et al., 2013) is a branch of gap constraint SPM that mines frequent patterns from sequences under the one-off condition, which means that each letter in the sequence can be used at most once. An illustrative example is given below.

Example 1

Suppose we have sequence $s = s_1 s_2 s_3 s_4 s_5 s_6 s_7 s_8 = \text{ACGAGACG}$ and pattern $p_1 = \text{A}[0,3]\text{G}$, whose two letters are A and G.

From Fig. 1, it can be seen that there are five occurrences of pattern $p_1$ in sequence $s$. The sub-sequence $s_1 s_3$ is an occurrence of pattern $p_1$ that can be written as $\langle 1,3 \rangle$. Similarly, the other four occurrences are $\langle 1,5 \rangle$, $\langle 4,5 \rangle$, $\langle 4,8 \rangle$ and $\langle 6,8 \rangle$. Occurrences $\langle 1,3 \rangle$ and $\langle 1,5 \rangle$ do not satisfy the one-off condition since $s_1$ is used twice. However, occurrences $\langle 1,3 \rangle$ and $\langle 4,5 \rangle$ satisfy the one-off condition since they share no character of the sequence. There are three occurrences, $\langle 1,3 \rangle$, $\langle 4,5 \rangle$ and $\langle 6,8 \rangle$, under the one-off condition. Hence, the support of pattern $p_1$ in sequence $s$ is 3.
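Example 1 can be checked with a brute-force sketch. It enumerates every occurrence and then searches for the largest set of occurrences that share no sequence position. This is not the Rf algorithm proposed in the paper. It is an exhaustive search that is only practical for toy inputs, and the function names are hypothetical.

```python
from itertools import combinations


def occurrences(letters, a, b, seq):
    """List all occurrences of letters[0][a,b]...[a,b]letters[-1] in seq
    as tuples of 1-based positions."""
    n = len(seq)
    results = []

    def extend(partial):
        j = len(partial)
        if j == len(letters):
            results.append(tuple(partial))
            return
        if j == 0:
            candidates = range(1, n + 1)
        else:
            last = partial[-1]
            # Skip between a and b characters after the previous letter.
            candidates = range(last + a + 1, min(last + b + 1, n) + 1)
        for pos in candidates:
            if seq[pos - 1] == letters[j]:
                extend(partial + [pos])

    extend([])
    return results


def one_off_support(occs):
    """Size of the largest subset of occurrences with pairwise disjoint
    positions. Exhaustive search, exponential in len(occs)."""
    for k in range(len(occs), 0, -1):
        for subset in combinations(occs, k):
            used = [pos for occ in subset for pos in occ]
            if len(used) == len(set(used)):
                return k
    return 0


occs = occurrences("AG", 0, 3, "ACGAGACG")
print(occs)                   # the five occurrences listed in Example 1
print(one_off_support(occs))  # support under the one-off condition
```

For the sequence and pattern of Example 1, the sketch finds the five occurrences listed above and a one-off support of 3.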

Current research on one-off SPM ignores the utility (which can be price or profit) of items. For example, although a gene may not appear frequently, its behaviour may be extremely remarkable. If we only consider the number of occurrences of patterns, these highly expressed genes will not be mined (Morteza, Heidar, & Aijun, 2017). We also employ Example 1 to illustrate that high utility pattern mining is more meaningful. The utility of each item is shown in Table 1. The utility of pattern $p_1 = \text{A}[0,3]\text{G}$ in $s$ is $(1+2) \times 3 = 9$ since the utilities of “A” and “G” are 1 and 2, respectively. Similarly, the one-off support of pattern $p_2 = \text{A}[0,3]\text{C}[0,3]\text{G}$ in sequence $s$ is 2, and the utility of pattern $p_2$ is $(1+4+2) \times 2 = 14$. This example shows that the frequency of pattern $p_1$ is greater than that of $p_2$, while the utility of pattern $p_1$ is less than that of $p_2$. Therefore, high utility one-off SPM is worth investigating, and the following two issues should be considered.

  • (1)

    Gap setting. It is very difficult to set suitable gap constraints without prior knowledge, which makes it challenging to discover valuable patterns (Wang, Duan et al., 2016). In Example 1, there is no occurrence for pattern p3=A[4,5]G, which means that improper gap constraints will lead to mining failure.

  • (2)

    Length of the pattern. As the length of the pattern increases, its utility also increases (Lin, Li, Pirouz, Zhang, & Fournier-Viger, 2020). For instance, in Example 1, although $p_3 = \text{C}[0,1]\text{G}[0,1]\text{A}[0,1]\text{G}[0,1]\text{A}[0,1]\text{C}[0,1]\text{G}$ appears only once in sequence $s$, $p_3$ is also a high utility pattern since $PU(p_3, s) = (4+2+1+2+1+4+2) \times 1 = 16$. Clearly, this is not reasonable: the utility is high only because the pattern is long.
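The utility figures quoted for Example 1 can be reproduced with a few lines of code. The item utilities (A = 1, G = 2, C = 4) and the supports (3, 2 and 1) are taken from the text. The `average_utility` function is an assumption: it divides the utility by the pattern length, which is one natural reading of "high-average utility", but this excerpt does not give the paper's formal definition.

```python
# Item utilities as quoted in the worked example (Table 1 of the paper).
UTILITY = {"A": 1, "G": 2, "C": 4}


def pattern_utility(letters: str, support: int) -> int:
    """Sum of the item utilities of the pattern, multiplied by its support."""
    return sum(UTILITY[ch] for ch in letters) * support


def average_utility(letters: str, support: int) -> float:
    """ASSUMPTION: utility divided by pattern length; the paper's exact
    definition is not part of this excerpt."""
    return pattern_utility(letters, support) / len(letters)


examples = [
    ("A[0,3]G", "AG", 3),
    ("A[0,3]C[0,3]G", "ACG", 2),
    ("seven-letter pattern of issue (2)", "CGAGACG", 1),
]
for name, letters, support in examples:
    pu = pattern_utility(letters, support)
    au = average_utility(letters, support)
    print(f"{name}: utility={pu}, length-normalised utility={au:.2f}")
```

Under this assumed normalisation, the seven-letter pattern no longer ranks highest even though its raw utility of 16 is the largest of the three, which is the effect that issue (2) calls for.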

To solve these issues, this paper proposes self-adaptive High-Average utility One-off sequential Pattern (HAOP) mining with the following characteristics: (1) any two occurrences cannot share any letter in the sequence; (2) the support, utility and length of the pattern are considered simultaneously; (3) this method discovers patterns with a self-adaptive gap which means that users do not need to set the gap constraints. The main contributions of this paper are as follows.

  • (1)

    This paper addresses self-adaptive HAOP mining to discover HAOPs and proposes an effective algorithm called HAOP-Miner that contains two key steps: support calculation and candidate pattern generation.

  • (2)

    For the support calculation, we propose a Reverse filling (Rf) strategy which can effectively calculate the support since it avoids creating redundant nodes and does not need to prune the redundant and useless nodes after finding an occurrence.

  • (3)

    HAOP mining does not satisfy the Apriori property, and thus a support lower bound method combined with a pattern growth strategy is proposed to prune the candidate patterns effectively.

  • (4)

    The experimental results validate the effectiveness of HAOP-Miner, and demonstrate that HAOP-Miner has better performance than other state-of-the-art algorithms. More importantly, HAOP-Miner makes it easier to mine valuable patterns.

The remainder of this paper is organised as follows. Section 2 introduces related work. Section 3 defines the problem considered here. Section 4 designs the Rf strategy to calculate the support, describes the high lower bound pattern, which satisfies the Apriori property and is used to prune the candidate patterns effectively, and then proposes the HAOP-Miner algorithm. Section 5 reports the results of experiments on biological sequences and a sales dataset. Finally, Section 6 presents the conclusion of this paper.

Section snippets

Related work

SPM has been widely applied in various fields, such as event logs (Dalmas et al., 2017, Fournier-Viger et al., 2020), data streams (Chen, Xiao, Xin, Lin, & Lin, 2018), transaction databases (Karim, Cochez, Beyan, Ahmed, & Decker, 2018) and biological sequences (Wu, Zhu, He, & Arslan, 2013). Frequent pattern mining considers the support (number of occurrences) of patterns, but ignores the effect of utility (which can be price or profit) on patterns (Yun et al., 2018, Gan, Lin, Zhang et al., 2020),

Problem definition

In this section, we formally define HAOP mining.

Definition 1

A sequence $s$ is described by $s_1 \cdots s_i \cdots s_n$, where $s_i \in \Sigma$ ($1 \le i \le n$), $\Sigma$ represents the set of items in sequence $s$, and the size of $\Sigma$ can be expressed as $|\Sigma|$. Since this paper adopts a self-adaptive gap, we define pattern $p$ as $p_1 * \cdots * p_j * \cdots * p_m$, where $*$ is the traditional wildcard, meaning that any letters can appear between the letters in the pattern.

Definition 2

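$L = \langle l_1, l_2, \ldots, l_m \rangle$ is an occurrence of pattern $p$ in sequence $s$ if and only if $1 \le l_1 < \cdots < l_j < \cdots < l_m \le n$, where $s_{l_j} = p_j$ ($1 \le j \le m$ and $1 \le l_j \le n$).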
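Definition 2 translates directly into a membership test. The sketch below is a minimal illustration with a hypothetical function name. Positions are 1-based, as in the definition, and no gap bounds are checked because the gap is self-adaptive.

```python
def is_occurrence(positions, pattern, seq):
    """Return True if `positions` (1-based) is an occurrence of `pattern`
    in `seq` under the self-adaptive gap of Definition 2."""
    n, m = len(seq), len(pattern)
    if len(positions) != m:
        return False
    # 1 <= l_1 < ... < l_m <= n
    in_range = all(1 <= l <= n for l in positions)
    increasing = all(x < y for x, y in zip(positions, positions[1:]))
    # s_{l_j} = p_j for every j (evaluated only if the positions are valid)
    return in_range and increasing and all(
        seq[l - 1] == p for l, p in zip(positions, pattern)
    )


s = "ACGAGACG"
print(is_occurrence((1, 3), "AG", s))  # A at position 1, G at position 3
print(is_occurrence((1, 8), "AG", s))  # any gap is allowed
print(is_occurrence((1, 2), "AG", s))  # position 2 holds C, not G
```

Note that ⟨1,8⟩ is a valid occurrence here, although it would violate the fixed gap [0,3] used in Example 1.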

Proposed algorithm

Fig. 2 shows the overall workflow of HAOP-Miner. HAOP-Miner has two key steps: support calculation and candidate pattern generation. Section 4.1 shows that the support calculation is an NP-hard problem. To calculate the pattern support, Section 4.2 first explores a heuristic algorithm named the Positive filling (Pf) algorithm, which is the easier one to understand. To overcome the shortcomings of Pf, Section 4.3 proposes a more effective heuristic method named the Rf algorithm. Section 4.4 employs a

Experimental results and analysis

We introduce the benchmark datasets in Section 5.1. To evaluate the performance of our approach, we also propose some competing algorithms whose principles are introduced in Section 5.2. Section 5.3 shows the efficiency of HAOP-Miner. Section 5.4 validates the running performance of HAOP-Miner. Mining performance is evaluated in Section 5.5. Section 5.6 further reports the application to the BABYSALE dataset.

Conclusion

To consider the utility of items and to solve the problem of setting gaps without prior knowledge, this paper explores self-adaptive HAOP mining, which can discover some extremely important but low-frequency patterns. Self-adaptive HAOP mining, as a kind of repetitive SPM (or sequence pattern mining), considers the support (number of occurrences), utility and pattern length simultaneously. This paper proposes an efficient algorithm named HAOP-Miner, which has two key steps: support

CRediT authorship contribution statement

Youxi Wu: Conceptualization, Methodology, Formal analysis, Supervision, Funding acquisition. Rong Lei: Software, Writing - original draft, Validation, Investigation, Data curation. Yan Li: Investigation, Writing - review & editing. Lei Guo: Validation, Resources. Xindong Wu: Supervision, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was partly supported by the National Natural Science Foundation of China (61976240, 52077056, 917446209), the National Key Research and Development Program of China (2016YFB1000901), and the Natural Science Foundation of Hebei Province, China (Nos. F2020202013, E2020202033).

References (61)

  • S. Sumalatha et al. Distributed mining of high utility time interval sequential patterns using mapreduce approach. Expert Systems with Applications (2020)
  • M.K. Warmuth et al. On the complexity of iterated shuffle. Journal of Computer and System Sciences (1984)
  • X. Wu et al. PMBC: Pattern mining from biological sequences with wildcard constraints. Computers in Biology and Medicine (2013)
  • F. Xie et al. Efficient sequential pattern mining with wildcards for keyphrase extraction. Knowledge-Based Systems (2017)
  • U. Yun et al. Damped window based high average utility pattern mining over data streams. Knowledge-Based Systems (2018)
  • U. Yun et al. An efficient algorithm for mining high utility patterns from incremental databases with one database scan. Knowledge-Based Systems (2017)
  • C.F. Ahmed et al. A novel approach for mining high-utility sequential patterns in sequence databases. ETRI Journal (2010)
  • Chen, X., Xiao, R., Xin, D., Lin, X. & Lin, L. (2018). Constructing a novel spark-based distributed maximum frequent...
  • X. Chen et al. Sentiment classification using negative and intensive sentiment supplement information. Data Science and Engineering (2019)
  • B. Ding et al. Efficient mining of closed repetitive gapped subsequences from a sequence database
  • Dong, X., Qiu, P., L, J. Cao, L. & Xu, T. (2019). Mining top-k useful negative sequential patterns via learning. IEEE...
  • X. Dong et al. e-RNSP: An efficient method for mining repetition negative sequential patterns. IEEE Transactions on Cybernetics (2020)
  • P. Fournier-Viger et al. Mining cost-effective patterns in event logs. Knowledge-Based Systems (2020)
  • W. Gan et al. HUOPM: High-utility occupancy pattern mining. IEEE Transactions on Cybernetics (2020)
  • W. Gan et al. Utility mining across multi-sequences with individualized thresholds. ACM/IMS Transactions on Data Science (2020)
  • F. Heimerl et al. Word cloud explorer: Text analytics based on word clouds
  • J. Huang et al. Mining frequent and top-K high utility time interval-based events with duration patterns. Knowledge and Information Systems (2019)
  • Y. Huang et al. Mining frequent patterns with gaps and one-off condition
  • Y. Irfan et al. An efficient tree-based algorithm for mining high average-utility itemset. IEEE Access (2019)
  • H. Jiang et al. Fuzzy clustering of crowdsourced test reports for apps. ACM Transactions on Internet Technology (TOIT) (2018)