Elsevier

Knowledge-Based Systems

Volume 79, May 2015, Pages 68-79
Mining closed partially ordered patterns, a new optimized algorithm

https://doi.org/10.1016/j.knosys.2014.12.027

Abstract

Sequence databases of increasing size are now available in several domains. Exploring such databases with new pattern mining approaches involving new data structures is thus important. This paper investigates this data mining challenge by presenting OrderSpan, an algorithm that extracts a set of closed partially ordered patterns from a sequence database. It combines well-known properties of prefixes and suffixes. Furthermore, we extend OrderSpan by adapting efficient optimizations used in the sequential pattern mining domain. The proposed method is flexible and follows the sequential pattern paradigm. It explores the search space more efficiently, as it skips redundant branches. Experiments were performed on different real datasets to show (1) the effectiveness of the optimized approach and (2) the benefit of closed partially ordered patterns with respect to closed sequential patterns.

Introduction

Due to the exponential growth of temporal and spatiotemporal databases, sequential pattern mining has become a very active research area. Many studies have demonstrated the usefulness of such patterns for analysis [1], classification [2], [3] or prediction [4]. These patterns were introduced in [5] and are an extension of association rules [6]. Several algorithms to mine such patterns have been proposed and are presented in [7], [8]. They are used when information is totally ordered according to a specific criterion, which is usually temporal. For instance, let us take the well-known “market basket” problem. We consider a customer database where the pattern (Bread)(Chocolate) is found. This means that the product Bread is frequently purchased before the product Chocolate. Mining such related items according to temporal aspects is very useful for specialists in various domains such as marketing [9], software engineering [10] or medicine [11]. Despite their advantages, sequential patterns often convey limited information since they only provide totally ordered information about data. For example, let us consider a second pattern, (Bread)(Milk), discovered in the same database. If these two patterns describe the same customers, their coexistence is not taken into account by sequential pattern approaches. However, they can be synthesized via partial ordering.
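To make the market basket example concrete, the following minimal Python sketch counts how many customer sequences contain a sequential pattern. The toy database, the helper names (`is_subsequence`, `support`, `seq`) and the extra item Butter are assumptions introduced for illustration only; they do not come from the paper.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]


def is_subsequence(pattern: Sequence[Itemset], sequence: Sequence[Itemset]) -> bool:
    """Return True if each itemset of `pattern` is included, in order,
    in a distinct itemset of `sequence` (greedy left-to-right matching)."""
    pos = 0
    for wanted in pattern:
        # Advance to the first remaining itemset that contains `wanted`.
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match strictly later
    return True


def support(pattern: Sequence[Itemset], database: List[Sequence[Itemset]]) -> int:
    """Number of sequences of the database that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database)


def seq(*itemsets) -> List[Itemset]:
    """Small helper to write sequences of itemsets compactly."""
    return [frozenset(i) for i in itemsets]


# Hypothetical customer database: one sequence of purchases per customer.
database = [
    seq({"Bread"}, {"Chocolate"}, {"Milk"}),
    seq({"Bread"}, {"Milk"}, {"Chocolate"}),
    seq({"Bread", "Butter"}, {"Milk", "Chocolate"}),
    seq({"Chocolate"}, {"Bread"}),
]

bread_chocolate = seq({"Bread"}, {"Chocolate"})
bread_milk = seq({"Bread"}, {"Milk"})

print(support(bread_chocolate, database))
print(support(bread_milk, database))
```

In this toy database both patterns are supported by the same three customers, which is the kind of coexistence that two separate sequential patterns do not express.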

Fig. 1 presents a so-called partially ordered pattern that combines the two previous sequential patterns. This new pattern means that customers frequently purchase the product Bread before purchasing the two other products, Chocolate and Milk, which are themselves not ordered. Partially ordered patterns can be used in sequential databases and have many advantages: (1) they provide more information on the order among elements; (2) they are represented as a directed acyclic graph, which makes them easier to interpret; and (3) they summarize sequential pattern sets.
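The pattern described for Fig. 1 can be sketched as a small directed acyclic graph. The Python code below is an illustration under a stated assumption: a sequence is taken to support a partially ordered pattern when every source-to-sink path of the graph, read as a sequential pattern, is a subsequence of it. The node identifiers, the `labels`/`edges` encoding and the function names are hypothetical and are not the data structure used by OrderSpan.

```python
from typing import Dict, FrozenSet, Iterator, List, Sequence

Itemset = FrozenSet[str]


def is_subsequence(pattern: Sequence[Itemset], sequence: Sequence[Itemset]) -> bool:
    """Greedy test that `pattern` is contained, in order, in `sequence`."""
    pos = 0
    for wanted in pattern:
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def paths(labels: Dict[int, Itemset], edges: Dict[int, List[int]]) -> Iterator[List[Itemset]]:
    """Enumerate every source-to-sink path of the DAG as a list of itemsets."""
    targets = {v for succ in edges.values() for v in succ}
    sources = [n for n in labels if n not in targets]

    def walk(node: int) -> Iterator[List[Itemset]]:
        successors = edges.get(node, [])
        if not successors:  # sink reached: the path ends here
            yield [labels[node]]
        for nxt in successors:
            for tail in walk(nxt):
                yield [labels[node]] + tail

    for source in sources:
        yield from walk(source)


def supports(sequence: Sequence[Itemset], labels, edges) -> bool:
    """Assumed support test: all paths of the pattern occur in the sequence."""
    return all(is_subsequence(p, sequence) for p in paths(labels, edges))


# Bread precedes both Chocolate and Milk; Chocolate and Milk are unordered.
labels = {0: frozenset({"Bread"}), 1: frozenset({"Chocolate"}), 2: frozenset({"Milk"})}
edges = {0: [1, 2]}

s1 = [frozenset({"Bread"}), frozenset({"Milk"}), frozenset({"Chocolate"})]
s2 = [frozenset({"Bread"}), frozenset({"Chocolate"})]
s3 = [frozenset({"Chocolate"}), frozenset({"Bread"}), frozenset({"Milk"})]

print(supports(s1, labels, edges), supports(s2, labels, edges), supports(s3, labels, edges))
```

Only the first sequence satisfies both orderings: the second lacks Milk, and in the third Chocolate is purchased before Bread.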

In a previous paper [12], we presented a method designed to directly extract closed partially ordered patterns in the general case of itemset sequences with item repetitions. In the present paper, we propose an improvement of our algorithm:

  • Based on the property presented in [13], we present an optimized version of the so-called OrderSpan algorithm that explores the search space and outputs the complete set of closed partially ordered patterns.

  • We use a new data structure to represent patterns in the algorithm. Properties of this new data structure lead to a different and generic way to remove the redundancy in patterns.

  • We provide a complexity analysis of the approach and an upper bound on the number of extracted patterns given a minimum support.

The method proposed in [12] is close to the non-optimized algorithm presented in this paper. As we will see, the main difference is the use of an expanded data structure that makes it easy to add effective optimizations during the process. We thus integrated optimizations from [13].

Building on sequential pattern mining work, OrderSpan extracts partially ordered patterns using the prefix and suffix properties of sequences. We opted to extract closed partially ordered patterns because they provide a compact representation of all partially ordered patterns. Thus, the output result set is smaller, yet the complete set of all partially ordered patterns can still be retrieved from it: there is no information loss. Our approach follows the Pattern-Growth paradigm on sequences, thus it is related to other approaches in sequential pattern mining. Some of these methods [13], [14] are optimized to explore the search space of closed sequential patterns in a very efficient way. These optimizations rely on properties that prune the search space and thereby reduce its exploration. Thus, we analyzed closed sequential pattern properties that can be applied to the problem of mining closed partially ordered patterns. We adapted the optimization based on the equivalence databases proposed in CloSpan [13]. This property efficiently prunes the search space in the case of sequential pattern mining. We generalized it to the sub-search space that corresponds to a closed partially ordered pattern.
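The idea that a closed set is smaller yet lossless can be illustrated on the simpler case of closed sequential patterns. The sketch below is not OrderSpan and does not handle partially ordered patterns; it assumes a pattern is closed when no strict super-sequence has the same support, and the pattern supports listed are hypothetical values chosen for the example.

```python
from typing import Dict, FrozenSet, Sequence, Tuple

Itemset = FrozenSet[str]
Pattern = Tuple[Itemset, ...]


def is_subsequence(pattern: Sequence[Itemset], sequence: Sequence[Itemset]) -> bool:
    """Greedy test that `pattern` is contained, in order, in `sequence`."""
    pos = 0
    for wanted in pattern:
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def closed_only(patterns: Dict[Pattern, int]) -> Dict[Pattern, int]:
    """Keep patterns that have no different super-pattern with equal support."""
    return {
        p: s
        for p, s in patterns.items()
        if not any(q != p and t == s and is_subsequence(p, q) for q, t in patterns.items())
    }


def recovered_support(pattern: Pattern, closed: Dict[Pattern, int]) -> int:
    """Support of any frequent pattern, recovered from the closed set alone:
    the largest support among the closed patterns that contain it."""
    return max((s for q, s in closed.items() if is_subsequence(pattern, q)), default=0)


B, C, M = frozenset({"Bread"}), frozenset({"Chocolate"}), frozenset({"Milk"})

# Hypothetical frequent sequential patterns with their supports.
frequent = {(B,): 4, (C,): 4, (M,): 3, (B, C): 3, (B, M): 3}

closed = closed_only(frequent)
print(len(frequent), len(closed))
print(recovered_support((M,), closed))
```

Here (Milk) is dropped because (Bread)(Milk) has the same support, and its support can still be read back from the closed set.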

This paper is organized as follows. Section 2 gives some preliminary definitions on sequences and partially ordered patterns. Section 3 describes existing studies on partially ordered pattern mining. Section 4 introduces the OrderSpan algorithm including an optimization step and complexity analysis. Experimental results are presented in Section 5. Firstly, we compare the non-optimized and the optimized algorithm on a set of examples. Secondly, we compare the optimized version of OrderSpan with the algorithm proposed in [15]. Finally, we study the semantic aspects of closed partially ordered patterns.

Section snippets

Problem definition

Before presenting the partially ordered pattern concept, we provide some important definitions relative to closed sequential pattern mining. As we will see later, a partially ordered pattern is a more complex structure composed of closed sequential patterns. Let us first define a sequence (Definition 1), sub-sequence (Definition 2), a sequential pattern (Definition 3) and a closed sequential pattern (Definition 4).

Definition 1 Sequence

Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of items. An itemset IS is a non-empty, unordered,

Related work

In the literature, po-pattern mining has been studied in two main contexts. The first involves mining po-patterns as frequent episodes occurring within a single sequence of events. In the second one, po-patterns are mined in a sequence database.

The OrderSpan algorithm

We now present the OrderSpan algorithm which is designed to meet the previously outlined challenges inherent to po-pattern mining: (1) mining po-patterns directly from a sequence database; (2) focusing extraction on closed po-patterns in order to reduce the result size; and (3) considering sequences of itemsets with repetitive items. This algorithm relies on a two-phase approach, based on the prefix and suffix properties of sequences. The following subsection presents the Pattern-Growth

Experiments

In this section, we report tests conducted on sequence databases with different characteristics. The experiments were performed on a laptop computer with an Intel Core i7 and 8 GB main memory, running Debian stable 7.0. We implemented the OrderSpan algorithm in C++. The experiments are divided into two subsections: (1) a performance study of the algorithm with respect to the computation time; (2) a qualitative study of closed po-patterns compared to closed seq-patterns.

Conclusion

Closed po-pattern mining requires the development of new techniques to extract such patterns in the general context of large temporal databases. Greater complexity and a much vaster search space require optimization techniques to efficiently discover po-patterns.

This paper presents OrderSpan, an algorithm which can be used to efficiently mine the complete set of closed po-patterns. Our method uses both the prefix and suffix properties of seq-patterns, based on ForwardTreeMining and

Acknowledgment

This work was funded by the French National Research Agency (ANR), as part of the ANR11_MONU14 Fresqueau project.

References (27)

  • C.H. Mooney et al.

    Sequential pattern mining – approaches and algorithms

    ACM Comput. Surveys (CSUR)

    (2013)
  • A. George et al.

    DRL-prefixspan: a novel pattern growth algorithm for discovering downturn, revision and launch (DRL) sequential patterns

    Central Eur. J. Comput. Sci.

    (2012)
  • J. Ren, L. Wang, J. Dong, C. Hu, K. Wang, A novel sequential pattern mining algorithm for the feature discovery of...