A fast algorithm for order-preserving pattern matching

https://doi.org/10.1016/j.ipl.2014.10.018

Highlights

  • We present a new method of deciding the order-isomorphism between two strings.

  • We show that the bad character rule can be applied to the OPPM problem.

  • We present a space-efficient algorithm computing the shift table for text search.

  • We present a linear-time algorithm for an integer alphabet in the worst case.

Abstract

Given a text T and a pattern P, the order-preserving pattern matching (OPPM) problem is to find all substrings of T which have the same relative orders as P. OPPM has been studied in fields where the patterns of interest are determined by relative orders rather than by absolute values. In this paper, we present a method of deciding the order-isomorphism between two strings even when they contain repeated characters. Then, we show that the bad character rule of the Horspool algorithm for generic pattern matching problems can be applied to the OPPM problem, and we present a space-efficient algorithm for computing shift tables for text search. Finally, we combine our bad character rule with the KMP-based algorithm to improve the worst-case running time. We give experimental results to show that our algorithm is about 2 to 6 times faster than the KMP-based algorithm in reasonable cases.

Introduction

Given a text T and a pattern P, the order-preserving pattern matching (OPPM for short) problem is to find all substrings of T which have the same relative orders as P. For example, when P=(35,40,23,40,40,28,30) and T=(10,20,15,28,32,12,32,32,20,25,15,25) are given, P has the same relative orders as the substring T′=(28,32,12,32,32,20,25) of T. In T′ (resp. P), the first character 28 (resp. 35) is the 4th smallest, the second character 32 (resp. 40) is the 5th smallest, the third character 12 (resp. 23) is the smallest, and so on. See Fig. 1. OPPM has been studied in fields where the patterns of interest are determined by relative orders rather than by absolute values. For example, it can be applied to time series analysis, such as share prices on stock markets, and to melody matching between two musical scores [2].
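
The example above can be checked with a brute-force sketch. It is not the algorithm of this paper: it tests every window of T against P by comparing all pairs of positions, assuming that two strings are order-isomorphic exactly when x[i] ≤ x[j] holds if and only if y[i] ≤ y[j] holds. The function names `is_order_isomorphic` and `naive_oppm` are illustrative only.

```python
def is_order_isomorphic(x, y):
    """Return True if x and y have the same relative orders (ties included)."""
    if len(x) != len(y):
        return False
    n = len(x)
    # Pairwise check: x[i] <= x[j] must hold exactly when y[i] <= y[j] holds.
    return all((x[i] <= x[j]) == (y[i] <= y[j])
               for i in range(n) for j in range(n))


def naive_oppm(text, pattern):
    """Return the start positions of all order-preserving matches."""
    m = len(pattern)
    return [s for s in range(len(text) - m + 1)
            if is_order_isomorphic(text[s:s + m], pattern)]


P = (35, 40, 23, 40, 40, 28, 30)
T = (10, 20, 15, 28, 32, 12, 32, 32, 20, 25, 15, 25)
print(naive_oppm(T, P))  # [3]: the window starting at 28 is the only match
```

This check costs quadratic time per window, so it is useful only as a reference against which faster methods can be validated.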

Recently, several results were presented on the OPPM problem. To state the problem, order-isomorphism must first be defined. Kim et al. [2] defined order-isomorphism as the equivalence of the permutations obtained from the strings, under the assumption that all the characters in a string are distinct. Given T (|T|=n) and P (|P|=m), they proposed an algorithm for the OPPM problem running in O(n + m log m) time based on the Knuth–Morris–Pratt (KMP) algorithm [3]. Meanwhile, Kubica et al. [4] defined order-isomorphism as the equivalence of all relative orders between two strings, and presented a method of deciding the order-isomorphism of two strings even when they contain repeated characters. They independently proposed an algorithm for the OPPM problem based on the KMP algorithm running in O(n + m log m) time for a general alphabet and O(n + m) time for an integer alphabet whose characters can be sorted in linear time. More recently, Crochemore et al. [5] introduced order-preserving suffix trees and suggested an algorithm that finds all occurrences of P in T in O(m+z) time, where z is the number of occurrences.

In this paper, we propose fast algorithms for the OPPM problem based on the Horspool algorithm [6], [7], [8]. Experimental results show that our algorithms are about 2 to 6 times faster than the KMP-based algorithm in reasonable cases. Our contributions are as follows.

  • We present a method of deciding the order-isomorphism between two strings even when they contain repeated characters. We show that Kubica et al.'s method [4] may decide it incorrectly when there are repeated characters.

  • We show that the bad character rule can be applied to the OPPM problem by defining a group of characters as one character. Kim et al. [2] mentioned the hardness of applying the Boyer–Moore algorithm [9] to the OPPM problem. The good suffix rule could be well-defined but the bad character rule could not be directly applied to the OPPM problem.

  • We present a space-efficient algorithm computing the shift table for text search based on a factorial number system. Let q be the size of the group of characters and |Σ| be the size of the alphabet. Then, our algorithm uses O(q!) space for the shift table, while the algorithms of [6], [7] for the generic pattern matching problem use O(|Σ|^q) space for the shift table.

  • We also show that our bad character rule can be combined with the KMP-based algorithm to improve the worst-case running time of [1]. The combined algorithm guarantees O(n + m log m) time for a general alphabet and O(n + m) time for an integer alphabet in the worst case when q is a constant.
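
The bad character rule over groups of q characters and the O(q!)-sized shift table can be illustrated with a minimal sketch. The exact encoding used in the paper is not shown in this excerpt, so the fingerprint below is an assumption: digit i counts the earlier elements of the q-gram that are at most the i-th one, and the digits are combined in a factorial number system. All function names and the default q=3 are hypothetical. With repeated characters, q-grams that are not order-isomorphic may share a fingerprint, so the table acts only as a filter and every window is still verified in full.

```python
from math import factorial


def is_order_isomorphic(x, y):
    """Pairwise check: x[i] <= x[j] exactly when y[i] <= y[j]."""
    n = len(x)
    return len(y) == n and all((x[i] <= x[j]) == (y[i] <= y[j])
                               for i in range(n) for j in range(n))


def fingerprint(gram):
    """Map a q-gram to an integer in [0, q!) (assumed encoding).

    Digit i counts earlier elements <= gram[i], so 0 <= digit <= i.
    Order-isomorphic q-grams always receive the same value.
    """
    value = 0
    for i in range(1, len(gram)):
        digit = sum(1 for j in range(i) if gram[j] <= gram[i])
        value += digit * factorial(i)
    return value


def build_shift_table(pattern, q):
    """Horspool-style shift table indexed by q-gram fingerprint."""
    m = len(pattern)
    # Default: no q-gram of the pattern (except possibly the last) matches.
    table = [m - q + 1] * factorial(q)
    # q-grams ending at positions q-1 .. m-2; rightmost occurrence wins.
    for i in range(q - 1, m - 1):
        table[fingerprint(pattern[i - q + 1:i + 1])] = m - 1 - i
    return table


def horspool_oppm(text, pattern, q=3):
    """Return start positions of order-preserving matches of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or n < m:
        return []
    q = min(q, m)
    shift = build_shift_table(pattern, q)
    matches, s = [], 0
    while s <= n - m:
        window = text[s:s + m]
        if is_order_isomorphic(window, pattern):  # full verification
            matches.append(s)
        # Shift by the fingerprint of the last q characters of the window.
        s += shift[fingerprint(window[m - q:])]
    return matches


P = (35, 40, 23, 40, 40, 28, 30)
T = (10, 20, 15, 28, 32, 12, 32, 32, 20, 25, 15, 25)
print(horspool_oppm(T, P))  # [3]
```

On this input the search inspects only the windows at positions 0 and 3, shifting by 3 and then by 5. The table has q! entries regardless of the alphabet size, which is the space saving described in the list above; the verification step here is the quadratic pairwise check and would be replaced by a faster test in practice.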

Section snippets

Preliminaries

Let Σ denote an alphabet and σ=|Σ|. Let |x| denote the length of a string x. A string x is described by a sequence of characters (x[0], x[1], …, x[|x|−1]).

Now, we formally define the order-isomorphism and the order-preserving pattern matching problem. Two strings x and y of the same length over Σ are called order-isomorphic, written x ≈ y, if x[i] ≤ x[j] ⇔ y[i] ≤ y[j] for 0 ≤ i, j < |x| [4]. If two strings x and y are not order-isomorphic, we write x ≉ y. Given a text T[0..n−1] and a pattern P[0..m−1], we say that

New decision of order-isomorphism

In this section, we show that Kubica et al.'s method [4] for deciding the order-isomorphism of two strings may be incorrect when there are repeated characters, and we present a new method which corrects it. Kubica et al. [4] claimed that the order-isomorphism of two strings x and y could be decided using the location tables as follows.

Lemma 3

(See [4].) Assume that x[0..t] ≈ y[0..t], t < |x|−1, |y|−1 and a = LMax_x[t+1], b = LMin_x[t+1]. Then, x[0..t+1] ≈ y[0..t+1] ⇔ y[a] ≤ y[t+1] ≤ y[b]. In case a or b is

Basic idea

Basically, our algorithm for the OPPM problem is based on the Horspool algorithm widely used for generic pattern matching problems. The Horspool algorithm for generic pattern matching problems uses the shift table for filtering mismatched positions to expect sublinear behavior. (This method is well known as the bad character rule.) That is, when a mismatch occurs, the generic Horspool algorithm shifts the pattern using the shift table by setting the character of T compared with P[m−1] as the

Experimental results

We conducted experiments to compare the practical performance of our algorithms and the KMP-based algorithm. Our Horspool-based algorithm and the worst-case improved algorithm are denoted by OHq and OHy, respectively. The KMP-based algorithm, denoted by OKMP, was implemented based on the algorithms of [2], [4]. We used a naive approach to compute the fingerprints instead of using dynamic order-statistics trees or word-encoded sets because they are less practical when implemented. All algorithms

Acknowledgements

Joong Chae Na was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2014R1A1A1004901). Kunsoo Park was supported by Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2011-0029924). Jeong Seop Sim was supported by the National Research Foundation of Korea (NRF) grant



A preliminary version of this paper appeared in COCOA 2013 [1].
