Elsevier

Information Sciences

Volume 178, Issue 6, 15 March 2008, Pages 1461-1478
Similar sequence matching supporting variable-length and variable-tolerance continuous queries on time-series data stream

https://doi.org/10.1016/j.ins.2007.10.026

Abstract

We propose a new similar sequence matching method that efficiently supports variable-length and variable-tolerance continuous query sequences on time-series data streams. Earlier methods do not adequately support variable lengths or variable tolerances for continuous query sequences when too many query sequences are registered to handle in main memory. To support variable-length query sequences, we use the window construction mechanism, which divides long sequences into smaller windows for indexing and searching. To support variable-tolerance query sequences, we present a new notion of intervaled sequences, whose individual entries are intervals of real numbers rather than single real numbers. We also propose a new similar sequence matching method based on these notions and formally prove its correctness. In addition, we show that our method has the prematching characteristic, which finds future candidates of similar sequences in advance. Experimental results show that our method outperforms the naive one by 2.6–102.1 times and the existing methods in the literature by 1.4–9.8 times over the entire ranges of parameters tested when the query selectivities are low (<32%), a range that is practically useful in large database applications.

Introduction

A time-series is a sequence of real numbers representing values at specific time points [8], [9], [18]. Examples of time-series data include stock prices, weather data, network traffic data, and sensor data. There have been a number of efforts to handle time-series data stored in databases [1], [8], [17], [18], [21], [27]. Recently, the data stream has become increasingly important, bringing new requirements driven by advances in network technology and mobile/sensor devices in emerging ubiquitous environments [6], [9], [10], [14], [20]. A data stream is a sequence of data entries that continuously arrive in sequential order [2], [19]. Examples include real-time sensor data, frequently changing trajectories of moving objects, and continuous flows of network packet data. The primary characteristic of a data stream is that the data are generated continuously, rapidly, unboundedly, and in real time. Due to this characteristic, the data cannot be stored in their entirety in a database but must be processed on the fly. Hence, queries on data streams are not one-time queries, which are executed only once against stored data, but continuous queries, which are registered in advance and run repeatedly over a period of time [2], [19]. Thus, query processing in data streams can be seen as the dual of that in databases: the former searches stored continuous queries for newly arriving data entries, while the latter searches stored data for one-time queries [5], [15].

We define the time-series data that arrive in the form of data streams as the data stream sequence and the time-series data registered in the database as the continuous query sequence. We then define similar sequence matching on data streams as finding the continuous query sequences that match the incoming data stream sequence up to a specific point in time within the user-specified tolerance.
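The problem definition above can be illustrated with a brute-force baseline. The following is a minimal sketch, not the method proposed in this paper. It assumes the Euclidean distance as the similarity measure and assumes that a query matches when the most recent len(Q) stream entries lie within the query's tolerance. The names `naive_match`, `euclidean`, and the sample queries are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical registry entry: (query sequence, tolerance).
Query = Tuple[Sequence[float], float]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length sequences (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def naive_match(stream: Sequence[float], queries: Dict[str, Query]) -> List[str]:
    """Return the ids of queries matching the most recent entries of the stream."""
    matches = []
    for qid, (q, eps) in queries.items():
        if len(q) > len(stream):
            continue  # not enough stream entries have arrived yet
        # Compare the query with the stream suffix of the same length.
        suffix = stream[len(stream) - len(q):]
        if euclidean(suffix, q) <= eps:
            matches.append(qid)
    return matches


# Queries with different lengths and different tolerances.
queries = {
    "short_tight": ([1.0, 2.0], 0.5),
    "long_loose": ([0.0, 1.0, 2.0, 3.0], 2.5),
    "short_far": ([5.0, 5.0], 1.0),
}
stream = [0.0, 1.0, 1.2, 2.1]
print(naive_match(stream, queries))  # ['short_tight', 'long_loose']
```

This baseline scans every registered query on each arrival, which is the cost that an index over the query sequences is meant to avoid.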

We focus on similar sequence matching that can support a large number of variable-length and variable-tolerance continuous query sequences. Query sequences can be registered by many different users with different requirements on the lengths and tolerances. Nevertheless, existing results reported in the literature either support only fixed-length or fixed-tolerance continuous query sequences [9], [14] or are unable to support a large number of query sequences with variable lengths or variable tolerances [10]. Other recent similar sequence matching methods reported in the data stream environment are only capable of handling one continuous query sequence [6], [20].

We propose a new similar sequence matching method on data streams, which we call Similar Sequence Matching based on Intervaled Sequence (SSM-IS), that efficiently supports a large number of variable-length and variable-tolerance continuous query sequences. First, to support variable tolerances, we propose a new notion of the intervaled sequence. The intervaled sequence is defined as a sequence whose individual entries are intervals of real numbers rather than single real numbers. Using this notion, SSM-IS models a pair 〈query sequence, tolerance〉 as an intervaled sequence. Thus, it can efficiently support variable-tolerance query sequences by indexing and searching query sequences together with their tolerances. We note that the work by Gao et al. [10] cannot support variable-tolerance continuous query sequences since it represents query sequences and tolerances separately. Next, to support variable-length continuous query sequences, we employ the window construction mechanism used in the traditional time-series subsequence matching methods [8], [17], [18]. The window construction mechanism divides long sequences into smaller windows. These windows are then used for indexing and searching.
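The two notions above can be made concrete with a small sketch. It rests on assumptions that the text here does not confirm: each entry q_i with tolerance eps is assumed to become the interval [q_i - eps, q_i + eps], and windows are assumed to be disjoint and of fixed size w, with a short tail dropped. The helper names `to_intervaled`, `split_into_windows`, and `inside` are hypothetical. Under the Euclidean distance, a sequence within distance eps of the query must have every entry inside the corresponding interval, so the interval test can serve as a filter that never dismisses a true match.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]


def to_intervaled(query: Sequence[float], eps: float) -> List[Interval]:
    """Model <query, tolerance> as intervals [q_i - eps, q_i + eps] (assumed form)."""
    return [(q - eps, q + eps) for q in query]


def split_into_windows(seq: Sequence, w: int) -> List[list]:
    """Divide a sequence into disjoint windows of size w; a short tail is dropped."""
    return [list(seq[i:i + w]) for i in range(0, len(seq) - w + 1, w)]


def inside(window: Sequence[float], intervals: Sequence[Interval]) -> bool:
    """True if every entry of the window lies in the corresponding interval."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(window, intervals))


# A length-4 query with tolerance 0.5, cut into two windows of size 2.
intervaled = to_intervaled([1.0, 2.0, 3.0, 4.0], 0.5)
windows = split_into_windows(intervaled, 2)
print(windows[0])                      # [(0.5, 1.5), (1.5, 2.5)]
print(inside([1.2, 2.4], windows[0]))  # True
print(inside([1.2, 2.6], windows[0]))  # False
```

Because the tolerance is folded into the intervals, queries with different tolerances can share one index structure, and each window of intervals corresponds to a rectangle in a w-dimensional space.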

We also use prematching and early abandoning [13] in SSM-IS. Prematching is a novel technique that finds not only current candidates, which can be concluded as similar sequences, but also precandidates, which are future candidates, whenever a new data entry arrives. We can use prematching to efficiently process similar sequence matching by reading precandidates and computing their distances in advance. We also use early abandoning, originally proposed by Keogh et al. [13], to reduce needless distance computation by computing the intermediate distance incrementally whenever a new data entry arrives and by abandoning as early as possible the candidates that cannot possibly be concluded to be similar sequences.
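Early abandoning as described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the SSM-IS implementation: it assumes the Euclidean distance, tracks the squared partial distance, and uses a hypothetical class name `IncrementalCandidate`. Since the squared partial distance never decreases as entries arrive, a candidate can be safely dropped as soon as it exceeds the squared tolerance.

```python
from typing import Sequence


class IncrementalCandidate:
    """Tracks one candidate query while stream entries arrive one at a time."""

    def __init__(self, query: Sequence[float], eps: float) -> None:
        self.query = list(query)
        self.eps_sq = eps * eps   # compare squared values to avoid sqrt
        self.pos = 0              # number of query entries compared so far
        self.dist_sq = 0.0        # intermediate squared distance
        self.abandoned = False

    def status(self) -> str:
        if self.abandoned:
            return "abandoned"
        if self.pos == len(self.query):
            return "match"
        return "pending"

    def feed(self, value: float) -> str:
        """Consume one stream entry and return the updated status."""
        if self.abandoned or self.pos == len(self.query):
            return self.status()  # already decided; nothing left to compute
        diff = value - self.query[self.pos]
        self.dist_sq += diff * diff
        self.pos += 1
        if self.dist_sq > self.eps_sq:
            self.abandoned = True  # the distance can only grow from here
        return self.status()


far = IncrementalCandidate([1.0, 2.0, 3.0, 4.0], eps=1.0)
print([far.feed(v) for v in [1.1, 2.0, 5.0, 4.0]])
# ['pending', 'pending', 'abandoned', 'abandoned']

near = IncrementalCandidate([1.0, 2.0, 3.0, 4.0], eps=1.0)
print([near.feed(v) for v in [1.0, 2.5, 3.0, 4.5]])
# ['pending', 'pending', 'pending', 'match']
```

In the first run the third entry alone contributes a squared difference of 4, which exceeds the squared tolerance of 1, so the fourth entry requires no distance computation.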

The rest of this paper is organized as follows. In Section 2, we introduce the data stream sequence and continuous query sequence, and then, formally define the problem of similar sequence matching on data streams. In Section 3, we review previous work on similar sequence matching on data streams. In Section 4, we propose a new model of representing continuous query sequences for performing similar sequence matching, and then, present the notion of prematching in the model. In Section 5, we present SSM-IS, the similar sequence matching method that supports variable-length and variable-tolerance continuous query sequences. In Section 6, we present the results of performance evaluation. Finally, in Section 7, we summarize and conclude the paper.

Section snippets

Preliminaries

In this section, we formally define the data stream sequence, continuous query sequence, and the similar sequence matching problem. We first summarize in Table 1 the notation to be used throughout the paper.

Related work

In this section, we review previous work on similar sequence matching on data streams. For similar sequence matching on static time-series databases, readers are referred to the literature for whole matching [1], [26] and for subsequence matching [8], [16], [17].

Gao and Wang [9] proposed a similar sequence matching method (let us call it Gao-1) based on prediction. Gao-1 predicts future stream entries based on the previous entries that already arrived, and computes the distances between query

Modeling continuous query sequences

To support variable tolerances, we need new models of query sequences and of similar sequence matching. In earlier work in the literature [8], [10], [17], only query sequences are indexed. Tolerances are independently used to construct the region queries when searching the index; i.e., the tolerance is modeled as part of search operations rather than as part of the index. This approach is reasonable when supporting only fixed tolerances, i.e., when all query sequences have the same tolerance.

Similar sequence matching algorithm that supports variable-length and variable-tolerance query sequences

In this section, we propose the index building and matching algorithms of SSM-IS. Fig. 5 shows an overview of similar sequence matching in SSM-IS. As mentioned in Section 4, SSM-IS consists of two steps: the index building step (① in Fig. 5) and the matching step (②–⑤). The index building step is executed only once before the matching step. Here, all the registered query sequences are stored in the multidimensional index. The matching step is executed whenever a new data entry arrives. Here,

Performance evaluation

In this section, we explain the results of performance evaluation. We describe the experimental data and environments in Section 6.1 and present results of the experiments in Section 6.2.

Conclusions

We proposed a new method for similar sequence matching on data streams, which we call Similar Sequence Matching based on Intervaled Sequence (SSM-IS). SSM-IS efficiently supports variable-length and variable-tolerance continuous query sequences for large databases stored on disk. Supporting variable lengths and variable tolerances is important in applications such as real-time sensor data, stock prices, and trajectories of moving objects.

The contributions of this paper are summarized as

Acknowledgement

This work was supported by the Korea Science and Engineering Foundation (KOSEF) and the Korean Government (MOST) through the NRL Program (No. R0A-2007-000-20101-0).

References (27)

  • A.F. Sheta et al., Time-series forecasting using GA-tuned radial basis functions, Information Sciences (2001)
  • M. Zhou et al., A segment-wise time warping method for time scaling searching, Information Sciences (2005)
  • R. Agrawal, C. Faloutsos, A. Swami, Efficient similarity search in sequence databases, in: Proceedings of the Fourth...
  • B. Babcock et al., Models and issues in data stream systems, in: Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART...
  • N. Beckmann et al., The R∗-tree: an efficient and robust access method for points and rectangles, in: Proceedings of...
  • K.P. Chan, A.W.C. Fu, Efficient time series matching by wavelets, in: Proceedings of the 15th IEEE International...
  • S. Chandrasekaran, M.J. Franklin, Streaming queries over streaming data, in: Proceedings of the 28th International...
  • Y. Chen et al., SpADe: on shape-based pattern detection in streaming time series, in: Proceedings of the 23rd IEEE...
  • C.K. Chui, An Introduction to Wavelets (1992)
  • C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast subsequence matching in time-series databases, in: Proceedings of...
  • L. Gao, X.S. Wang, Continually evaluating similarity-based pattern queries on a streaming time series, in: Proceedings...
  • L. Gao, Z. Yao, X.S. Wang, Evaluating continuous nearest neighbor queries for streaming time series via pre-fetching,...
  • D.S. Hirschberg, Algorithms for the longest common subsequence problem, Journal of the ACM (1977)