Information Systems

Volume 26, Issue 4, June 2001, Pages 279-293
Efficient time-series subsequence matching using duality in constructing windows

https://doi.org/10.1016/S0306-4379(01)00021-7

Abstract

In this paper, we propose a new subsequence matching method, Dual Match. Dual Match exploits duality in constructing windows and significantly improves performance. Dual Match divides data sequences into disjoint windows and the query sequence into sliding windows, and is thus the dual of the approach by Faloutsos et al. (Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, Washington, 1994, pp. 419–429) (FRM for short), which divides data sequences into sliding windows and the query sequence into disjoint windows. FRM causes many false alarms (i.e., candidates that do not qualify) because, to save index storage space, it stores minimum bounding rectangles rather than the individual points representing windows. Dual Match solves this problem by storing points directly without incurring excessive storage overhead. Experimental results show that, in most cases, Dual Match provides a large improvement over FRM in both false alarms and performance given the same amount of storage space. In particular, for low selectivities (less than $10^{-4}$), Dual Match improves performance by up to 430-fold. On the other hand, for high selectivities (more than $10^{-2}$), it shows a very minor degradation (less than 29%). For selectivities in between ($10^{-4}$ to $10^{-2}$), Dual Match performs slightly better than FRM. Overall, these results indicate that our approach provides a new paradigm in subsequence matching that improves performance significantly in large database applications.

Introduction

A time-series is a sequence of real numbers, each representing a value at a specific time point. Typical examples of time-series data include stock prices, exchange rates, and weather data. The time-series data stored in a database are called data sequences. Finding data sequences in the database that are similar to a given query sequence is called similar sequence matching [1], [7]. Owing to faster computing speed and larger storage devices, there have been a number of efforts to utilize the large amount of time-series data, and accordingly, similar sequence matching has become an important research topic in data mining [1], [6], [7], [8], [9].

Various similarity models have been studied in similar sequence matching [1], [2], [10]. In this paper, we use the similarity model based on the Euclidean distance [1], [5], [7], [11]. In this model, we say that two sequences X={X[1],…,X[n]} and Y={Y[1],…,Y[n]} of the same length n are similar if the Euclidean distance $D(X,Y) = \sqrt{\sum_{i=1}^{n} (X[i]-Y[i])^2}$ is less than or equal to the user-specified tolerance ε [1]. More specifically, we say that two sequences X and Y are in ε-match if D(X,Y) is less than or equal to ε.
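The ε-match definition can be illustrated with a minimal Python sketch. The function names `euclidean_distance` and `is_epsilon_match` are hypothetical, chosen here for illustration; they do not come from the paper.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance D(X, Y) between two sequences of equal length."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def is_epsilon_match(x, y, eps):
    """True if X and Y are in eps-match, i.e. D(X, Y) <= eps."""
    return euclidean_distance(x, y) <= eps
```

For example, `is_epsilon_match([1, 2, 3], [1, 2, 4], 1.0)` holds because the distance is exactly 1, whereas `[1, 2, 3]` and `[1, 2, 5]` are at distance 2 and do not match at that tolerance.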

Similar sequence matching can be classified into two categories [7]:

  • Whole matching: Given N data sequences S_1, S_2, …, S_N, a query sequence Q, and the tolerance ε, we find those data sequences that are in ε-match with Q. Here, the data and query sequences must have the same length.

  • Subsequence matching: Given N data sequences S_1, S_2, …, S_N of varying lengths, a query sequence Q, and the tolerance ε, we find all the sequences S_i, one or more subsequences of which are in ε-match with Q, and the offsets in S_i of those subsequences.
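As a point of reference for the subsequence matching problem defined above, the following is a minimal sketch of a naive sequential scan, not the indexed method of FRM or Dual Match. The function name `naive_subsequence_match` is hypothetical, and offsets are 0-based here as an assumption of this sketch.

```python
import math


def naive_subsequence_match(data_sequences, query, eps):
    """Return (sequence index, offset) pairs whose subsequence is in eps-match with query.

    Brute-force baseline: every offset of every data sequence is compared
    against the query with the full Euclidean distance.
    """
    m = len(query)
    matches = []
    for seq_id, s in enumerate(data_sequences):
        # Sequences shorter than the query yield an empty range and are skipped.
        for offset in range(len(s) - m + 1):
            window = s[offset:offset + m]
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, query)))
            if d <= eps:
                matches.append((seq_id, offset))
    return matches
```

Every offset costs a full distance computation, which is the expense that index-based methods aim to avoid by first narrowing the search to a small candidate set.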


Thus, subsequence matching is a generalization of whole matching [5], [6], [7], [10]. In this paper, we focus on subsequence matching.

Faloutsos et al. [7] have proposed a novel solution for subsequence matching on query sequences of varying lengths (we call this solution FRM, after the authors' initials). In FRM, they use a sliding window of size ω starting from every possible offset in the data sequence. Then, they divide a query sequence into disjoint windows of size ω and retrieve similar subsequences by using those disjoint windows. They transform each sliding window to a point in a lower-dimensional space (we call this the lower-dimensional transformation) to avoid the high dimensionality problem [4], [12] in multidimensional indexes. Since too many points are generated to be stored individually in an index, they construct minimum bounding rectangles (MBRs) that contain multiple points, and then store those MBRs in a multidimensional index, the R*-tree [3]. For subsequence matching, they first use the index to find the MBRs that identify the subsequences potentially in ε-match with the query sequence; these subsequences are called candidates. They subsequently refine the result by accessing the database and selecting only those candidates that are in ε-match with the query sequence.

FRM incurs many false alarms (i.e., candidates that do not qualify) because it stores only MBRs rather than individual points, and this degrades performance. In this paper, we propose a new subsequence matching method, Dual Match (Duality-based subsequence Matching), that reduces false alarms and improves performance significantly. We use the dual approach of FRM in constructing windows (we simply call it duality); i.e., we divide data sequences into disjoint windows and a query sequence into sliding windows. By dividing the data sequences into disjoint windows rather than sliding windows, Dual Match drastically reduces the number of points to store, to 1/ω of that of FRM, and is thus able to store individual points instead of MBRs in the index. For subsequence matching, it first transforms the sliding windows of the query sequence into points, constructs range queries using these individual points and the user-specified tolerance ε, and then searches the index to get the candidates. By storing and searching individual points directly in the index, Dual Match reduces false alarms.
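The duality in window construction can be sketched in a few lines of Python. This is only an illustration of how the two schemes count windows: the helper names `sliding_windows` and `disjoint_windows`, the sequence lengths, and the window size are assumptions made here, and the lower-dimensional transformation and the index are omitted.

```python
def sliding_windows(seq, w):
    """All length-w windows, one starting at every possible offset."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def disjoint_windows(seq, w):
    """Non-overlapping length-w windows; a trailing partial window is dropped."""
    return [seq[i:i + w] for i in range(0, len(seq) - w + 1, w)]


data = list(range(1000))   # stand-in data sequence (illustrative values)
query = list(range(32))    # stand-in query sequence
w = 8                      # window size, the paper's omega

# FRM: sliding windows over the data, disjoint windows over the query.
frm_data_points = len(sliding_windows(data, w))
frm_query_windows = len(disjoint_windows(query, w))

# Dual Match: disjoint windows over the data, sliding windows over the query.
dual_data_points = len(disjoint_windows(data, w))
dual_query_windows = len(sliding_windows(query, w))

print(frm_data_points, dual_data_points)
print(frm_query_windows, dual_query_windows)
```

With these illustrative sizes, the number of data-side points falls from 993 to 125, close to the 1/ω ratio stated above, while the number of query windows grows from 4 to 25. The extra work moves to the query side, which is typically far shorter than the stored data.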

The rest of this paper is organized as follows. Section 2 describes related work. Section 3 explains the motivation of this research. Section 4 proposes Dual Match. Section 5 presents the results of performance evaluation. Section 6 concludes the paper.

Section snippets

Related work

We summarize in Table 1 the notation to be used throughout the paper. The symbols in Table 1 are self-explanatory and do not need further elaboration.

Motivation of the research

In this section, we explain the motivation of our approach. In similar sequence matching, the more false alarms occur, the more disk accesses and CPU operations for computing the Len(Q)-dimensional distance are incurred in the post-processing step. Thus, false alarms are the main cause of performance degradation.

We note that storing only MBRs instead of individual points is one of the main reasons for false alarms in FRM. We explain this point using Fig. 1. In Fig. 1, Pi(1⩽i⩽14) represents a

The concept

Dual Match divides data sequences into disjoint windows and the query sequence into sliding windows. This way, we are able to store and search individual points directly in the index without much storage overhead and improve disk and CPU performance.

We first define some terminology. Given a sequence S, a subsequence S[i2:j2] includes a subsequence S[i1:j1] if i2⩽i1 and j1⩽j2. When S is divided into fixed disjoint windows, we define the included windows for S[i:j] as those disjoint windows

Experimental data and environment

To demonstrate the effectiveness of Dual Match, we have performed extensive experiments using three types of data sets. A data set consists of a long data sequence and has the same effect as the one consisting of multiple data sequences. The first data set, a real stock data set used in FRM [7], consists of 329112 entries. We call this data set STOCK-DATA. The second data set, also used in FRM, contains random walk data

Conclusions

In this paper, we have proposed Dual Match, a new subsequence matching method based on duality in constructing windows. We have shown that Dual Match reduces false alarms and improves performance drastically compared with the previous method by Faloutsos et al. [7] (FRM in short). Dual Match divides data sequences into disjoint windows and the query sequence into sliding windows, and thus, is a dual approach of FRM, which divides data sequences into sliding windows and the query sequence into

Acknowledgements

We would like to thank Byoung-Yong Moon for helping in revising an earlier English version of this paper.

References (14)

  [1] R. Agrawal, C. Faloutsos, A. Swami, Efficient similarity search in sequence databases, Proceedings of the fourth...
  [2] R. Agrawal, K.-I. Lin, H. S. Sawhney, K. Shim, Fast similarity search in the presence of noise, scaling, and...
  [3] N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, The R*-tree: an efficient and robust access method for points and...
  [4] S. Berchtold, C. Bohm, H.-P. Kriegel, The pyramid-technique: towards breaking the curse of dimensionality, Proceedings...
  [5] K.-P. Chan, A.W.-C. Fu, Efficient time series matching by wavelets, Proceedings of the 15th IEEE International...
  [6] K.W. Chu, M.H. Wong, Fast time-series searching with scaling and shifting, Proceedings of the 15th ACM...
  [7] C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast subsequence matching in time-series databases, Proceedings of the...
There are more references available in the full text version of this article.

Recommended by Maurizio Lenzerini. This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Advanced Information Technology Research Center (AITrc).