Efficient time-series subsequence matching using duality in constructing windows

doi:10.1016/S0306-4379(01)00021-7

Information Systems

Volume 26, Issue 4, June 2001, Pages 279-293


Abstract

In this paper, we propose a new subsequence matching method, Dual Match. Dual Match exploits duality in constructing windows and significantly improves performance. Dual Match divides data sequences into disjoint windows and the query sequence into sliding windows, and thus is a dual approach of the one by Faloutsos et al. (Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, Washington, 1994, pp. 419–429.) (FRM in short), which divides data sequences into sliding windows and the query sequence into disjoint windows. FRM causes many false alarms (i.e., candidates that do not qualify) because, to save storage space for the index, it stores minimum bounding rectangles rather than the individual points representing windows. Dual Match solves this problem by directly storing points without incurring excessive storage overhead. Experimental results show that, in most cases, Dual Match provides a large improvement in both false alarms and performance over FRM given the same amount of storage space. In particular, for low selectivities (less than 10⁻⁴), Dual Match improves performance by up to 430-fold. On the other hand, for high selectivities (more than 10⁻²), it shows a minor degradation (less than 29%). For selectivities in between (10⁻⁴–10⁻²), Dual Match performs slightly better than FRM. Overall, these results indicate that our approach provides a new paradigm in subsequence matching that improves performance significantly in large database applications.

Introduction

A time-series is a sequence of real numbers, representing values at specific time points. Typical examples of time-series data include stock prices, exchange rates, and weather data. The time-series data stored in a database are called data sequences. Finding data sequences similar to a given query sequence in the database is called similar sequence matching [1], [7]. Owing to faster computing speed and larger storage devices, there have been a number of efforts to utilize the large amount of time-series data, and accordingly, similar sequence matching has become an important research topic in data mining [1], [6], [7], [8], [9].

Various similarity models have been studied in similar sequence matching [1], [2], [10]. In this paper, we use the similarity model based on the Euclidean distance [1], [5], [7], [11]. In this model, we say that two sequences X={X[1],…,X[n]} and Y={Y[1],…,Y[n]} of the same length n are similar if the Euclidean distance $D(X,Y) = \sqrt{\sum_{i=1}^{n} (X[i]-Y[i])^{2}}$ is less than or equal to the user-specified tolerance ε [1]. More specifically, we say that two sequences X and Y are in ε-match if D(X,Y) is less than or equal to ε.
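The ε-match definition can be written down directly. The following is a minimal Python sketch, assuming sequences are plain lists of numbers; the function names `euclidean_distance` and `in_eps_match` are illustrative and do not come from the paper.

```python
import math

def euclidean_distance(x, y):
    """D(X, Y) for two sequences of the same length n."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def in_eps_match(x, y, eps):
    """True when X and Y are in eps-match, i.e. D(X, Y) <= eps."""
    return euclidean_distance(x, y) <= eps
```

For instance, `euclidean_distance([0, 0], [3, 4])` evaluates to 5.0, so the two sequences are in ε-match for ε = 5 but not for ε = 4.9.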

Similar sequence matching can be classified into two categories [7]:

• Whole matching: Given N data sequences S₁,S₂,…,S_N, a query sequence Q, and the tolerance ε, we find those data sequences that are in ε-match with Q. Here, the data and query sequences must have the same length.
• Subsequence matching: Given N data sequences S₁,S₂,…,S_N of varying lengths, a query sequence Q, and the tolerance ε, we find all the sequences S_i, one or more subsequences of which are in ε-match with Q, and the offsets in S_i of those subsequences.

Thus, subsequence matching is a generalization of whole matching [5], [6], [7], [10]. In this paper, we focus on subsequence matching.
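To make the problem statement concrete, the sketch below answers a subsequence matching query by a plain sequential scan. It is only a baseline for illustration, not the method proposed in the paper, and it uses 0-based offsets; the name `subsequence_matches` is hypothetical.

```python
import math

def subsequence_matches(data_sequences, query, eps):
    """Return (i, offset) for every subsequence of data_sequences[i]
    that is in eps-match with query. Offsets are 0-based."""
    m = len(query)
    hits = []
    for i, s in enumerate(data_sequences):
        # Try every offset at which a subsequence of length m fits in s.
        for off in range(len(s) - m + 1):
            d = math.sqrt(sum((s[off + k] - query[k]) ** 2 for k in range(m)))
            if d <= eps:
                hits.append((i, off))
    return hits
```

When every data sequence has the same length as the query, only offset 0 is examined, which is exactly whole matching.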

Faloutsos et al. [7] have proposed a novel solution for subsequence matching on query sequences of varying lengths (we simply call this solution FRM by taking authors’ initials). In FRM, they use a sliding window of size ω starting from every possible offset in the data sequence. Then, they divide a query sequence into disjoint windows of size ω and retrieve similar subsequences by using those disjoint windows. They transform each sliding window to a point in a lower dimensional space (we call it lower-dimensional transformation) to avoid the high dimensionality problem [4], [12] in multidimensional indexes. Since too many points are generated to be stored individually in an index, they construct minimum bounding rectangles (MBRs) that contain multiple points, and then, store those MBRs into a multidimensional index, the R*-tree [3]. For subsequence matching, they first identify, using the index, those MBRs containing information to identify the subsequences, called candidates, that are potentially in ε-match with the query sequence. They subsequently refine the result by accessing the database and selecting only those subsequences that are in ε-match with the query sequence.
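The effect of storing MBRs instead of points can be seen in a small two-dimensional example. This is a hedged sketch with made-up points, not FRM's actual index code: `mbr` and `min_dist_to_mbr` are illustrative helpers, and a real implementation would query an R*-tree rather than a single rectangle.

```python
import math

def mbr(points):
    """Minimum bounding rectangle of the points, as (lows, highs)."""
    dims = range(len(points[0]))
    lows = tuple(min(p[d] for p in points) for d in dims)
    highs = tuple(max(p[d] for p in points) for d in dims)
    return lows, highs

def min_dist_to_mbr(q, rect):
    """Smallest Euclidean distance from point q to any location in rect."""
    lows, highs = rect
    return math.sqrt(sum(max(lo - x, 0.0, x - hi) ** 2
                         for x, lo, hi in zip(q, lows, highs)))

def point_dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two stored points and a query point chosen for illustration only.
points = [(0.0, 0.0), (4.0, 4.0)]
rect = mbr(points)
q, eps = (4.0, 0.0), 1.0

mbr_is_candidate = min_dist_to_mbr(q, rect) <= eps
any_point_qualifies = any(point_dist(p, q) <= eps for p in points)
```

Here the query point lies inside the rectangle, so the MBR is reported as a candidate, yet neither stored point is within ε of it. Every subsequence represented by that MBR would therefore have to be checked in the refinement step.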

FRM entails many false alarms (i.e., candidates that do not qualify) by storing only MBRs rather than individual points, and accordingly, degrades performance. In this paper, we propose a new subsequence matching method, Dual Match (Duality-based subsequence Matching), that reduces false alarms and improves performance significantly. We use the dual approach of FRM in constructing windows (we simply call it duality); i.e., we divide data sequences into disjoint windows and a query sequence into sliding windows. By dividing the data sequences into disjoint windows rather than sliding windows, Dual Match reduces the number of points to store drastically, to 1/ω of that of FRM, and thus, is able to store individual points instead of MBRs in the index. For subsequence matching, it first transforms the sliding windows of the query sequence into points, constructs range queries using these individual points and the user-specified tolerance ε, and then searches the index to get the candidates. By storing and searching individual points directly in the index, Dual Match reduces false alarms.
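The two ways of constructing windows, and the resulting difference in the number of points to store, can be sketched as follows. The helper names are illustrative, and trailing elements that do not fill a whole disjoint window are simply dropped here, which is an assumption of this sketch.

```python
def sliding_windows(seq, w):
    """Windows of size w starting at every possible offset."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def disjoint_windows(seq, w):
    """Non-overlapping windows of size w; an incomplete tail is dropped."""
    return [seq[i:i + w] for i in range(0, len(seq) - w + 1, w)]

# A data sequence of length 1000 and window size 10, chosen for illustration.
data = list(range(1000))
w = 10
n_sliding = len(sliding_windows(data, w))    # what FRM extracts from the data
n_disjoint = len(disjoint_windows(data, w))  # what Dual Match extracts
```

With these values the data sequence yields 991 sliding windows but only 100 disjoint windows, which is roughly the 1/ω reduction described above.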

The rest of this paper is organized as follows. Section 2 describes related work. Section 3 explains the motivation of this research. Section 4 proposes Dual Match. Section 5 presents the results of performance evaluation. Section 6 concludes the paper.


Related work

We summarize in Table 1 the notation to be used throughout the paper. The symbols in Table 1 are self-explanatory and do not need further elaboration.

Motivation of the research

In this section, we explain the motivation of our approach. In similar sequence matching, the more false alarms occur, the more disk accesses and CPU operations for computing the Len(Q)-dimensional distance are incurred in the post-processing step. Thus, false alarms are the main cause of performance degradation.
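The cost of false alarms shows up in the post-processing step, where each candidate must be fetched and compared against the full query. A minimal sketch of that step, assuming candidates are given as 0-based offsets into a single in-memory data sequence (the function name `refine` is hypothetical):

```python
import math

def refine(data, query, candidate_offsets, eps):
    """Split candidates into true matches and false alarms by computing
    the full Len(Q)-dimensional distance for each one."""
    m = len(query)
    matches, false_alarms = [], []
    for off in candidate_offsets:
        sub = data[off:off + m]
        if len(sub) < m:
            # Candidate runs past the end of the data; it cannot match.
            false_alarms.append(off)
            continue
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(sub, query)))
        if d <= eps:
            matches.append(off)
        else:
            false_alarms.append(off)
    return matches, false_alarms
```

Each entry that ends up in `false_alarms` corresponds to a data access and a distance computation that produced no result, so fewer candidates per true match means less wasted work.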

We note that storing only MBRs instead of individual points is one of the main reasons for false alarms in FRM. We explain this point using Fig. 1. In Fig. 1, $P_{i} (1⩽i⩽14)$ represents a

The concept

Dual Match divides data sequences into disjoint windows and the query sequence into sliding windows. This way, we are able to store and search individual points directly in the index without much storage overhead and improve disk and CPU performance.
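The candidate-generation idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: it compares raw windows with a linear scan instead of transforming them to lower-dimensional points and searching an index, it uses 0-based offsets, and it requires the query length to be at least 2ω−1 so that every subsequence of that length fully contains at least one disjoint window. The name `dual_match_candidates` is hypothetical.

```python
import math

def _dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dual_match_candidates(data, query, w, eps):
    """Offsets in data of subsequences that may be in eps-match with query."""
    n, m = len(data), len(query)
    if m < 2 * w - 1:
        raise ValueError("query must have length >= 2*w - 1")
    candidates = set()
    for start in range(0, n - w + 1, w):      # disjoint data windows
        data_window = data[start:start + w]
        for j in range(m - w + 1):            # sliding query windows
            if _dist(data_window, query[j:j + w]) <= eps:
                # Align query offset j with data offset start.
                off = start - j
                if 0 <= off <= n - m:
                    candidates.add(off)
    return sorted(candidates)
```

The filter never misses a true match, because if a whole subsequence is within ε of the query then each aligned pair of windows is also within ε. The returned offsets still need the refinement step, since a single matching window does not guarantee that the full subsequence matches.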

We first define some terminology. Given a sequence S, a subsequence $S[i_{2} : j_{2}]$ includes a subsequence $S[i_{1} : j_{1}]$ if i₁⩾i₂ and j₁⩽j₂. When S is divided into fixed disjoint windows, we define the included windows for $S[i : j]$ as those disjoint windows

Experimental data and environment

To prove the effectiveness of Dual Match, we have performed extensive experiments using three types of data sets. A data set consists of a long data sequence and has the same effect as the one consisting of multiple data sequences. The first data set, a real stock data set² used in FRM [7], consists of 329112 entries. We call this data set STOCK-DATA. The second data set, also used in FRM, contains random walk data

Conclusions

In this paper, we have proposed Dual Match, a new subsequence matching method based on duality in constructing windows. We have shown that Dual Match reduces false alarms and improves performance drastically compared with the previous method by Faloutsos et al. [7] (FRM in short). Dual Match divides data sequences into disjoint windows and the query sequence into sliding windows, and thus, is a dual approach of FRM, which divides data sequences into sliding windows and the query sequence into

Acknowledgements

We would like to thank Byoung-Yong Moon for helping in revising an earlier English version of this paper.

References (14)

R. Agrawal, C. Faloutsos, A. Swami, Efficient similarity search in sequence databases, Proceedings of the fourth...
R. Agrawal, K.-I. Lin, H. S. Sawhney, K. Shim, Fast similarity search in the presence of noise, scaling, and...
N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, The R*-tree: an efficient and robust access method for points and...
S. Berchtold, C. Böhm, H.-P. Kriegel, The pyramid-technique: towards breaking the curse of dimensionality, Proceedings...
K.-P. Chan, A.W.-C. Fu, Efficient time series matching by wavelets, Proceedings of the 15th IEEE International...
K.W. Chu, M.H. Wong, Fast time-series searching with scaling and shifting, Proceedings of the 15th ACM...
C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast subsequence matching in time-series databases, Proceedings of the...



^☆: Recommended by Maurizio Lenzerini. This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Advanced Information Technology Research Center (AITrc).
