Elsevier

Pattern Recognition

Volume 42, Issue 11, November 2009, Pages 2998-3014

A time series representation model for accurate and fast similarity detection

https://doi.org/10.1016/j.patcog.2009.03.030

Abstract

Similarity search and detection is a central problem in time series data processing and management. Most approaches to this problem have been developed around the notion of dynamic time warping, whereas several dimensionality reduction techniques have been proposed to improve the efficiency of similarity searches. Given the continuous growth in sources of time series data and the critical nature of the real-world applications that use such data, we believe there is a pressing demand for similarity detection in time series that is both accurate and fast. Our proposal is to define a concise yet feature-rich representation of time series, on which dynamic time warping can be applied for effective and efficient similarity detection. We present the Derivative time series Segment Approximation (DSA) representation model, which combines derivative estimation, segmentation and segment approximation in an original way to provide both high sensitivity in capturing the main trends of time series and data compression. We extensively compare DSA with state-of-the-art similarity methods and dimensionality reduction techniques in clustering and classification frameworks. Experimental evidence from effectiveness and efficiency tests on various datasets shows that DSA is well-suited to support both accurate and fast similarity detection.

Introduction

A time series is a sequence of (real) numeric values upon which a total order based on timestamps is defined. Time series are generally used to represent the temporal evolution of objects, hence enormous amounts of such data are naturally available from many sources across different domains, including speech recognition, medical and biological measurement, financial and market data analysis, telecommunication and telemetry, sensor networking, motion tracking, meteorology, and so on.

Most research on time series data management and knowledge discovery has been devoted to the similarity search and detection problem, which arises in many tasks such as indexing and query processing, change detection, frequent pattern mining, classification, and clustering. In this work we refer to clustering and classification as evaluation frameworks for similarity detection. In particular, we focus on the clustering task, as it is necessary when the data being organized are not associated with predefined categories, which is a very frequent situation in real-world application domains. Indeed, clustering of time series data has been attracting growing interest in several scenarios. For instance, in the biomedical domain, frequently posed problems include finding groups of genes with similar expression profiles across a number of experiments, organizing patients according to different healthy/disease conditions, and finding groups of similar functional activities of the human brain in response to a given stimulus. In the socio-economics domain, clustering energy/power consumption patterns can support fraud detection applications. Other challenging scenarios involve, for instance, seasonality patterns of retail data, personal income data, models of ecological dynamics, and multimedia data streams. A more exhaustive list of applications that call for time series clustering can be found in [23].

A common way to compare two time series is to “warp” the time axis in order to align the data points of the two series. The Dynamic Time Warping (DTW) algorithm has long been known in speech recognition [30], and has been shown to be an effective solution for measuring the distance between time series [1]. Unlike the Euclidean distance, DTW allows elastic shifting of a sequence to provide a better match with another sequence, so it can handle time series with local time shifting and different lengths.
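To make the contrast with the Euclidean distance concrete, the following is a minimal sketch of the textbook dynamic-programming formulation of DTW. It is not code from the paper: the function names `dtw_distance` and `euclidean_distance` are hypothetical, the local cost is assumed to be the squared difference between points, and no warping-window constraint or lower bounding is applied.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW between two numeric sequences (assumed squared-difference local cost).

    Runs in O(len(a) * len(b)) time; the sequences may have different lengths.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] matched again
                                     cost[i][j - 1],      # b[j-1] matched again
                                     cost[i - 1][j - 1])  # both sequences advance
    return math.sqrt(cost[n][m])


def euclidean_distance(a, b):
    """Point-by-point Euclidean distance; only defined for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs sequences of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# The same bump, shifted by one time step.
s1 = [0, 0, 1, 2, 1, 0, 0]
s2 = [0, 1, 2, 1, 0, 0, 0]
print(euclidean_distance(s1, s2))  # penalizes the shift
print(dtw_distance(s1, s2))        # absorbs the shift by warping the time axis
```

In this toy case the warping path absorbs the one-step shift entirely, while the Euclidean distance charges for every misaligned point. The quadratic cost of filling the matrix is one reason shorter representations of the series are attractive.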

Besides the similarity problem in time series, another issue concerns the high dimensionality that characterizes time series data in many application domains. To address this issue, various dimensionality reduction techniques have been proposed, following two main approaches in which a (continuous) time series is approximated with either a piecewise discontinuous function or a low-order continuous function.
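As an illustration of the piecewise approach, here is a minimal sketch of Piecewise Aggregate Approximation (PAA), one of the dimensionality reduction methods the paper later compares against. It follows the standard formulation (equal-width frames, each replaced by its mean) rather than any code from the paper; the function name `paa` and the parameter `n_segments` are hypothetical.

```python
def paa(series, n_segments):
    """Piecewise Aggregate Approximation: replace each frame by its mean.

    `n_segments` is the user-chosen length of the reduced representation.
    Frames are as equal in width as integer boundaries allow.
    """
    n = len(series)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    reduced = []
    for k in range(n_segments):
        start = k * n // n_segments
        end = (k + 1) * n // n_segments  # exclusive; always > start here
        frame = series[start:end]
        reduced.append(sum(frame) / len(frame))
    return reduced


print(paa([1, 2, 3, 4, 5, 6, 7, 8], 4))  # eight points reduced to four frame means
```

Note that the frame boundaries are fixed in advance and do not adapt to the shape of the series, and that the caller must choose `n_segments`.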

Dimensionality reduction methods are useful for modeling time series in a more compact form. However, while this can help to compare time series efficiently, dimensionality reduction methods may lose significant information about the main trends in a time series, which are essential to effective similarity detection. Indeed, in many real-world applications there is a growing interest in developing methods that can meet an emerging demand for both accurate and fast similarity detection. In this respect, we believe there are a number of special requirements that should be satisfied by any representation model to support accurate and fast similarity detection in time series, which are summarized as follows:

  • Time warping-awareness. Time series should be modeled into a form that can be naturally mapped to the time domain. This will make it feasible to benefit from using dynamic time warping for similarity detection.

  • Low complexity. Due to the high dimensionality of time series data, modeling time series should be performed with reasonably low complexity, ideally linear in the series length.

  • Sensitivity to relevant features. It is clearly desirable that time series approximation is able to preserve as much information in the original series as possible. For this purpose, approximating a time series should be accomplished in such a way that it tailors itself to the local features of the series, in order to capture the important trends of the series.

  • Absence of parameters. Most representation models and dimensionality reduction methods require the user to specify some input parameters, such as the number of coefficients or symbols. However, prior domain knowledge is often unavailable, and the sensitivity to input parameters can seriously affect the accuracy of the representation model or dimensionality reduction method.

In this paper, we present a time series representation model which is conceived to support accurate and fast similarity detection. This model is called derivative time series segment approximation, as it yields a concise yet feature-rich time series representation by combining the notions of derivative estimation, segmentation and segment approximation.

Our DSA involves a segmentation scheme that employs the paradigm based on a piecewise discontinuous function. However, in contrast to any other dimensionality reduction technique, the segmentation step is performed on the derivative version of the original time series, rather than directly on the raw time series. The derivative estimates represent a new feature space that enables the identification of the trends of the original series. Moreover, the final step of segment modeling allows for concisely fitting the detected trends in a low-dimensional, time warping-aware representation of the original time series. As our experiments show, the intuition underlying the DSA model proves highly effective in supporting accurate and fast similarity detection; indeed, DSA is able to fulfill all of the desiderata mentioned above:

  • DSA sequences can be compared by using DTW directly;

  • the derivative-based feature generation allows for representing a time series by focusing on the characteristic trends in the series;

  • the segmentation step in DSA has a computational complexity which is linear in the series length, and is adaptive with respect to the identified trends of the series;

  • the absence of mandatory input parameters in DSA addresses the unavailability of prior domain knowledge.

We conducted an extensive experimental evaluation of DSA within clustering and classification frameworks, by considering aspects of effectiveness as well as efficiency. This evaluation necessarily involved the prominent state-of-the-art methods for time series representation and dimensionality reduction. Experimental evidence has shown that DSA supports accurate and fast similarity detection, in terms of a number of results that are summarized in Section 5.4.

The rest of the paper is organized as follows. Section 2 discusses the state of the art for similarity search/detection and dimensionality reduction, and provides a first comparison between our proposal and the competing methods. Section 3 presents our DSA model in detail. Sections 4 and 5 describe the experimental methodology and the related results used to assess DSA and the competing methods on benchmark datasets. Section 6 presents an application of DSA to a real case study. Finally, Section 7 provides concluding remarks and some pointers to future research.


Related work

As mentioned in the Introduction, DTW is widely used to perform similarity search and detection in time series. Given any two sequences T1 and T2, DTW performs a non-linear mapping of one sequence to the other by minimizing the total distance between them. To do this, a (|T1|×|T2|)-matrix storing the squared Euclidean distances between the points of the two sequences is used to find an optimal warping path (i.e., a sequence of matrix elements) via a dynamic programming algorithm. Moreover, a number of

Derivative time series segment approximation

In this section we describe our derivative time series segment approximation model, which represents time series in a concise form designed to capture the significant variations in the time series profile. More precisely, a DSA sequence is the result of a transformation that applies to a time series and yields a shorter sequence of values approximating the segments identified in the derivative version of the original series.
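The three stages named above (derivative estimation, segmentation of the derivative series, and per-segment approximation) can be illustrated with the following sketch. It is not the authors' algorithm, since this excerpt does not specify the derivative estimator, the segmentation criterion, or the per-segment summary; all three are assumptions made here for illustration. The sketch assumes a central-difference derivative, a greedy single-pass segmentation that closes a segment when a derivative value departs from the running segment mean by more than a threshold `eps`, and a summary of each segment as a (mean derivative, last index) pair. The names `derivative`, `segment`, `dsa_sketch`, and `eps`, and the default choice of `eps`, are hypothetical.

```python
import statistics


def derivative(series):
    """Assumed estimator: central differences, one-sided at the two endpoints."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points")
    d = [series[1] - series[0]]
    d += [(series[i + 1] - series[i - 1]) / 2 for i in range(1, n - 1)]
    d.append(series[-1] - series[-2])
    return d


def segment(deriv, eps):
    """Assumed criterion: close the current segment when the next derivative
    value differs from the segment's running mean by more than eps.
    Single pass, so linear in the series length. Returns inclusive (start, end) pairs.
    """
    segments = []
    start, total = 0, deriv[0]
    for i in range(1, len(deriv)):
        running_mean = total / (i - start)
        if abs(deriv[i] - running_mean) > eps:
            segments.append((start, i - 1))
            start, total = i, deriv[i]
        else:
            total += deriv[i]
    segments.append((start, len(deriv) - 1))
    return segments


def dsa_sketch(series, eps=None):
    """Map a series to a shorter list of (mean derivative, last index) pairs."""
    d = derivative(series)
    if eps is None:
        # Hypothetical default: spread of the derivative values themselves.
        eps = statistics.pstdev(d)
    return [(sum(d[s:e + 1]) / (e - s + 1), e) for s, e in segment(d, eps)]


# A triangular profile: steady rise, a turning point, steady fall.
print(dsa_sketch([0, 1, 2, 3, 4, 3, 2, 1, 0], eps=0.5))
```

Each output pair keeps a position on the time axis, so the reduced sequence can still be aligned by a time-warping distance, and the segment boundaries follow the changes in trend rather than a fixed frame width.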

Experimental methodology

We devised an experimental evaluation to assess the ability of our DSA in supporting effective and efficient similarity detection within clustering and classification frameworks. We compared DSA against state-of-the-art methods for modeling and comparing time series data, which include LCSS, EDR, ERP, DTW, DDTW, and FTW as distance measures, and APCA, SAX, PAA, PLA, SD, Chebyshev, DWT, and DFT as dimensionality reduction methods. Since our DSA and the competing dimensionality reduction methods

Data description

We selected seven datasets, which come from various application domains and are characterized by different series profiles and dimensionality. Table 1(a) summarizes the characteristics of the datasets used in the evaluation, and Fig. 1 shows the shapes of sample representative instances in each dataset.

GunX comes from the video surveillance domain, whereas Trace data simulates signals representing instrumentation failures. In Cylinder-Bell-Funnel (CBF), each class is

Application: profiling of electricity company customers

We briefly present here a real case study on electricity customer profiling. This is part of our ongoing research on fraud detection in electricity customer data in the context of a research project subsidized by the ENEL Italian electricity power company.

Conclusion

In this paper we proposed DSA, a representation model to support accurate and fast similarity detection in time series. DSA is able to transform a time series into a compact yet feature-rich sequence by combining the notions of derivative estimation, segmentation and segment modeling. We experimentally evaluated DSA in clustering and classification frameworks, and compared it to state-of-the-art similarity measures and dimensionality reduction methods. Experiments conducted on various benchmark

Acknowledgments

This work was supported partly by an ENEL (Italian electricity power company) grant under the project “Eureka! An Idea for Energy – Profiling and Anomaly Detection in ENEL Low Voltage Customer Load Data” by Diego Labate, ENEL unit of Meter Devices Engineering (TER/IAM). We are grateful to the anonymous reviewers for their valuable suggestions which helped to improve the quality of this paper.

About the Author—FRANCESCO GULLO is currently a Ph.D. student in Computer and Systems Engineering at the University of Calabria, Italy. He graduated in Computer Engineering in 2005. His research topics fall into the areas of knowledge discovery in databases, web and semistructured data management, and spatio-temporal databases.

References (41)

  • L. Chen et al. Robust and fast similarity search for moving object trajectories.

  • S. Greco et al. Effective and efficient similarity search in time series.

  • F. Gullo et al. MSPtool: a versatile tool for mass spectrometry data preprocessing.

  • S.A. Imtiaz et al. Building multivariate models from compressed data. Industrial Engineering Chemistry Research and Development (2007).

  • A.K. Jain et al. Algorithms for Clustering Data (1988).

  • K.V. Kanth et al. Dimensionality reduction for similarity searching in dynamic databases.

  • E. Keogh et al. Dimensionality reduction for fast similarity search in large time series databases. Knowledge and Information Systems (2001).

  • E. Keogh et al. An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback.

  • E. Keogh et al. Scaling up dynamic time warping for data mining applications.

  • E. Keogh et al. Derivative dynamic time warping.

About the Author—GIOVANNI PONTI is currently a Ph.D. student in Computer and Systems Engineering at the University of Calabria, Italy. He graduated in Computer Engineering in 2005. His research activities are within the areas of knowledge discovery in databases, text mining, and spatio-temporal databases.

About the Author—ANDREA TAGARELLI is an assistant professor of Computer Science with the Department of Electronics, Computer and Systems Sciences, University of Calabria, Italy. He graduated in Computer Engineering in 2001, and obtained his Ph.D. in Computer and Systems Engineering in 2006. He was a visiting researcher at the Dept. of Computer Science & Engineering, University of Minnesota at Minneapolis, USA. His research interests include topics in knowledge discovery and text/data mining, information extraction, web and semistructured data management, spatio-temporal databases and applications in biomedicine. On these topics, he has coauthored journal articles, conference papers and book chapters and developed practical software tools. He has served as a reviewer and as a program committee member for leading journals and conferences in the fields of information systems, knowledge and data management, and artificial intelligence. He has been a SIAM member since 2008.

About the Author—SERGIO GRECO is a full professor of Computer Science at the Faculty of Engineering at the University of Calabria, chair of the Department of Electronics, Computer and Systems Sciences and associate researcher at the Institute of High Performance Computing and Networks of the Italian National Research Council. He was a researcher at CRAI, a research consortium of informatics. He was a visiting researcher at the Microelectronics and Computer Center (MCC) of Austin, Texas, and at the Computer Science Department of University of California at Los Angeles, USA. He has published more than 150 papers including 40 journal papers and about 80 papers published in the proceedings of international conferences. His primary research interests include database theory, logic programming, logic and deductive databases, nonmonotonic reasoning, data integration, web search engines, and mining and querying semistructured data. He is a member of the IEEE Computer Society and ACM and an associate editor of IEEE-TKDE.
