A time series representation model for accurate and fast similarity detection

doi:10.1016/j.patcog.2009.03.030

Pattern Recognition

Volume 42, Issue 11, November 2009, Pages 2998-3014

https://doi.org/10.1016/j.patcog.2009.03.030

Abstract

Similarity search and detection is a central problem in time series data processing and management. Most approaches to this problem have been developed around the notion of dynamic time warping, whereas several dimensionality reduction techniques have been proposed to improve the efficiency of similarity searches. Given the continuous growth in sources of time series data and the critical nature of the real-world applications that use such data, we believe there is a pressing demand for similarity detection in time series that is both accurate and fast. Our proposal is to define a concise yet feature-rich representation of time series, on which dynamic time warping can be applied for effective and efficient similarity detection. We present the Derivative time series Segment Approximation (DSA) representation model, which combines derivative estimation, segmentation and segment approximation in an original way to provide both high sensitivity in capturing the main trends of time series and data compression. We extensively compare DSA with state-of-the-art similarity methods and dimensionality reduction techniques in clustering and classification frameworks. Experimental evidence from effectiveness and efficiency tests on various datasets shows that DSA is well-suited to support both accurate and fast similarity detection.

Introduction

A time series is a sequence of (real) numeric values upon which a total order based on timestamps is defined. Time series are generally used to represent the temporal evolution of objects, hence enormous amounts of such data are naturally available from several sources of different domains, including speech recognition, medicine and biology measurement, financial and market data analysis, telecommunication and telemetry, sensor networking, motion tracking, meteorology, and so on.

Most research on time series data management and knowledge discovery has been devoted to the similarity search and detection problem, which arises in many tasks such as indexing and query processing, change detection, frequent pattern mining, classification, and clustering. In this work we refer to clustering and classification as evaluation frameworks for similarity detection. In particular, we focus on the clustering task, as it is necessary when the data being organized are not associated with predefined categories, which is a very frequent context in real-world application domains. Indeed, clustering of time series data has been attracting growing interest in several scenarios. For instance, in the biomedical domain, frequently posed problems include finding groups of genes with similar expression profiles across a number of experiments, organizing patients according to different healthy/disease conditions, and finding groups of similar functional activities of the human brain in response to a given stimulus. In the socio-economics domain, clustering energy/power consumption patterns can support fraud detection applications. Other challenging scenarios involve, for instance, seasonality patterns of retail data, personal income data, models of ecological dynamics, and multimedia data streams. A more exhaustive list of applications that call for time series clustering can be found in [23].

The common way to compare two time series is “warping” the time axis in order to achieve an alignment between the data points of the series. The Dynamic Time Warping (DTW) algorithm has long been known in speech recognition [30], and shown to be an effective solution for measuring the distance between time series [1]. Unlike the Euclidean distance, DTW allows elastic shifting of a sequence to provide a better match with another sequence, thus it can handle time series with local time shifting and different lengths.
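As an illustration of the warping idea described above, the following is a minimal sketch of DTW via dynamic programming. It assumes the squared difference as the pointwise cost and applies no warping-window constraint; the function name `dtw_distance` is hypothetical and the code is not taken from the paper.

```python
from math import inf


def dtw_distance(a, b):
    """Return the cumulative DTW cost between numeric sequences a and b.

    Assumption: the pointwise cost is the squared difference, and no
    warping-window constraint is applied.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because one point may be matched to several points of the other sequence, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero even though the two sequences have different lengths, which is the kind of local time shifting that the Euclidean distance cannot absorb.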

Besides the similarity problem in time series, another issue concerns the high dimensionality that characterizes time series data in many application domains. To address this issue, various dimensionality reduction techniques have been proposed, following two main approaches in which a (continuous) time series is approximated with either a piecewise discontinuous function or a low-order continuous function.
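To make the first of these two approaches concrete, here is a minimal sketch of a piecewise constant approximation, in the spirit of PAA, in which each segment of equal (or nearly equal) length is replaced by its mean. The function name `paa` and the parameter `n_segments` are assumptions made for illustration; the code is not taken from the paper.

```python
def paa(series, n_segments):
    """Approximate a series by the means of n_segments contiguous chunks.

    Assumption: chunk boundaries are obtained by integer division, so chunk
    lengths differ by at most one when len(series) is not a multiple of
    n_segments.
    """
    n = len(series)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    reduced = []
    for k in range(n_segments):
        start = k * n // n_segments
        end = (k + 1) * n // n_segments
        chunk = series[start:end]
        reduced.append(sum(chunk) / len(chunk))
    return reduced
```

For example, `paa([1, 2, 3, 4, 5, 6], 3)` yields three values, one per pair of consecutive points. Note that the number of segments must be supplied by the user.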

Dimensionality reduction methods are useful for modeling time series in a more compact form. However, while this can help to compare time series efficiently, dimensionality reduction methods may lose significant information about the main trends in a time series, which are essential to effective similarity detection. Indeed, in many real-world applications there is a growing interest in developing methods that can meet an emerging demand for both accurate and fast similarity detection. In this respect, we believe there are a number of special requirements that should be satisfied by any representation model to support accurate and fast similarity detection in time series, which are summarized as follows:

• Time warping-awareness. Time series should be modeled into a form that can be naturally mapped to the time domain. This will make it feasible to benefit from using dynamic time warping for similarity detection.
• Low complexity. Due to the high dimensionality of time series data, modeling time series should be performed maintaining a reasonably low complexity, which is possibly linear with the series length.
• Sensitivity to relevant features. It is clearly desirable that time series approximation is able to preserve as much information in the original series as possible. For this purpose, approximating a time series should be accomplished in such a way that it tailors itself to the local features of the series, in order to capture the important trends of the series.
• Absence of parameters. Most representation models and dimensionality reduction methods require the user to specify some input parameters, such as the number of coefficients or symbols. However, prior domain knowledge is often unavailable, and the sensitivity to input parameters can seriously affect the accuracy of the representation model or dimensionality reduction method.

In this paper, we present a time series representation model which is conceived to support accurate and fast similarity detection. This model is called derivative time series segment approximation (DSA), as it yields a concise yet feature-rich time series representation by combining the notions of derivative estimation, segmentation and segment approximation.

DSA involves a segmentation scheme that employs the paradigm based on a piecewise discontinuous function. However, in contrast to any other dimensionality reduction technique, the segmentation step is performed on the derivative version of the original time series, rather than directly on the raw time series. The derivative estimates represent a new feature space that enables the identification of the trends of the original series. Moreover, the final step of segment modeling allows for concisely fitting the detected trends in a low-dimensional, time warping-aware representation of the original time series. As our experiments show, the intuition underlying the DSA model proves very advantageous in supporting accurate and fast similarity detection; indeed, DSA is able to fulfill all of the desiderata mentioned above:

• DSA sequences can be compared by using DTW directly;
• the derivative-based feature generation allows for representing a time series by focusing on the characteristic trends in the series;
• the segmentation step in DSA has a computational complexity which is linear with the series length, and is adaptive with respect to the identified trends of the series;
• the absence of mandatory input parameters in DSA addresses the unavailability of prior domain knowledge.

We conducted an extensive experimental evaluation of DSA within clustering and classification frameworks, by considering aspects of effectiveness as well as efficiency. This evaluation necessarily involved the prominent state-of-the-art methods for time series representation and dimensionality reduction. Experimental evidence has shown that DSA supports accurate and fast similarity detection, in terms of a number of results that are summarized in Section 5.4.

The rest of the paper is organized as follows. Section 2 discusses the state-of-the-art for similarity search/detection and dimensionality reduction, and provides a first comparison between our proposal and the competing methods. Section 3 presents our DSA model in detail. Sections 4 and 5 describe the experimental methodology and the related results used to assess DSA and the competing methods on benchmark datasets. Section 6 presents an application of DSA on a real case study. Finally, Section 7 provides concluding remarks and some pointers to future research.


Related work

As mentioned in the Introduction, DTW is widely used to perform similarity search and detection in time series. Given any two sequences $T_{1}$ and $T_{2}$, DTW performs a non-linear mapping of one sequence to the other by minimizing the total distance between them. To do this, a ($|T_{1}| \times |T_{2}|$)-matrix storing the squared Euclidean distances between the points of the two sequences is used to find an optimal warping path (i.e., a sequence of matrix elements) via a dynamic programming algorithm. Moreover, a number of

Derivative time series segment approximation

In this section we describe our derivative time series segment approximation model, which represents time series in a concise form designed to capture the significant variations in the time series profile. More precisely, a DSA sequence is the result of a transformation that applies to a time series and yields a shorter sequence of values approximating the segments identified in the derivative version of the original series.
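The excerpt above does not spell out how the three steps are carried out, so the following is only a loose sketch of a derivative-then-segment-then-approximate pipeline under stated assumptions, not the authors' definition of DSA. The assumptions are: central differences as derivative estimates, a new segment whenever a derivative value departs from the running mean of the current segment by more than a threshold, and the segment mean paired with the segment end index as the approximation. All function names and the `eps` parameter are hypothetical, and the default threshold (the standard deviation of the derivative values) is an illustrative choice.

```python
def derivative_estimates(series):
    """Estimate the derivative with central differences (assumed scheme)."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points")
    deriv = [series[1] - series[0]]
    for i in range(1, n - 1):
        deriv.append((series[i + 1] - series[i - 1]) / 2.0)
    deriv.append(series[-1] - series[-2])
    return deriv


def segment_by_trend(deriv, eps):
    """Single linear pass: start a new segment when a value departs from the
    running mean of the current segment by more than eps (assumed rule)."""
    segments, start, total = [], 0, deriv[0]
    for i in range(1, len(deriv)):
        mean = total / (i - start)
        if abs(deriv[i] - mean) > eps:
            segments.append((start, i))
            start, total = i, deriv[i]
        else:
            total += deriv[i]
    segments.append((start, len(deriv)))
    return segments


def dsa_like(series, eps=None):
    """Return one (mean derivative, end index) pair per detected segment."""
    deriv = derivative_estimates(series)
    if eps is None:
        # Hypothetical data-driven default: standard deviation of the derivative.
        mu = sum(deriv) / len(deriv)
        eps = (sum((x - mu) ** 2 for x in deriv) / len(deriv)) ** 0.5
    segments = segment_by_trend(deriv, eps)
    return [(sum(deriv[s:e]) / (e - s), e) for s, e in segments]
```

On a triangular series such as `[0, 1, 2, 3, 4, 3, 2, 1, 0]` with `eps=0.5`, the sketch returns three pairs: a rising trend, a short flat turning point, and a falling trend. The output keeps a time reference for every value, so a warping-based comparison of two such sequences remains possible.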

Experimental methodology

We devised an experimental evaluation to assess the ability of our DSA in supporting effective and efficient similarity detection within clustering and classification frameworks. We compared DSA against state-of-the-art methods for modeling and comparing time series data, which include LCSS, EDR, ERP, DTW, DDTW, and FTW as distance measures, and APCA, SAX, PAA, PLA, SD, Chebyshev, DWT, and DFT as dimensionality reduction methods. Since our DSA and the competing dimensionality reduction methods

Data description

We selected seven datasets, which come from various application domains and are characterized by different series profiles and dimensionality. Table 1(a) summarizes the characteristics of the datasets used in the evaluation, and Fig. 1 shows the shapes of sample representative instances in each dataset.

GunX comes from the video surveillance domain, whereas Tracedata simulates signals representing instrumentation failures. In Cylinder-Bell-Funnel (CBF), each class is

Application: profiling of electricity company customers

We briefly present here a real case study on electricity customer profiling. This is part of our ongoing research on fraud detection in electricity customer data in the context of a research project subsidized by the ENEL Italian electricity power company.

Conclusion

In this paper we proposed DSA, a representation model to support accurate and fast similarity detection in time series. DSA is able to transform a time series into a compact yet feature-rich sequence by combining the notions of derivative estimation, segmentation and segment modeling. We experimentally evaluated DSA in clustering and classification frameworks, and compared it to state-of-the-art similarity measures and dimensionality reduction methods. Experiments conducted on various benchmark

Acknowledgments

This work was supported partly by an ENEL (Italian electricity power company) grant under the project “Eureka! An Idea for Energy – Profiling and Anomaly Detection in ENEL Low Voltage Customer Load Data” by Diego Labate, ENEL unit of Meter Devices Engineering (TER/IAM). We are grateful to the anonymous reviewers for their valuable suggestions which helped to improve the quality of this paper.

About the Author: FRANCESCO GULLO is currently a Ph.D. student in Computer and Systems Engineering at the University of Calabria, Italy. He graduated in Computer Engineering in 2005. His research topics fall into the areas of knowledge discovery in databases, web and semistructured data management, and spatio-temporal databases.

References (41)

L. Chen et al. On the marriage of Lp-norms and edit distance
E. Keogh. Exact indexing of dynamic time warping
E.F. Petricoin et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet (2002)
N.F. Thornhill et al. The impact of compression on data-driven process analyses. Process Control (2004)
D.J. Berndt et al. Using dynamic time warping to find patterns in time series
E.H. Bristol. Swinging door trending: adaptive trending recording
C.S. Burrus et al. Introduction to Wavelets and Wavelet Transforms: A Primer (1997)
Y. Cai et al. Indexing spatio-temporal trajectories with Chebyshev polynomials
K. Chakrabarti et al. Locally adaptive dimensionality reduction for indexing large time series databases. ACM Transactions on Database Systems (2002)
K. Chan et al. Efficient time series matching by wavelets
L. Chen et al. Robust and fast similarity search for moving object trajectories
S. Greco et al. Effective and efficient similarity search in time series
F. Gullo et al. MSPtool: a versatile tool for mass spectrometry data preprocessing
S.A. Imtiaz et al. Building multivariate models from compressed data. Industrial Engineering Chemistry Research and Development (2007)
A.K. Jain et al. Algorithms for Clustering Data (1988)
K.V. Kanth et al. Dimensionality reduction for similarity searching in dynamic databases
E. Keogh et al. Dimensionality reduction for fast similarity search in large time series databases. Knowledge and Information Systems (2001)
E. Keogh et al. An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback
E. Keogh et al. Scaling up dynamic time warping for data mining applications
E. Keogh et al. Derivative dynamic time warping


About the Author: GIOVANNI PONTI is currently a Ph.D. student in Computer and Systems Engineering at the University of Calabria, Italy. He graduated in Computer Engineering in 2005. His research activities are within the areas of knowledge discovery in databases, text mining, and spatio-temporal databases.

About the Author: ANDREA TAGARELLI is an assistant professor of Computer Science with the Department of Electronics, Computer and Systems Sciences, University of Calabria, Italy. He graduated in Computer Engineering in 2001, and obtained his Ph.D. in Computer and Systems Engineering in 2006. He was a visiting researcher at the Dept. of Computer Science & Engineering, University of Minnesota at Minneapolis, USA. His research interests include topics in knowledge discovery and text/data mining, information extraction, web and semistructured data management, spatio-temporal databases and applications in biomedicine. On these topics, he has coauthored journal articles, conference papers and book chapters and developed practical software tools. He has served as a reviewer and as a program committee member for leading journals and conferences in the fields of information systems, knowledge and data management, and artificial intelligence. He has been a SIAM member since 2008.

About the Author: SERGIO GRECO is a full professor of Computer Science at the Faculty of Engineering at the University of Calabria, chair of the Department of Electronics, Computer and Systems Sciences, and an associated researcher at the Institute of High Performance Computing and Networks of the Italian National Research Council. He was a researcher at CRAI, a research consortium of informatics. He was a visiting researcher at the Microelectronics and Computer Center (MCC) of Austin, Texas, and at the Computer Science Department of University of California at Los Angeles, USA. He has published more than 150 papers, including 40 journal papers and about 80 papers in the proceedings of international conferences. His primary research interests include database theory, logic programming, logic and deductive databases, nonmonotonic reasoning, data integration, web search engines, and mining and querying semistructured data. He is a member of the IEEE Computer Society and ACM, and an associate editor of IEEE-TKDE.
