Elsevier

Information Sciences

Volume 260, 1 March 2014, Pages 15-36

An approach to dimensionality reduction in time series

https://doi.org/10.1016/j.ins.2013.10.037

Abstract

Many methods for dimensionality reduction of data series (time series) have been introduced over the past decades. Some of them rely on a symbolic representation of the original data; however, the dimensionality reduction they achieve is not substantial. In this paper, we introduce a new approach, referred to as Symbolic Essential Attributes Approximation (SEAA), to reduce the dimensionality of multidimensional time series and thereby form a new nominal representation of the original data series. The approach is based on the concept of data series envelopes and on essential attributes generated by a multilayer neural network. The real-valued attributes are discretized, yielding a symbolic data series representation. SEAA generates a vector of nominal values of new attributes which forms the compressed representation of the original data series. The nominal attributes are synthetic and, while not directly interpretable, they retain important features of the original data series. The usefulness of the proposed dimensionality reduction is validated on classification and clustering tasks. The experiments show that even for a significant reduction of dimensionality, the new representation retains sufficient information for classification and clustering of the time series.

Introduction

The term “data series” is often used to refer to any data set with a single independent time variable. Nowadays, time series mining problems arise in many areas, such as medicine, finance, industry, and climate research. The majority of data series research focuses on the following data mining problems:

  • indexing (e.g. Keogh et al. [31]),

  • clustering (e.g. Keogh and Pazzani [33], Wu and Chang [79], Krawczak and Szkatuła [41], [43], [44], [45], [46], [47]),

  • classification (e.g. Nanopoulos et al. [58], Krawczak and Szkatuła [39], [40], [42], Wang [77]),

  • summarization (e.g. Lin et al. [53]), and

  • anomaly detection (e.g. Shahabi et al. [63]).

Due to the huge amount of data, different kinds of data series representations have been developed. In the literature one can encounter specialized algorithms including decision trees [62], neural networks [58], Bayesian classifiers [79], etc. Some representations are sufficiently general to be used in all of the above-mentioned problems, while others are considerably specialized and focused on individual applications. It is worth mentioning that there is an increasing interest in data series mining [70]; indeed, time series (data series) mining is considered one of the ten most challenging problems in data mining [16], [73].

The high dimensionality of data series renders many data mining methods ineffective and fragile [6]. In general, data mining methods require high computational overhead when applied to very large data sets. This obstacle is sometimes referred to as the “curse of dimensionality” [13]. Most data series mining problems therefore require dimensionality reduction and the formation of new data series representations. The new representation must preserve sufficient information to solve the underlying data series problem correctly. Dimensionality reduction (of either the number of data points or the number of records) can effectively reduce this computational overhead.

Our aim, therefore, is to propose a new representation of the original time series based on dimensionality reduction. In general, dimensionality reduction methods rely on attribute selection, attribute extraction, or record selection. Extracting attributes reduces the dimensionality of the data, i.e., it constitutes a form of lossy data compression. Note that lossy compression methods involve a trade-off between the compression rate and the retained information.

There are many approaches to dimensionality reduction and similarity search over data series in large databases [69]. In general, a data series of arbitrary length M can be reduced to another representation of length K, K < M. The simplest method is sampling [3], where the ratio M/K is the compression rate; however, the compressed series only roughly preserves the shape of the original. Piecewise approximation methods divide the data series into segments and approximate each segment by a function. Enhanced methods use the average value of each segment as the data point in the new, compressed representation. One such method, based on piecewise constant approximation, is known as Piecewise Aggregate Approximation (PAA): Yi and Faloutsos [75] and Keogh et al. [29] proposed dividing each data series into segments of equal length and using the average value of each segment to represent it. Keogh et al. [29], [30] also proposed an extended version, Adaptive Piecewise Constant Approximation (APCA), in which the segment lengths are not fixed. Statistics other than the average can also represent a segment: Lee et al. [51] proposed the Segmented Sum of Variation (SSV), while Ratanamahatana et al. [60] and Bagnall et al. [5] proposed bit-level approximations. There are also methods that approximate a time series by straight lines, for example linear interpolation [28], [34], [65] or linear regression [64]. Furthermore, preserving salient points is a promising direction, as in the Perceptually Important Points (PIP) introduced by Chung et al. [10].
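To make the PAA step concrete, a minimal Python sketch follows (our own illustration, not code from the cited papers): a series of length M is divided into K segments and each segment is replaced by its mean.

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: split the series into
    (nearly) equal-length segments and keep each segment's mean."""
    segments = np.array_split(np.asarray(series, dtype=float), n_segments)
    return np.array([seg.mean() for seg in segments])

# A series of length M = 8 reduced to K = 4 values:
paa([1, 1, 3, 3, 5, 5, 7, 7], 4)  # -> array([1., 3., 5., 7.])
```

The compression rate is M/K; the segment means preserve the coarse shape of the series while discarding within-segment variation.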

The idea of upper and lower envelopes of data series, introduced by Krawczak and Szkatuła [39], [40], is exploited in this work.

Representing data series in a transformed domain is another approach. Popular transformation techniques include the Discrete Fourier Transform (DFT) [14] and the Discrete Wavelet Transform (DWT) [7]. Principal Component Analysis (PCA) is a popular multivariate statistical technique [72], [76]. Other methods use Hidden Markov Models (HMMs) [4]. Many of these approaches are combined with various indexing methods.

An important feature of the majority of the above approaches is that they operate on real values; comparatively little attention has been paid to symbolic representations of data series. There is, however, a family of approaches that converts numerical time series into symbolic form. The simplest method divides the time series into segments and converts each segment into a symbol [71], [74].

One of the most competitive methods in the literature for dimensionality reduction of time series with a symbolic representation is the Symbolic Aggregate approXimation (SAX), cf. [54], which converts the result of PAA into a string of symbols. Two parameters must be specified for the conversion: the number of segments and the alphabet of symbols used. SAX preserves the general shape of the original time series. In general, the SAX method consists of two main parts: in the first, the time series is approximated by Piecewise Aggregate Approximation (PAA), based on piecewise constant approximation; in the second, this representation is converted into a sequence of symbols. The sequence corresponds to the original time series, see Fig. 1.

The SAX method provides dimensionality reduction not only in the first part but also in the second, by encoding each symbol in a small number of bits. For a fixed number of segments, e.g. 4, and a fixed alphabet of symbols, e.g. {a, b, c, d, e, f, g, h, i, j}, a time series can be represented as a word, e.g. baad or aade.
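The SAX conversion can be sketched as follows (an illustration, not the original implementation); here we assume a four-symbol alphabet, with breakpoints at the standard quantiles of N(0, 1) so that each PAA mean falls into an equiprobable region:

```python
import numpy as np

# Breakpoints dividing N(0, 1) into 4 equiprobable regions (standard SAX table).
BREAKPOINTS = [-0.67, 0.0, 0.67]
ALPHABET = "abcd"

def paa(series, n_segments):
    segments = np.array_split(np.asarray(series, dtype=float), n_segments)
    return np.array([seg.mean() for seg in segments])

def sax_word(series, n_segments):
    """Map each PAA mean to the symbol of the region it falls into."""
    return "".join(ALPHABET[np.searchsorted(BREAKPOINTS, m)]
                   for m in paa(series, n_segments))

# A z-normalized series becomes a short word over the alphabet:
sax_word([-1, -1, -1, -1, 1, 1, 1, 1], 2)  # -> "ad"
```

With an alphabet of size R, each symbol needs only ⌈log2 R⌉ bits, which is the second source of compression mentioned above.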

This study is motivated by the observation that combining several methods of data series dimensionality reduction is more effective than using a single one: instead of applying one method with an extensive loss of information, it is more efficient to merge several methods in which information is reduced gradually.

In this paper, we propose a new approach, referred to as Symbolic Essential Attributes Approximation (SEAA), for gradual reduction of the dimensionality of multidimensional data series. The approach allows a data series of arbitrary length M to be reduced to an arbitrary length K, where K ≪ M. For the symbolic representation of data series, we use an alphabet of finite size R ≥ 3. The approach differs from other methods known in the literature. In general, existing methods provide a compressed representation that preserves the time order of the original data series, i.e., the original series is replaced by a considerably ‘shorter’ one. Here, instead, we obtain a vector of nominal-valued attributes that represents the original data series and preserves the properties important for the original data series mining problems.

The change of data series representation brings about a significant dimensionality reduction. The proposed methodology consists of several steps, through which the dimensionality reduction is obtained and the compression ratio is determined. It must be emphasized that the data series is understood as a real-valued time series or pseudo-time series, while SEAA generates a vector of nominal-valued attributes.

The essence of the methodology is highlighted in Fig. 2.

Before proceeding with the dimensionality reduction, preliminary preprocessing of each data series must be done in order to obtain a normalized representation with mean equal to zero and standard deviation equal to one. Let us denote the original data series of arbitrary length M, indexed by n, as the vector [x1(n), x2(n), …, xM(n)], n = 1, …, N, where M stands for the dimensionality of the time series and N is the number of time series in the data set. Each step of the SEAA methodology is briefly described below. Different data series representations are introduced in order to reduce dimensionality gradually; as shown in this paper, combining different techniques in SEAA renders promising results.
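The preprocessing step is plain z-normalization; a minimal sketch (our illustration, with a guard for constant series that the text does not discuss):

```python
import numpy as np

def z_normalize(series):
    """Rescale a data series to mean 0 and standard deviation 1."""
    series = np.asarray(series, dtype=float)
    std = series.std()
    if std == 0:                      # constant series: map to all zeros
        return np.zeros_like(series)
    return (series - series.mean()) / std
```

Each of the N series [x1(n), …, xM(n)] is normalized independently before the envelope step.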

The approach is based on the use of a multilayer neural network as an auto-associative memory. Such a network consists of two modules: the first is responsible for encoding and the second for decoding. Both the inputs and the outputs of the network are envelopes, with hidden neurons in between. The outputs of the hidden neurons constitute the essential attributes, and the vector of these new attributes gives a new representation of the original data series. In neural network applications this approach is known as signal coding: the new representation of the envelopes is the encoded information described by the hidden-layer neurons.

The number of hidden neurons is significantly lower than the number of network inputs; in this way, the approach enables a significant dimensionality reduction of the envelope representation of the original data series. The essential attributes have no physical interpretation, but they still hold the most important features of the original data series (as will be shown in subsequent sections). Although the approach does not preserve the general shape of the original time series, the representation contains enough information for data mining tasks.
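The auto-associative network can be sketched as a small bottleneck autoencoder trained by plain gradient descent (a simplified illustration under our own assumptions; the paper's architecture, activation functions, and training procedure may differ). The hidden-layer outputs play the role of the essential attributes:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, n_hidden, epochs=3000, lr=0.1):
    """One-hidden-layer auto-associative network: encode M inputs into
    n_hidden units (tanh), then linearly decode back to M outputs."""
    n, m = X.shape
    W1 = rng.normal(0.0, 0.1, (m, n_hidden))   # encoder weights
    W2 = rng.normal(0.0, 0.1, (n_hidden, m))   # decoder weights
    for _ in range(epochs):
        H = np.tanh(X @ W1)                    # hidden layer = essential attributes
        E = H @ W2 - X                         # reconstruction error
        gW2 = H.T @ E / n                      # backprop: decoder gradient
        gW1 = X.T @ (E @ W2.T * (1 - H ** 2)) / n   # encoder gradient
        W1 -= lr * gW1
        W2 -= lr * gW2
    return W1, W2

def essential_attributes(X, W1):
    return np.tanh(X @ W1)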

The proposed methodology consists of several steps, each of which yields a considerable dimensionality reduction. The compression ratio at each step is determined experimentally and depends on the data considered. At the final step we obtain a new representation of the original data series, characterized by a vector of essential attributes, which reduces the dimensionality considerably. Such an approach requires validating whether the new representation preserves the main features of the original data series. To perform the validation we use a standard approach, namely solving data series classification and clustering problems. The results of these data mining experiments were then compared with results obtained by other approaches. The proposed methodology appears sufficiently universal to be applied to different data series mining problems.

The SEAA method differs considerably from other methods known in the literature; however, some similarity to the SAX method can be found. SAX consists of two parts, while SEAA consists of three, and the first and third parts of SEAA are similar to the first and second parts of SAX, see Fig. 1, Fig. 2.

In the first part of SEAA we generate envelopes (upper or lower), which are piecewise constant approximations through the topmost points (for the upper envelope) or the lowest points (for the lower envelope). In some sense, the envelopes bound the original data series: the upper envelope from above and the lower envelope from below. Such approximations emphasize changes in the original data series. This part of SEAA thus has some similarity to the first part of SAX, and in SEAA we can likewise use different piecewise constant approximations; it is, however, difficult to say which approximation is better, as much depends on the particular problem. Within the first part, therefore, the two methods are comparable. In the second part of SEAA we generate the vector of essential attributes; here further dimensionality reduction is obtained, and this part has no counterpart in SAX. In the third part of SEAA, the real-valued essential attributes are discretized into symbolic form, similarly to the second part of SAX. As a result, certain similarities between the two methods can be observed: both representations have the form of a word over a fixed symbol alphabet.
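A minimal sketch of the envelope construction described above (our illustration of the idea; [39], [40] give the exact definition): per segment, the topmost point yields the upper envelope and the lowest point the lower envelope.

```python
import numpy as np

def envelopes(series, n_segments):
    """Upper/lower envelopes: per segment, keep the maximum (upper)
    or minimum (lower) value as a piecewise constant approximation."""
    segments = np.array_split(np.asarray(series, dtype=float), n_segments)
    upper = np.array([seg.max() for seg in segments])
    lower = np.array([seg.min() for seg in segments])
    return upper, lower

# The upper envelope bounds the series from above, the lower from below:
envelopes([1, 3, 2, 4, 5, 0, 7, 6], 4)
# -> (array([3., 4., 5., 7.]), array([1., 2., 0., 6.]))
```

Unlike the PAA mean, the max/min per segment emphasizes local extremes, i.e., the changes in the series.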

The paper is organized as follows. In Section 2, we present the SEAA methodology. In Section 3, the approach is illustrated on a database available from the University of California, Irvine. Using the nominal-valued attributes as the data series representation, the approach is then verified on two data series mining problems, classification and clustering, which are among the most common data mining problems. We performed calculations on the compressed data in order to determine whether it still retains enough information for proper classification and clustering. In Section 4 we present several experiments demonstrating the efficiency of the proposed methodology.


Description of SEAA methodology

Let us consider the normalized (mean equal to zero, standard deviation equal to one) data series described in the following way:

[x1(n), x2(n), …, xM(n)],     (1)

where xk(n) ∈ R, k = 1, 2, …, M, n = 1, 2, …, N, while M denotes the dimensionality of the time series and N stands for the number of time series in the data set.

Details of the SEAA approach are presented in the forthcoming subsections.

Illustrative experiments

Practical presentation of the proposed approach to the reduction of dimensionality of data series described by (1) was carried out on the Synthetic Control Chart Time Series database available from the University of California, Irvine [1]. Control chart patterns have often been used in the testing of many different data mining techniques. The database consists of data series synthetically generated by defined equations, each equation representing a different type of pattern. There are

Experimental validation of dimensionality reduction

Practical verification of the quality of dimensionality reduction of data series can be done by analyzing particular data series mining tasks; the same applies when comparing dimensionality reduction performed by different methods. After the symbolic-representation preprocessing, calculations can be performed to verify whether the proposed methodology still retains the important features of the original data series.

Conclusions

In this paper, we introduced the SEAA approach to reducing the dimensionality of data series. The approach differs from algorithms known in the literature. The concept is based on the upper and lower envelopes and on the essential attributes of those envelopes. Both representations, related to the upper or the lower envelopes, allow a high dimensionality reduction of the original data series. Finally, the real values of the essential attributes were transformed and converted into nominal values.

References (79)

  • L. Chen, M.S. Kamel, Design of multiple classifier systems for time series data, in: Proceedings of 6th International...
  • L. Chen, M.T. Ozsu, V. Oria, Using multi-scale histograms to answer pattern existence and shape match queries, in:...
  • F.L. Chung, T.C. Fu, R. Luk, V. Ng, Flexible time series pattern matching based on perceptually important points, in:...
  • G. Cybenko, Approximations by superpositions of sigmoidal functions, Mathematics of Control, Signals, and Systems (1989)
  • G. Dreyfus, Neural Networks Methodology and Applications (2005)
  • J. Elder et al., A statistical perspective on knowledge discovery in databases
  • C. Faloutsos et al., Fast subsequence matching in time-series databases, SIGMOD Record (1994)
  • H. Frohlich et al., Feature selection for support vector machines by means of genetic algorithms, ICTAI (2003)
  • P. Geurts, Pattern extraction for time series classification, in: Proceedings of 5th European Conference on Principles...
  • M. Hall et al., Benchmarking attribute selection techniques for discrete class data mining, IEEE Transactions on Knowledge and Data Engineering (2003)
  • A. Inselberg, Parallel Coordinates: Visual Multidimensional Geometry and its Applications (2009)
  • I.T. Jolliffe, Principal Component Analysis (2002)
  • S.C. Johnson, Hierarchical clustering schemes, Psychometrika (1967)
  • J. Kacprzyk et al., An inductive learning algorithm with a preanalysis of data, International Journal of Knowledge-Based Intelligent Engineering Systems (1999)
  • J. Kacprzyk et al., An integer programming approach to inductive learning using genetic and greedy algorithms
  • J. Kacprzyk et al., A softened formulation of inductive learning and its use for coronary disease data, Lecture Notes in Artificial Intelligence (2005)
  • J. Kacprzyk et al., An inductive learning algorithm with a partial completeness and consistence via a modified set covering problem, Lecture Notes in Computer Science (2005)
  • J. Kacprzyk et al., Inductive learning: a combinatorial optimization
  • E. Keogh, Fast similarity search in the presence of longitudinal scaling in time series databases, in: Proceedings of...
  • E. Keogh et al., Dimensionality reduction for fast similarity search in large time series databases, Journal of Knowledge Information Systems (2000)
  • E. Keogh, K. Chakrabarti, S. Mehrotra, M. Pazzani, Locally adaptive dimensionality reduction for indexing large time...
  • E. Keogh, K. Chakrabarti, M. Pazzani, Locally adaptive dimensionality reduction for indexing large time series...
  • E. Keogh, S. Kasetty, On the need for time series data mining benchmarks: A survey and empirical demonstration, in:...
  • E. Keogh, M. Pazzani, Derivative dynamic time warping, in: Proceedings of the First SIAM International Conference on...
  • E. Keogh, P.A. Smyth, Probabilistic approach to fast pattern matching in time series databases, in: Proceedings of the...
  • R. Kohavi, D. Sommerfield, Feature subset selection using the wrapper method: overfitting and dynamic search space...
  • M. Krawczak, Multilayer Neural Systems and Generalized Net Models, Ac. Publ. House EXIT, Warsaw,...
  • M. Krawczak, Heuristic dynamic programming – learning as control problem