1 Introduction

Time series classification has attracted substantial attention in many applications, such as finance (e.g., stock markets), medical diagnosis (e.g., EEG and ECG), motion capture and speech recognition. Since the order of the series and the dependence between close time stamps are crucial for finding the most discriminative features and patterns, time series data are difficult for many algorithms to classify. Recently, a primitive named Shapelets has been generating increasing interest in time series classification (TSC). Shapelets are discriminative continuous snippets of full series that maximally distinguish among different classes. Hence, shapelets can be treated as representatives of a class, and time series classification reduces to detecting the presence or absence of certain shapelets in a time series.

Shapelets have attracted many researchers' interest for two main reasons. First, shapelets focus on local variation rather than the global variation used by traditional algorithms, which makes them more robust to noise and suitable for early time series prediction. Second, shapelets reveal the inherent attention mechanism of the data and thus allow for easier summarization and visualization, which provides explanatory insight into the classification problem.

Fig. 1. Demo of deformation. The left column shows a cosine curve; the right column shows the transformed cosine curve obtained by imposing shift, translation, scaling and stretching transformations and Gaussian noise on the original curve.

Despite the above advantages, current shapelets are somewhat rigid. We observe that samples in the same class always share a basic shape, while in practice the basic shape may be subject to individual differences and undergo various deformations, such as shift, translation, scaling and stretching. As shown in Fig. 1, the cosine curve is actually similar to the transformed cosine curve in terms of shape, despite individual noise and slight differences in phase (shift), amplitude (scaling), offset (translation) and uniform scaling (stretching).
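As a concrete illustration, the following NumPy sketch generates a cosine curve and a deformed version of it in the spirit of Fig. 1; all parameter values are illustrative choices, not those used to produce the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0 * np.pi, 200)
w = np.cos(t)                                # the original cosine curve

# Compound deformation: phase shift phi, amplitude scaling a, offset c,
# uniform time scaling mu, plus Gaussian noise (illustrative values only).
a, c, phi, mu, sigma = 1.5, 0.3, np.pi / 4.0, 1.25, 0.05
T = a * np.cos(mu * t + phi) + c + sigma * rng.normal(size=t.size)
```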

In this paper, we introduce a new concept, "shapelet prototypes", and propose a transform-invariant robust shapelet learning framework based on dictionary learning theory. The proposed framework learns representative basic shapes from the transformed subsequences by alternating iteration. Each iteration performs two steps: (a) in the alignment step, the best transformation operators are obtained automatically by minimizing the average least-squares error between the transformed subsequences and the current shapelet prototypes; (b) in the refinement step, the dictionaries are updated to reflect the basic shapes of the transformed training series. Figure 2 shows a real-world example from the GunPoint dataset. S1 and S2 are learned shapelet prototypes, T1–T3 are three time series from the Gun class, and T4–T6 are examples from the NoGun class. Here S1 represents the action of drawing a gun from a holster, and S2 represents returning the gun to the holster; these are critical patterns for distinguishing the two classes. Such actions are subject-dependent, however, and the corresponding patterns show high variety across samples due to individual factors. As seen in Fig. 2, our proposed method can align a probe sample to the learned prototypes and reveal the inherent knowledge of the data. Our contributions can be summarized as follows:

  • We propose a robust shapelet learning framework based on dictionary learning theory, which is invariant to various deformations;

  • The discovered shapelet prototypes capture intrinsic shapes that are more general and expressive;

  • Shapelet prototypes well preserve the basic shapes and hence have better interpretability.

Fig. 2. Illustration of two shapelet prototypes S1, S2 (leftmost plots) learned from the GunPoint dataset. The first row (T1, T2, T3) and second row (T4, T5, T6) are instances from the Gun class and NoGun class, respectively. Each sample is matched against the prototypes over all segments, and the corresponding segments (red and green) are projected onto the prototypes by the optimal transformation parameters. It can be observed that series in Gun match the prototypes better than series in NoGun. (Color figure online)

2 Related Work

Shapelets are discriminative shapes that can be used to classify time series data effectively. Shapelet learning algorithms make classification decisions based on the presence or absence of shapelets in a time series. Existing shapelet learning methods can be categorized into search-based algorithms [7, 8, 10, 13, 14] and learning-based algorithms [4, 12]. The original search-based algorithm adopts a brute-force strategy, selecting the most expressive subsequences from a large pool of candidate segments by various quality criteria, which is time-consuming [14]. To reduce the search time, several algorithms speed up the search by skipping similar segments so that the number of candidates is greatly reduced [7, 8, 10, 13]. Instead of learning the shapelets first, learning-based algorithms learn shapelets and classify time series simultaneously. Grabocka et al. [4] proposed a classification logistic loss function to jointly learn the shapelets and a logistic regression classifier through stochastic gradient descent. Search-based approaches choose existing subsequences from the training series, while learning-based approaches may not preserve the basic shape of shapelets and thus may lack interpretability, as they mainly consider classification ability. In other words, the above methods lack the ability to learn transform-invariant prototypes from training series.

Invariance to transformations is important in the time-series domain because sequences are easily distorted and often show high variety due to geometric transformations. For example, scaling (amplitude) and translation (offset) invariance can benefit seasonal market variations under inflation and motion capture [9], and shift invariance [15] is essential when time series share similar patterns in different phases. In addition, uniform scaling invariance is required for heartbeats measured at different sampling frequencies.

Dictionary learning has been proposed to learn a set of bases for compact representation. In dictionary learning frameworks, each sample T is approximated by a sparse linear combination of the learned bases \(\{D_k\}_{k=1}^K\):

$$\begin{aligned} T = \sum _{k=1}^K \alpha _k D_k + \epsilon \qquad \text {s.t.} \ \Vert \alpha \Vert _1 \le c_0, \end{aligned}$$
(1)

Such models typically employ the reconstruction error as the loss function, and the bases and representation coefficients are optimized by alternating iteration. Dictionary learning has proved feasible and desirable in image classification [6] and in signal reconstruction and representation [1, 11]. Furthermore, it has shown great power in scalable data mining and has strong interpretability and generalization capability [16].
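To make the sparse coding in Eq. (1) concrete, below is a minimal NumPy sketch that solves the L1-regularized reconstruction with the iterative soft-thresholding algorithm (ISTA). All names and parameter values are illustrative; this is one standard solver, not necessarily the one used in the cited works.

```python
import numpy as np

def soft_threshold(x, thr):
    """Proximal operator of the L1 norm (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

def sparse_code(T, D, lam=0.1, n_iter=200):
    """Solve min_alpha ||T - D @ alpha||^2 + lam * ||alpha||_1 by ISTA.

    T: (q,) sample; D: (q, K) dictionary with atoms as columns.
    """
    alpha = np.zeros(D.shape[1])
    L = 2.0 * np.linalg.norm(D, ord=2) ** 2   # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ alpha - T)
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha

# Usage: recover a sparse code for a noisy two-atom mixture.
rng = np.random.default_rng(0)
D = rng.normal(size=(30, 8))
T = 0.7 * D[:, 2] - 0.4 * D[:, 5] + 0.01 * rng.normal(size=30)
alpha = sparse_code(T, D, lam=0.05)
```

With the codes fixed, the bases are updated in the same alternating fashion, as in the refinement step of Sect. 4.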

3 Learning Transform Invariant Shapelet Prototypes

Formulations. In this section we adopt the concept of a "prototype" D, a representative shape learned from training samples. For a set of subsequences without transformation, the prototypes are usually computed in Euclidean space:

$$\begin{aligned} &\mathop {\arg \min }_{\alpha ^i \in R^K,\; D_k \in R^q} \sum _{i=1}^N \Big \Vert T^i - \sum _{k=1}^K \alpha _k^i D_k \Big \Vert ^2 + \lambda \sum _{i=1}^N \Vert \alpha ^i \Vert _1 \\ &\qquad \text {s.t.} \ \Vert D_k \Vert ^2 \le d_0, \quad \text {for} \ k = 1,\ldots ,K \end{aligned}$$
(2)

However, the Euclidean (L2) distance is inappropriate in many cases, since we usually care more about shape similarity: even a small scaling, translation or stretching operation can quickly swamp shape similarity as measured by Euclidean distance.

For subsequences with various transformations, we introduce the concept of a transformation operator \(\tau \), which may be a compound of multiple transformations. An intuitive observation is that each sample subsequence is transformed from the shared "prototype" dictionary bases D by a corresponding transformation operator:

$$\begin{aligned} \tau ^i(T^i) = \sum _{k=1}^K \alpha _k^i D_k + \epsilon ^i \qquad \text {s.t.} \ \Vert \alpha ^i \Vert _1 \le c_0, \end{aligned}$$
(3)

where each sample has a specific transformation operator \(\tau ^i\). The learned dictionary, with the effect of transformations removed, is therefore defined as the "prototypes" and can be estimated by minimizing the reconstruction error between the aligned sample subsequences and their reconstructions. The reformulation is as follows:

$$\begin{aligned} &\mathop {\arg \min }_{\alpha ^i \in R^K,\; D_k \in R^q,\; \tau ^i \in \omega } \sum _{i=1}^N \Big \Vert \tau ^i(T^i) - \sum _{k=1}^K \alpha _k^i D_k \Big \Vert ^2 + \lambda \sum _{i=1}^N \Vert \alpha ^i \Vert _1 \\ &\qquad \text {s.t.} \ \Vert D_k \Vert ^2 \le d_0, \quad \text {for} \ k = 1,\ldots ,K. \end{aligned}$$
(4)

Here the prototypes D, together with a transformation operator \(\tau ^i\) and codes \(\alpha ^i\) for each sample, need to be optimized.

Transform Definition. The proposed robust shapelet learning framework aims to find the intrinsic local patterns of time series. For a prototype \(w \in R^p\), a vector ordered by time stamps, most classical linear transformations can be represented as:

$$\begin{aligned} T^i = \tau ^i(w)+\epsilon _i = a_i w(\mu _i t)+c_i+\epsilon _i, \end{aligned}$$
(5)

where \(a_i\) is a scaling factor, \(c_i\) is a translation factor and \(\mu _i\) is a stretching factor. Here \(\tau ^i\) defines a general transformation, of which the scaling, translation and stretching operators are special cases:

  • Scaling operator: Scaling describes a difference in amplitude between w(t) and the transformed sequence \(T_i(t)\):

    $$\begin{aligned} T_i(t) = a_i w(t) \quad t = 1,\ldots ,p \end{aligned}$$
    (6)
  • Translation operator: Translation describes an offset along the y axis from w(t) to the transformed sequence \(T_i(t)\):

    $$\begin{aligned} T_i(t) = w(t)+c_i \quad t = 1, \ldots , p \end{aligned}$$
    (7)
  • Stretch operator [3]: Uniform scaling is used to match sequences of different lengths. Subsequences of different lengths may require stretching or shrinking due to different tempos or sampling frequencies:

    $$\begin{aligned} T_i(t) = w(\left\lceil \frac{t}{\mu } \right\rceil ) \quad t = 1,\ldots ,\left\lceil \mu p \right\rceil \end{aligned}$$
    (8)

Note that the stretching operator defined above is motivated by the uniform scaling operation via the ceiling function [3].
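As a concrete reference, the following NumPy sketch implements the three operators and the compound transformation of Eq. (5) under the ceiling-rule convention of Eq. (8); the function names are illustrative.

```python
import numpy as np

def scale(w, a):
    """Scaling operator, Eq. (6): amplitude change."""
    return a * w

def translate(w, c):
    """Translation operator, Eq. (7): offset along the y axis."""
    return w + c

def stretch(w, mu):
    """Stretch operator, Eq. (8): uniform scaling via the ceiling rule,
    mapping a length-p sequence to length ceil(mu * p)."""
    p = len(w)
    t = np.arange(1, int(np.ceil(mu * p)) + 1)                # 1-based time stamps
    idx = np.clip(np.ceil(t / mu).astype(int) - 1, 0, p - 1)  # 0-based indices
    return w[idx]

def tau(w, a=1.0, mu=1.0, c=0.0):
    """Compound operator of Eq. (5): a * w(mu t) + c."""
    return a * stretch(w, mu) + c

# Example: stretching a length-4 prototype by mu = 2 repeats each sample twice.
w = np.array([0.0, 1.0, 0.0, -1.0])
assert np.array_equal(stretch(w, 2.0), np.repeat(w, 2))
```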

Figure 3 illustrates the transformations defined above. The inverse operators can be defined analogously.

Fig. 3. Difference between the transformed subsequence (red) and the original subsequence (blue): scaling operator \(a = 1.5\) (leftmost); translation operator \(c = 0.2\) (second column); stretch operator \(\mu = 2\) (third column); multiple transformations on the GunPoint dataset (rightmost). (Color figure online)

4 Model Inference

The objective function is non-convex with respect to the prototypes and the transformation operators. We adopt a coordinate descent approach that alternates between two steps in each iteration, as shown in Algorithm 1: (i) Alignment step: given the current dictionary, update the transformation operator of each sample subsequence. (ii) Refinement step: given all transformation operators, update the dictionary to reflect the basic shapes learned from the transformed training series.
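The overall loop can be sketched as follows (illustrative NumPy; `sparse_code` is the ISTA sketch from Sect. 2, while `align` and `dict_update` are sketched under the two steps below; the bootstrap resampling used for initialization is our own simplification, not part of the paper's algorithm).

```python
import numpy as np

def learn_shapelet_prototypes(segments, D, lam=0.1, n_iter=20):
    """Coordinate-descent skeleton for Eq. (4); all names are illustrative.

    segments: list of raw subsequences T^i (possibly unequal lengths);
    D: (q, K) initial dictionary with atoms as columns.
    """
    q, K = D.shape
    N = len(segments)
    alphas = np.zeros((N, K))
    # Bootstrap codes from crudely resampled (identity-aligned) segments.
    for i, T in enumerate(segments):
        alphas[i] = sparse_code(T[np.linspace(0, len(T) - 1, q).astype(int)], D, lam)
    for _ in range(n_iter):
        aligned = np.empty((N, q))
        # (i) Alignment step: best transformation per sample (Eq. (9)).
        for i, T in enumerate(segments):
            (a, mu, c), _ = align(T, D @ alphas[i])
            idx = np.clip(np.ceil(mu * np.arange(1, q + 1)).astype(int) - 1,
                          0, len(T) - 1)
            aligned[i] = a * T[idx] + c              # tau^i(T^i)
        # (ii) Refinement step: re-fit codes and dictionary (Eq. (10)).
        for i in range(N):
            alphas[i] = sparse_code(aligned[i], D, lam)
        D = dict_update(aligned, alphas, D)
    return D, alphas
```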

Alignment Step. With the dictionary D and sparse codes \(\alpha \) fixed, the reconstruction sequence is determined. We then seek the optimal alignment (deformation) between the original series \(T^i\) and the reconstruction sequence. To learn the transform parameters, we minimize the mean squared error between the reconstruction sequence and the subsequence of the time series over all admissible transformations. Note that an input segment \(T^i_j\) with stretching factor \(\mu \) has a length different from that of the dictionary atoms, so the inverse operation \(\mu ^{-1}\) must be applied. For a given sparse code and dictionary, the problem can be solved as:

$$\begin{aligned} \mathop {\arg \min }_{\tau ^i} \Big \Vert \tau ^i(T^i) - \sum _{k=1}^K \alpha _k^i D_k \Big \Vert ^2 \end{aligned}$$
(9)

where \(\tau = \{a,\mu ,c \}\); superscripts and subscripts are dropped for clarity. Here a and c are scaling and translation factors taking continuous values, while \(\mu \) is a discrete stretching factor. With \(\mu \) fixed, the optimal a and c can be derived by least squares; a grid search is then conducted over all possible values of \(\mu \). With the obtained parameters, the original subsequence at the transformed location is extracted, yielding the aligned sequence \(S = \tau (T) \in R^q\).
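A minimal sketch of this step follows, assuming the ceiling-rule stretch of Eq. (8) and an approximate inverse resampling (our own simplification); the grid of stretching factors is illustrative.

```python
import numpy as np

def align(T, r, mus=(0.5, 0.75, 1.0, 1.5, 2.0)):
    """Alignment step (Eq. (9)) sketch: find tau = (a, mu, c) minimizing
    ||a * T(mu^{-1} t) + c - r||^2 for a reconstruction r of length q."""
    q = len(r)
    best = None
    for mu in mus:
        # Approximate inverse stretch: sample T at the positions a
        # mu-stretched length-q prototype would occupy.
        idx = np.clip(np.ceil(mu * np.arange(1, q + 1)).astype(int) - 1,
                      0, len(T) - 1)
        x = T[idx]
        # Closed-form least squares for a and c in ||a*x + c - r||^2.
        A = np.stack([x, np.ones(q)], axis=1)
        (a, c), *_ = np.linalg.lstsq(A, r, rcond=None)
        err = float(np.sum((a * x + c - r) ** 2))
        if best is None or err < best[0]:
            best = (err, a, mu, c)
    err, a, mu, c = best
    return (a, mu, c), err
```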

Algorithm 1. Transform-invariant shapelet prototype learning (alternating alignment and refinement steps).

Refinement Step. Fixing the transform parameters \(\tau = \{a,\mu ,c\}\), we optimize \(\alpha \) and D. Once the transform parameters are known, the updated segments \(\{\tau ^i(T^i)\}_{i=1}^N\), all transformed to a fixed length, are readily obtained. The problem then reduces to a traditional dictionary learning problem:

$$\begin{aligned} &\mathop {\arg \min }_{\alpha , D} \sum _{i=1}^N \Big \Vert \tau ^i(T^i) - \sum _{k=1}^K \alpha _k^i D_k \Big \Vert ^2 + \lambda \sum _{i=1}^N \Vert \alpha ^i \Vert _1 \\ &\qquad \text {s.t.} \ \Vert D_k \Vert ^2 \le d_0, \quad \text {for} \ k = 1,\ldots ,K \end{aligned}$$
(10)

Coordinate descent is used to solve for the dictionary D and the sparse codes \(\alpha \).
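A minimal sketch of the dictionary half of this coordinate descent, assuming aligned fixed-length segments (the codes are re-fit with the sparse coder from Sect. 2; names and the projection rule are illustrative):

```python
import numpy as np

def dict_update(X, alphas, D, d0=1.0):
    """Refinement step (Eq. (10)) sketch: block coordinate descent on the
    dictionary atoms with the codes held fixed.

    X: (N, q) aligned segments tau^i(T^i); alphas: (N, K) sparse codes;
    D: (q, K) dictionary. Atoms are projected back onto ||D_k||^2 <= d0.
    """
    R = X.T - D @ alphas.T                      # (q, N) residual matrix
    for k in range(D.shape[1]):
        a_k = alphas[:, k]
        denom = a_k @ a_k
        if denom < 1e-12:                       # atom unused; leave as-is
            continue
        R += np.outer(D[:, k], a_k)             # add back atom k's contribution
        D[:, k] = R @ a_k / denom               # least-squares optimal atom
        norm2 = D[:, k] @ D[:, k]
        if norm2 > d0:                          # enforce the norm constraint
            D[:, k] *= np.sqrt(d0 / norm2)
        R -= np.outer(D[:, k], a_k)             # remove updated contribution
    return D
```

Alternating this update with sparse coding of the aligned segments completes the refinement step.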

5 Experiment

5.1 Experiment Setting

Since the number of shapelet candidates is enormous, we computed the average sequence and assessed discriminative ability by information gain. Shapes with low discriminative power were removed, so the pool of candidates became much smaller (similar, redundant candidates were discarded) and only valuable candidates were selected as input segments. In the training phase, the algorithm was initialized by subsequence matching: each initial candidate was aligned to all possible segments and the most similar projection was extracted. These segments \(T_i\) were then fed into our transform-invariant prototype learning framework, and after several iterations a transform-invariant dictionary was derived. Finally, in the testing phase, the sparse representation was computed on the learned transform-invariant dictionary.

The candidate selection process requires tuning several hyper-parameters, which were found through a grid search with cross-validation over the training data. The initial number of shapes was searched in the range \( K_1 \in \{0.05, 0.1\}*Q\) and the length of shapelets in \(q \in \{0.125,0.25,0.375,0.5\}*Q\), where Q is the full series length. For the dictionary learning process, the number of atoms was \(K_2 \in \{5,10,15\}\) and the sparsity parameter \(\lambda \in \{0.1,1\}\). For efficiency, the set of deformation operations applied can be selected flexibly.
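The search space above can be written down directly; the sketch below (illustrative Python, with the cross-validation call left abstract) enumerates every setting.

```python
from itertools import product

# Hyper-parameter grid from the text (Q is the full series length).
grid = {
    "K1_frac": [0.05, 0.1],                  # initial number of shapes / Q
    "q_frac":  [0.125, 0.25, 0.375, 0.5],    # shapelet length / Q
    "K2":      [5, 10, 15],                  # number of dictionary atoms
    "lam":     [0.1, 1.0],                   # sparsity parameter lambda
}

for K1_frac, q_frac, K2, lam in product(*grid.values()):
    # Cross-validate each setting on the training split and keep the best.
    pass
```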

Table 1. Statistics of the benchmark time series datasets

5.2 Complexity Analysis

The time complexity of shapelet prototype learning consists of two parts: initialization segment extraction and transform-invariant dictionary learning. For initialization segment extraction, we need to compute a robust shape similarity (rather than optimal transformation operators) between length-q shapelet candidates and all possible segments. We therefore adopt a Z-normalized distance over the whole range of stretching factors, at a cost of O(Cq), where C is the number of stretching factors; this matches the cost of a Euclidean distance computation up to a constant factor.
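A sketch of this O(Cq) comparison, assuming the same ceiling-rule resampling as the stretch operator (the exact matching rule in the implementation may differ):

```python
import numpy as np

def znorm(x):
    """Z-normalize a sequence to zero mean and unit variance."""
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def init_similarity(cand, segment, mus=(0.5, 0.75, 1.0, 1.5, 2.0)):
    """Z-normalized distance between a length-q candidate and a segment,
    minimized over C stretching factors: O(Cq) per comparison."""
    q = len(cand)
    z_cand = znorm(cand)
    best = np.inf
    for mu in mus:
        idx = np.clip(np.ceil(mu * np.arange(1, q + 1)).astype(int) - 1,
                      0, len(segment) - 1)
        best = min(best, float(np.sum((z_cand - znorm(segment[idx])) ** 2)))
    return best
```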

For transform-invariant dictionary learning, the costly part of the alignment step is computing \(\tau ^i\), which takes O(CNq), covering the least-squares solution for the scaling and translation factors and the grid search over stretching factors. The refinement step takes O(KNq), covering sparse coding and the dictionary update. Therefore, the total complexity of one call to Algorithm 1 is \(O(M(C+K)Nq)\), where M is the maximum number of iterations allowed in Algorithm 1. M can be set quite small; in practice, 20 iterations are sufficient.

5.3 Classification Accuracy

Experiments were performed on 16 commonly used UCR time series benchmark datasets, which can be downloaded from the UCR website [2]; their statistics are listed in Table 1. For a fair comparison, we used the same training/testing split for all methods. In the experiments, SVM was chosen as the classifier.

Many shapelet learning algorithms have been proposed over the last decade; Ref. [5] compared the performance of several popular ones. In this work, we compared our algorithm with three state-of-the-art baselines, IGSVM, LTS and FLAG:

  • IGSVM [7]: a shapelet-transform algorithm that uses a linear SVM as the classifier and information gain as the shapelet quality measure.

  • LTS [4]: the learning time-series shapelets algorithm, which learns the shapelets and a logistic classifier automatically and jointly.

  • FLAG [5]: learns the positions of shapelets, efficiently maximizing the ratio of projected data variances between classes under fused lasso constraints.

Table 2. Classification accuracy on 16 commonly used datasets.

Table 2 shows the classification accuracy of the baselines and the proposed method. Our method is superior to IGSVM and FLAG and comparable to LTS, with even better accuracy on several datasets. In addition, dictionary learning is desirable for scalable analysis and offers interpretability.

5.4 Exploratory Data Analysis

One of the strengths of shapelets as a classification tool is that they provide an interpretation and summarization of the data that other classification approaches simply do not, which helps in mining the inherent structure. A key motivation of our work is to capture the intrinsic structure and knowledge behind the data in an interpretable and unified way.

To verify the power of transform-invariant shapelet learning, we briefly analyze a classical problem in the time series data mining domain. The Gun/NoGun motion capture dataset contains 100 instances of each class. In the Gun class, the actors start with their hands by their sides, draw a gun from a hip-mounted holster, point it at a target for approximately one second, and then return the gun to the holster and their hands to their sides. In contrast, in the NoGun class, the actors perform the same hands-down, point, hold and return motion without a gun, pointing at the target with the index finger instead. The classification problem is to distinguish between these two very similar actions. Moreover, the dataset contains instances from two actors, who differ in height by about 12 in. (translation) and in motion 'style', including movement range (scaling) and tempo (stretching). In summary, the challenges lie in the similarity between the two actions and the intra-class diversity caused by geometric transformations.

As shown in Fig. 4, the original top shapelet found by [14] represents an "overshoot" phenomenon at the end of the series, corresponding to the arm being lowered back into position. At the beginning of the series, however, the Gun class has a specific shape, found by the proposed method, that precedes the shared action of raising the arm: an actor in the Gun class must first draw the gun from the holster. Intuitively, this is an intrinsic feature of the Gun class that distinguishes it from the NoGun class. Owing to the intra-class diversity, however, instances differ in offset, scale and stretching factor; DTW and Z-normalization cannot handle such variance and therefore fail to exploit the latent discriminative power of S1 under this complex transformation. Our transform-invariant framework, in contrast, achieves meaningful detection of shapelets: the graphs in Fig. 4 show the best alignment for S1, obtained by projecting it through the optimal deformation factors.

Therefore, the shapelets found by the proposed method reflect the essential difference between the classes. Interestingly, as soon as an instance draws the gun from the holster, an early judgment or prediction can be made for the Gun/NoGun classification problem.

Fig. 4. An illustration of the complex transformation in the Gun class, where TS1 represents the optimal projection (transformation) of S1 onto the samples.