1 Introduction

Obtaining an effective and parsimonious representation of a data stream is useful for many purposes. An obvious example is that of human interpretation, where the dimensionality must be reduced to that which is possible to visualize (2D or 3D), for example to facilitate monitoring and visual analysis of a system. In medical, stock-market, or sensory applications, qualitatively different states can be identified via particular behaviour (e.g. as in [1] for sensor data). Classification and regression tasks can be made simpler and more efficient with an appropriately high-level feature representation. In regime-switching time series models, each label may be associated with a different set of parameters for auto-regressive prediction [2]. In reinforcement learning, it is highly preferable to encode continuous state/action spaces such that the value function can be approximated efficiently and estimates can be represented in tabular form [3]. An introduction and survey can be found in [4].

Traditional data-stream methods focus on a test-then-train routine (e.g. [5,6,7]) where a label is predicted for each example, and then the true label is used to update the model. However, many real-world problems involve limited labelled data, and even in cases where labels are available, they may only become available after a delay. Hence, we study the particular data-stream case where a representation must be obtained before labels become available. This allows application in areas where labelling is sparse or available only with some delay.

Note that data-stream methodology differs from the standard time series task, where the signal itself is of primary interest, either for forecasting or for applying a single label to the entire series. For example, in data-stream classification, applying class labels is of interest, but predictive power is assumed to come from a separate input stream. Furthermore, in the data-stream scenario, only a single filtering pass may be carried out on the data: each instance must be dealt with exactly when it arrives. Temporal dependence in data streams has only relatively recently been considered [8].

Existing approaches for obtaining new representations of data include statistical methods such as PCA, neural-network-based models such as auto-encoders (see [4] for a review), and hidden Markov models (e.g. [9]), particularly for sequential data, as well as other probabilistic approaches [10]. All these approaches obtain a hidden (or latent) layer of variables as a high-level representation of the data. In many cases, this representation may be obtained in an unsupervised fashion (again, see the aforementioned references). However, there is a lack of appropriate methods for doing so in stream environments, except for particular cases such as concept drift [11].

In this paper, we propose an approach in which, rather than using iterative algorithms such as EM or gradient descent to tune a hidden-layer representation as in HMMs or multi-layer neural networks, we model the error space of linear forecasting models to obtain a representation. This approach provides some important advantages. It:

  1. Is unsupervised (no labelled data is required to form the representation)

  2. Learns from a small amount of unlabelled training data

  3. Can perform in a data-stream scenario

  4. Deals naturally with interdependence across time and among the dimensions of the data series

  5. Naturally handles multi-dimensional and multi-type data, e.g. a mix of continuous and binary/nominal data on different scales

Our main contribution is the development and analysis of a framework, ESRTD: Error-Space Representations in Temporally dependent Data. Essentially, we leverage the error space of auto-regressive forecasting models to obtain an effective representation of the underlying concepts. We compare our proposed method empirically to existing methods on synthetic and real-world data, in particular evaluating the effectiveness of the feature representations for supervised classification (noting that supervision is used only for evaluation). Our method performs competitively. We also analyse the method from a theoretical perspective, to unravel the factors behind its performance.

We consider the realistic scenario where a relatively small amount of data is available initially, followed by a rapidly moving data-stream. Such a scenario is relevant to a number of modern contexts, such as data streams, sensor networks, and the internet of things.

2 ESRTD: a framework of error-space representations in temporally dependent streams

When employed over a multi-dimensional stream \({\mathbf {x}}_t | t=1,2,\ldots \) where \({\mathbf {x}}_t \in {\mathbb {R}}^D\), our method, in its simplest formulation, requires K multi-output models \(h_1,\ldots ,h_K\), each able to map \(h_k : {\mathbb {R}}^{\tau \times D} \rightarrow {\mathbb {R}}^{D}\), and some error function \(E_k : {\mathbb {R}}^D \times {\mathbb {R}}^D \rightarrow {\mathbb {R}}\), and proceeds in four steps:

  1. Obtain data instance \({\mathbf {x}}_t \in {\mathbb {R}}^D\) at time t.

  2. Predict the present given the past:

     $$\begin{aligned} {\mathbf {\hat{x}}}^{(k)}_t = h_k(\underbrace{[{\mathbf {x}}_{t-\tau },\ldots,{\mathbf {x}}_{t-1}]}_{{\mathbf {x}'}_{t-1} \in {\mathbb {R}}^{\tau \times D}}) \end{aligned}$$
     (1)

     with each model \(h_k\), using a history of \(\tau \) instances.

  3. Evaluate the predictions

     $$\begin{aligned} \epsilon _t^{(k)} = E({\mathbf {\hat{x}}}^{(k)}_t,{\mathbf {x}}_t) \end{aligned}$$

     for each prediction \({\mathbf {\hat{x}}}^{(k)}_t\).

  4. Return the feature representation

     $$\begin{aligned} {\mathbf {z}}_t = \phi [\epsilon _t^{(1)},\ldots ,\epsilon _t^{(K)}] \end{aligned}$$
     (2)

     where \(\phi \) is an optional transformation function (e.g. the identity, or an auto-encoder, as described later).

Thus, \({\mathbf {z}}_t\) is an instance in a new space (i.e. the error space) which embeds inter- and intra-instance contextual information. This can be interpreted as a new representation of the data stream, \({\mathbf {z}}_1, {\mathbf {z}}_2,\ldots \), and used for a variety of tasks, such as classification, as we demonstrate in Sect. 4.

Algorithm 1 (ESRTD pseudocode)
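
To make the loop concrete, the following is a minimal Python sketch of the four steps above (an illustration under assumed scikit-learn-style interfaces, not the exact Algorithm 1; for instance, multi-output forecasters such as MultiOutputRegressor(SGDRegressor()) are assumed, and \(\phi \) defaults to the identity):

```python
# Minimal sketch of one ESRTD step (Eqs. (1)-(2)); illustrative only.
# Assumptions: `models` are K fitted multi-output forecasters with a
# scikit-learn-style .predict() (e.g. MultiOutputRegressor(SGDRegressor())),
# and `window` holds the last tau instances [x_{t-tau}, ..., x_{t-1}].
import numpy as np

def esrtd_step(x_t, window, models, phi=lambda e: e):
    x_past = np.concatenate(window)                       # flattened tau x D history
    # Step 2: predict the present given the past with each model h_k (Eq. (1))
    preds = [h.predict(x_past[None, :])[0] for h in models]
    # Step 3: evaluate each prediction with the squared L2 error
    eps = np.array([np.sum((p - x_t) ** 2) for p in preds])
    # Step 4: return the (optionally transformed) error vector z_t (Eq. (2))
    z_t = phi(eps)
    # Shift the window; incremental model updates (partial_fit) could go here.
    window.append(x_t); window.pop(0)
    return z_t
```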

As we will show, each instance \({\mathbf {z}}_t\) of this error space may be much more powerful than each instance \({\mathbf {x}}_t\), and furthermore, it may be more compact, whenever \(K < D\) (which we show is often adequate).

An advantage of our approach is that we are not directly concerned with the accuracy of the models; only the accuracy in the final task (e.g. classification or clustering) matters. The models \(h_1,\ldots ,h_K\) serve to produce an error-space representation of the data. Indeed, we want the models to perform poorly on at least some of the data, as we will explain. Hence, we may train computationally cheap, linear, incremental models, and thus escape the intense training regimes and exhaustive parameter tuning faced by other models.

It is common, and often important, to use residual errors to diagnose problems in a regression scheme. In such scenarios, it is expected that each error will, in the best of cases, contain irreducible Gaussian noise. However, in our method, we are particularly interested in leveraging the form and dynamics of this error space; in Sect. 2.1 we explain why this space is useful in our case.

In particular, we desire that the new space encodes temporal and structural information in a minimal number of variables, thus creating a stream of instances upon which any standard data-stream classifier, regressor, or clusterer (depending on the application) may be applied.

In a data-stream environment, it is common to expect concept drift, of which there are several types (see [12] for a comprehensive survey), including reoccurring concepts. Let us distinguish between a concept and a state: a concept is the general system being modelled, for example the concept of monitoring the weather. The state of the system may change (i.e. the current state of the weather: overcast, storm, fine, etc.). Permanent or lasting concept drift (which, to follow the weather example, would relate to climate change) is an important issue in many applications, but it is not the focus of this paper; our method focuses on the representation of states or regular concepts. Ideally, each instance \({\mathbf {z}}_t\) should be enough to identify the state of the underlying system at timestep t.

By using weak forecasting models, we will not be able to reduce the error to the irreducible measurement error over all states, and thus the error distribution will have a useful structure. Indeed, since the error metric evaluates a prediction of the present given the past, it can be seen as a similarity measure between two timesteps and hence embeds local auto-correlation in the original signal into the new representation. We explore this in Sect. 2.1.

Obtaining a compact representation, i.e. the minimum number of variables necessary to represent a concept or state, is a fundamental task across much of machine learning. As such, many commonly used methods exist to deal with linear dependence, such as the ridge and lasso penalty terms (among many other possibilities). We use a method based on robust regression to place a premium on diversity, rather than on individual predictive performance. The diversity helps maintain an interesting structure in the error space. We then apply an auto-encoding function to reduce this space to its essential components (note \(\phi \) in Eq. (2) and Algorithm 1).

Given data points \(\{{\mathbf {x}}_t\}_{t=1}^T\), we wish to build models \(h_1,\ldots ,h_K\) that provide a representation \({\mathbf {z}}_t\) for each point \({\mathbf {x}}_t\) in a streaming and unsupervised fashion. We now describe how we initialize these models.

We design an approach inspired by the general methodology of robust regression, and in particular that of RANSAC (reviewed in [13], for example). In robust regression, a model is built on a set of points without being adversely affected by outliers. In our proposal, however, all data points outside the current state are treated as outliers, allowing particular models to learn particular states and thus encouraging diversity.

Ensemble methods such as random forests diversify the input space by taking random subsets of the attribute space (see, e.g. [14]). In our framework, we encourage diversity by splitting the time window randomly into input and output. Let us define a \(D \times W\) window (the instance space across W timesteps):

$$\begin{aligned}{}[{\mathbf {x}}_{t-\tau },\ldots, {\mathbf {x}}_{t-1}, {\mathbf {x}}_t] = \begin{bmatrix} x_{t-\tau }^1&\ldots&x_{t-1}^1&x_{t}^1 \\ \vdots&\ddots&\vdots&\vdots \\ x_{t-\tau }^D&\ldots&x_{t-1}^D&x_{t}^D \\ \end{bmatrix} \end{aligned}$$

We define \(K \times 2\) arbitrary subsets of the variables and use these subsets to define input and output vectors, \({{\mathbf {x}'}}_{t-1}^{(k)}\) and \({\mathbf {x}}_t^{(k)}\), respectively. Each model thus has a different input and output space and learns to predict a mix of present and past values from other present and past values. Diversity is hence assured, since each training pair \(({{\mathbf {x}'}}_{t-1}^{(k)},{\mathbf {x}}_t^{(k)})\) comes from a different distribution.

This technique, in combination with the main mechanism explained above, makes it possible to generate a large number of diverse models. Since some models may nevertheless be similar, we additionally apply regularization and reduction in the form of an auto-encoder.
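
A small sketch of this diversification step follows (the subset sizes and the flattening of the \(D \times W\) window are illustrative assumptions rather than the exact splitting scheme):

```python
# Sketch: draw K random (input, output) index splits over the flattened D*W window,
# so that model k learns to predict one random subset of window entries from another.
import numpy as np

rng = np.random.default_rng(1)

def random_io_splits(D, W, K, n_out=None):
    n_cells = D * W
    n_out = n_out if n_out is not None else D     # e.g. D target values per model
    splits = []
    for _ in range(K):
        perm = rng.permutation(n_cells)
        out_idx, in_idx = perm[:n_out], perm[n_out:]
        splits.append((in_idx, out_idx))
    return splits

# For a flattened window w_t of shape (D*W,), the training pair of model k is
#   inputs = w_t[in_idx]   and   targets = w_t[out_idx].
```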

The process of our method can be illustrated with reference to Fig. 1. For this illustration, we generated a single signal as follows:

$$\begin{aligned} y_t&= {\left\{ \begin{array}{ll} 1 &{} \textsf {mod}(t,40) < 10\\ 0 &{} \text {otherwise} \end{array}\right. } \\ x_t&= y_t \cdot (\sin (t/4)) + (1 - y_t)(x_{t-1} + u_t) \end{aligned}$$

i.e. a series of 10 points from one state/concept, followed by 30 from another, and so on repeating, where \(u_t \sim {\mathcal {N}}(0,0.1)\). Figure 1a shows this signal plotted over time, Fig. 1b shows it as pairwise points, and transforming it into the error space using our methodology leads to Fig. 1d, which shows the error-space embedding of each point. Even though the time component has been removed from the plot, there are clear regions corresponding to each concept that would be much more easily separable than in Fig. 1b. Although a single dimension is considered for visualization purposes, all other experiments are carried out across multiple dimensions.
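
For reference, the illustrative signal can be reproduced with a few lines of Python (we assume the 0.1 in \({\mathcal {N}}(0,0.1)\) denotes the standard deviation; the exact noise scale does not affect the illustration):

```python
# Generate the toy 1-D signal of Fig. 1: 10 steps of a sinusoid (y_t = 1)
# followed by 30 steps of a random walk (y_t = 0), repeating.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
x = np.zeros(T)
y = np.zeros(T, dtype=int)
for t in range(1, T):
    y[t] = 1 if (t % 40) < 10 else 0
    x[t] = y[t] * np.sin(t / 4) + (1 - y[t]) * (x[t - 1] + rng.normal(0, 0.1))
```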

Fig. 1

A raw 1D stream/signal is shown in Fig. 1a, i.e. \(x_t\) is plotted over time \(t=1,\ldots ,1000\). The state labels \(y_t\) (shown underneath) have been used to generate the data, but it is assumed that they are never known to the learner. We can view points over a window of length \(W=2\) (Fig. 1b), where each point is \([x_t,x_{t-1}]\), with the \(y_t\) labels of the points shown explicitly as red/blue. Clearly, without knowledge of \(y_t\), separability in this feature space is challenging. However, with K auto-regressive models \(h_1,\ldots ,h_K\), we acquire a K-dimensional error signal (Fig. 1c, where \(K=3\)), and plotting each step as a point \([\epsilon ^{(1)},\ldots ,\epsilon ^{(K)}]\) in K-dimensional space (Fig. 1d, again for \(K=3\)) makes separation an easier task: recall that Fig. 1b, d represent the same set of points. Of course, this is a simplified signal for illustration, but the mechanism works for any dimensionality of input, simply by choosing K models. Panels: a raw stream \(x_t|t=1,2,\ldots ,T\); b auto-correlation, \(x_{t-1}\) versus \(x_{t}\); c error stream; d error space

2.1 Theoretical insights

How can we justify our proposal: does the error space contain any useful information, making it suitable as a representation? Let us look at a simplified forecasting problem with a continuous target variableFootnote 1:

$$\begin{aligned} x_{t}&= h(x_{t-1}) + e_t \\ \hat{x}_t&= \mathop {\text {argmax}}\limits _{x \in {\mathbb {R}}} p(x|x_{t-1}) \\ \epsilon _t&= E(\hat{x}_t, x_t) = E(h(x_{t-1}), x_t) = E(e_t) \end{aligned}$$

Our ESRTD method is focussed on data resulting from a function of the error \(\epsilon _t\). Thus, we are interested in the information gained from using \(x_t\) instead of \(\epsilon _t\). This can be expressed quantitatively via KL divergence,

$$\begin{aligned} {\text {KL}} (p||q)&= \int _{-\infty }^{+\infty } p(x_t|x_{t-1}) \log \frac{p(x_t|x_{t-1})}{q(\epsilon _t|x_{t-1})} {{\mathrm{d}}}{x}_t \end{aligned}$$

where q is the distribution of getting error \(\epsilon _t\) by using prediction model h. Thus, here we are expressing the gain of using the predictive distribution over the error distribution (note we are considering the simplest case of time horizon of \(W = 2\) instances, and expressing \(p(x_t|x_{t-1}) \propto p(x_t,x_{t-1})\)).

Normally, the task is to reduce the error as much as possible, a fundamental goal in machine learning. In fact, the optimal error distribution with regard to the predictive power of model h is that which approaches the Dirac delta \(q(\epsilon _t) = \delta (\epsilon _t)\): both mean and variance close to 0. As we can see, this would result in a relatively high KL value, indicating that much information is gained by using x rather than \(\epsilon \).
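
To make this concrete, consider the illustrative (assumed) case where both densities are Gaussian, \(p = {\mathcal {N}}(\mu _1,\sigma _1^2)\) and \(q = {\mathcal {N}}(\mu _2,\sigma _2^2)\); the standard closed form is

$$\begin{aligned} {\text {KL}}\big ({\mathcal {N}}(\mu _1,\sigma _1^2) \,\Vert \, {\mathcal {N}}(\mu _2,\sigma _2^2)\big ) = \log \frac{\sigma _2}{\sigma _1} + \frac{\sigma _1^2 + (\mu _1 - \mu _2)^2}{2\sigma _2^2} - \frac{1}{2} \end{aligned}$$

and as \(\sigma _2 \rightarrow 0\), i.e. as q collapses towards a Dirac delta, the divergence grows without bound, consistent with the high KL value described above.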

In particular, maximizing KL corresponds to minimizing the term \(\Vert x_t - h(x_{t-1})\Vert ^2\), which is exactly what minimizing the mean squared error (MSE) does; i.e. effective learning drains information from the error space. However, we avoid this situation by using weak, relatively inexpressive, and auto-robust models, preventing optimal minimization of MSE.

In effect, minimizing MSE will not be possible, and in particular not possible across all states, due to the limited expressive power of the base models. This keeps MSE high, and KL divergence low, in the sense that greater information content is preserved in the error density.

There are other factors that allow our proposed method to perform well: the nonlinearity of error functions, and the embedding of time dependence. We may view the error function as a similarity function between two vectors \({\mathbf {x}}_t\) and \({\mathbf {x}'}_{t-1}\), i.e. as a kernel, parameterized by base model \(h_k\), thus

$$\begin{aligned} \kappa ({\mathbf {x}}_t, {\mathbf {x}'}_{t-1}) = (h_k({\mathbf {x}'}_{t-1}) - {\mathbf {x}}_t)^2 = \epsilon _t^{(k)} \end{aligned}$$

Actually this is not a valid kernel, because it is in general non-symmetric, \( \kappa ({\mathbf {x}}_t, {\mathbf {x}'}_{t-1}) \ne \kappa ({\mathbf {x}'}_{t-1}, {\mathbf {x}}_t) \) (indeed, the two inputs are not necessarily of the same dimension since \({\mathbf {x}'}_{t-1}\) is shorthand for \({\mathbf {x}}_{t-\tau :t-1}\) as we specified earlier). Nevertheless, it has certain benefits reminiscent of kernel methods: a single number that can be used to measure similarity, and nonlinearity.

In one dimension, with a linear regressor base model,

$$\begin{aligned} \kappa (x_t,x'_{t-1}) = (x_t - (w_0 + w_1 x'_{t-1}))^2 = z_t \end{aligned}$$

we can see that we arrive at what is much like a quadratic basis function; unlike ordinary basis function expansion, however, it is centred (on the model's prediction \(w_0 + w_1 x'_{t-1}\)).

Thus, the distances between points in the error space roughly encode the distances between each state. Each state of the stream will form its own distribution within this error space, with the added benefit that overlapping points in the original space may become linearly separable in this error space.
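
The following toy sketch (hypothetical data, separate from the experiments in Sect. 4) illustrates this point: two states that overlap in the original space, \(x_t \approx 0.9 x_{t-1}\) versus \(x_t \approx -0.9 x_{t-1}\), produce clearly separated clusters in a two-dimensional error space built from two weak linear forecasters.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(coef, n=500):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = coef * x[t - 1] + rng.normal(0, 0.1)
    return x

xa, xb = simulate(0.9), simulate(-0.9)      # two states, overlapping marginals

# Two weak linear forecasters, each fitted to one kind of dynamics
w_a = np.polyfit(xa[:-1], xa[1:], 1)
w_b = np.polyfit(xb[:-1], xb[1:], 1)

def errors(x, w):
    return (np.polyval(w, x[:-1]) - x[1:]) ** 2

# In the (eps_a, eps_b) error space, state-A points have small eps_a and larger
# eps_b, and vice versa for state B, so the two states become easy to separate.
print("state A:", errors(xa, w_a).mean(), errors(xa, w_b).mean())
print("state B:", errors(xb, w_a).mean(), errors(xb, w_b).mean())
```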

This scheme is related to that of an auto-encoder, which learns to reconstruct the (often noisy) input, and thus minimize the reconstruction error,

$$\begin{aligned} \hat{\theta }= \mathop {\text {argmin}}\limits _{\theta } | x_t - h_\theta (x_t) | \end{aligned}$$

where h is the encoder (e.g. neural) network, and the representation obtained is the hidden layer. Unlike this methodology, we make inter-instance and intra-instance predictions (minimizing prediction error, e.g. the mean squared error) with forecasting models; ours is thus an auto-predictive/self-taught scheme, in the sense of predicting the same signal (the present given the past), rather than an auto-encoding one (which predicts the present given the present). As a result, in our case the error space is available immediately from the models \(h_k\), which may be updated incrementally and already perform well on limited training data, rather than being learned iteratively and offline in a gradient-descent scheme (as auto-encoders typically are, usually requiring much training data to obtain good results).
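
Written with squared error for concreteness (an assumed but representative choice of loss), the two objectives differ only in what is fed to the model:

$$\begin{aligned} \text {auto-encoding:} \quad&\min _\theta \Vert {\mathbf {x}}_t - h_\theta ({\mathbf {x}}_t)\Vert ^2 \\ \text {auto-predictive (ESRTD):} \quad&\min _\theta \Vert {\mathbf {x}}_t - h_\theta ({\mathbf {x}'}_{t-1})\Vert ^2 \end{aligned}$$

where, in our case, the second objective is pursued only weakly, by cheap incremental models.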

Intuitively, greater separability can be obtained by using more dimensions (models), i.e. creating a higher-dimensional space, just as with basis function expansion (e.g. in polynomial regression). As with polynomial regression, however, there is a concern of overfitting, which is why we further filter the space with regularization in mind, finally producing the latent space \({\mathbf {z}}_t = \phi (\varvec{\epsilon }_t)\).

2.2 Parameter settings

The theoretical insights above guide us in choosing the settings for our model.

First of all, the base models \(h_k\): they should be of a class of models \({\mathcal {H}}\) with limited expressive power, and thus able to model only relatively simple concepts, and do so relatively quickly (i.e. with a quick learning curve). If the models were too powerful, they would not specialize to particular states, and would also incur increased computational complexity. We chose stochastic gradient descent (details in Sect. 4), which has the additional advantage of being an incremental learning scheme.

When attributes are continuous, \(h_k\) carries out regression; when discrete attributes are involved, classification. For general application, however, it is possible to carry out classification using regression methods. This is a suitable approach, since we are interested in obtaining a smooth error space rather than predictions, and thus no threshold is even necessary. Accuracy may indeed be lower than with a more ‘appropriate’ classification scheme, but we remark again that obtaining low error for these forecasting models is not our primary concern.

Assuming an adequate regularization scheme, a larger number of models is better; but, as with any parametric method, ensemble models, and so on, this is a trade-off between computational requirements and expected accuracy. In our case, we may add and remove models dynamically, to create a nonparametric scheme: new models may be added until diversity becomes difficult to acquire, and older models may be removed when their error becomes constantly high. We leave this investigation to future work due to the substantial additional experimentation that it entails.

As a simpler approach, we apply auto-encoders to compress the output space (the function \(\phi \) in Algorithm 1). We mention auto-encoders in the next section on related work, but note that our result is quite different from simply running an auto-encoder on the input instances (or a time window thereof, as we compare to in experiments). We create the new instance representation using the approach proposed above, but since this space, although powerful, is likely to contain redundancy, we employ auto-encoders to compress it.
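
A minimal sketch of this compression step follows, assuming the scikit-learn MLPRegressor implementation described in Sect. 4.1 (the hidden activations are read out manually from the fitted weights, since they are not exposed directly; the hyperparameters shown are illustrative):

```python
# Sketch: fit an auto-encoder phi on a batch of error vectors and return a
# function mapping a new error vector to its K-dimensional hidden representation.
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_phi(E_batch, K):
    """E_batch: (n_samples, K') matrix of error vectors from the initial batch."""
    ae = MLPRegressor(hidden_layer_sizes=(K,), max_iter=2000)
    ae.fit(E_batch, E_batch)                     # train to reconstruct its input
    W1, b1 = ae.coefs_[0], ae.intercepts_[0]     # input-to-hidden weights
    return lambda eps: np.maximum(0.0, eps @ W1 + b1)   # default 'relu' activation
```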

For the error/distance function E we use the \(L_2\) norm,

$$\begin{aligned} E({\mathbf {\hat{x}}}_t, {\mathbf {x}}_t) = \Vert {\mathbf {\hat{x}}}_t - {\mathbf {x}}_t\Vert _2^2. \end{aligned}$$

3 Related work

In this section, we survey the applicability of a range of techniques existing in the literature that may be applied to our task of interest: converting a stream of data points \({\mathbf {x}}_1,\ldots,{\mathbf {x}}_T\) to a stream of representative instances \({\mathbf {z}}_1,\ldots ,{\mathbf {z}}_T\) which encode time information. Under the constraints of a data stream, T may be unbounded, and after any initial training batch, \({\mathbf {z}}_t\) must be produced immediately at time t.

Producing a stream \({\mathbf {z}}_t \in {\mathbb {R}}^K\) from a stream \({\mathbf {x}}_t \in {\mathbb {R}}^D\), where \(K < D\), can be approached as a form of dimensionality reduction, or signal compression. PCA can be used for multivariate compression, simply by projecting data to the eigenvectors with the top-K eigenvalues to represent the new space (where \(K \le D\)). It is a fundamental and well-studied technique used in diverse areas.

To take into account time dependence, we may simply operate on a sliding window (of size W).Footnote 2 That is to say, we treat \([{\mathbf {x}}_{t-\tau },\ldots ,{\mathbf {x}}_t] \in {\mathbb {R}}^{D \times W}\) as the input at time t rather than only \({\mathbf {x}}_t\). In fact, this is already a fundamental step in our method, and it can also be applied independently to any other method to ‘remove’ the time dependence.

Auto-encoders [15] are a common and often successful approach where (in the typical setup) a multi-layer neural network is used to predict its own input. In fact, our method already uses an auto-regressive mechanism, but is different in the sense that we predict the present given the past (auto-encoders predict the present given the present). Note that an auto-encoding neural network with linear activation functions will produce PCA-like results. But auto-encoders can be more interesting, since they are trained with gradient descent and thus more easily adapted to a data-stream environment. The sliding window can also be used as input to take into account the local time dependence.

Clustering can be used in a similar way: a number of clusters or latent factorsFootnote 3 are obtained from the input (see [2] for a survey of such methods). Again, simply working on a sliding window will also embed past information, i.e. take into account time dependence. Most approaches use a variant of expectation maximization (EM) to learn the latent variables \([z_1,\ldots ,z_K]\). Many available methods have incremental adaptations suitable for streams (e.g. [16, 17]).

Hidden Markov models (HMMs) can often provide an effective clustering across time. Because of the time component, working on a sliding window is not necessary; the assumed Markov dependence across the hidden states \(z_t\) instead takes the dependence into account. HMMs are indeed a natural way to deal with this task, e.g. [18]. The filtering algorithm can be run online; learning the parameters is typically done offline in batch (with an EM variant). HMMs are typically trained with expectation maximization and applied in time series problems (e.g. in [9] for fraud detection). A more advanced scheme is put forth in [19], based on the assumption that data coming from the same source will have the same underlying probability distribution. It compares the probability density over the data in a sliding window on the incoming data stream, but it is not directly applicable in an online mode because the number of states grows without bound as data instances arrive sequentially.

Random projections [20] can also be used for dimensionality reduction, by projecting an instance (or time window of instances) in Euclidean space to a lower-dimensional space. In a simple form, random projections are like PCA, but the first component is chosen randomly (the other components are orthogonal, as in PCA). In a sense, basis functions are a kind of random projection.

Examination of the residuals of a regression analysis (i.e. the error space, in the case of multi-dimensional data) has long been of interest in different domains [21], but usually for the purpose of diagnosing regression models (since commonly, the residuals are expected to form a Gaussian distribution). The error space has also been analysed in some specialist applications such as outlier detection or robotics [22], where each point is judged (as being an outlier or not) according to the magnitude of its residual (error). In our methodology, however, we use a multi-dimensional error space as input to a model, which is qualitatively different.

4 Application to data-stream classification

In this section, we empirically investigate the effectiveness of the instance-space representation produced by our method, as compared to other methods. We measure this effectiveness via classification accuracy on supervised datasets, namely the accuracy of predicting the dataset label from the unsupervised representation produced by each method. A more effective representation should imply better classification accuracy. We proceed as follows:

  1. Load dataset \({\mathcal {D}}= \{({\mathbf {x}}_t,y_t)\}_{t=1}^T\) (\(y_t\) labels \({\mathbf {x}}_t\))

  2. Build \(g : X \mapsto Z\) from \(\{{\mathbf {x}}_t\}_{t=1}^{t_0}\) (initial batch; unsupervised)

  3. Obtain representation \(\{{\mathbf {z}}_t\}_{t=t_0}^T\) where \(\mathbf {z}_t = g(\mathbf {x}_t)\) (data-stream; unsupervised)

  4. Evaluate a supervised model f on dataset \(\{({\mathbf {z}}_t,y_t)\}_{t=t_0}^T\); record accuracy

Here, each method (including our proposal), denoted g, provides a new representation, which is then used by a supervised classifier, denoted f. Thus, we use datasets with class labels (denoted \(y_t\)). We do this evaluation offline in batch form, for simplicity, since it is only of secondary interest: in terms of performance in data streams, we are primarily focused on obtaining the representation in this scenario; the classification scenario is only used to separately validate the effectiveness of this representation. On account of the time window, we may assume that time dependence has been removed from each instance, and thus there is no need for an online evaluation. Only the initial batch of \(t_0\) instances is used for offline training, but this is to allow us to compare to non-incremental methods such as PCA. All methods may produce a new representation incrementally once trained on this initial batch, except for hidden Markov models.

We choose stochastic gradient descent (SGD; hinge loss, ridge penalty) for f, and evaluate with 10-fold cross-validation. We choose time window \(W = \tau + 1 = 6\), thus \({\mathbf {x}'}_{t-1} {\mathrel {\triangleq }}{\mathbf {x}}_{t-5:t-1}\), according to pilot studies on both datasets, which showed little change for larger time windows. A rigorous study of suitable \(\tau \) is application-dependent and outside the scope of this study. We set the initial batch to \(t_0 = T/20\) instances.
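
The evaluation protocol can be summarized by the following sketch (illustrative; the dataset arrays and the representation builder are assumed to be defined elsewhere, and g here consumes the length-W window ending at each timestep):

```python
# Sketch of the evaluation pipeline: unsupervised representation on the initial
# batch, transformation of the remaining stream, then supervised scoring.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

def evaluate_representation(X, y, build_g, W=6):
    T = len(X)
    t0 = T // 20                                  # initial (unsupervised) batch
    g = build_g(X[:t0])                           # e.g. ESRTD fitted on the batch
    Z = np.array([g(X[t - W + 1:t + 1]) for t in range(t0, T)])
    clf = SGDClassifier(loss="hinge", penalty="l2")
    return cross_val_score(clf, Z, y[t0:], cv=10).mean()
```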

We wrote all experiment code in Python,Footnote 4 using Scikit-LearnFootnote 5 for implementations (e.g. SGD, PCA, GMM) except the HMM, for which we used hmmlearn.Footnote 6

4.1 Methods and parameters

A list of the methods used in the empirical evaluation is given in Table 2 (values of K and \(\tau \) are provided additionally in the tables). A description of the hidden space is given in the second column. All methods are based on a moving window of size W except hidden Markov models (HMMs), which take time dependence into account already. The explicit Moving Window does no further processing on this window (thus produces \({\mathbf {z}}_t \equiv [{\mathbf {x}}_{t-\tau },\ldots ,{\mathbf {x}}_t]\)).

Table 1 Accuracy of SGD (and decision tree—DT) over fivefold CV using the different feature space representations produced by the different methods (see Table 2 for details) for \(K=3\), \(W=6\) (hence full sliding window of \(W \times D\) for D attributes in the original datasets; K for all other models)

For auto-encoders, we used the standard implementation of MLPRegressor in Scikit-Learn (default parameters) and extracted the hidden layer. For the HMM, we used a spherical covariance matrix with uniform starting probabilities and the transition matrix as defined in Sect. 4.2 for our synthetic data, using the hmmlearn package. Inference is done with the forward–backward algorithm, making it the only method that breaks the online-inference setting.
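
For reference, a sketch of this HMM configuration using hmmlearn follows (keeping the start and transition probabilities fixed during fitting is an assumption of the sketch; parameter names are those of the GaussianHMM class):

```python
# Sketch: HMM baseline with spherical covariances, uniform start probabilities,
# and a fixed transition matrix; state posteriors serve as the representation.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def hmm_representation(X, K=3, p_stay=0.95):
    hmm = GaussianHMM(n_components=K, covariance_type="spherical",
                      init_params="mc", params="mc")   # only means/covs are learned
    hmm.startprob_ = np.full(K, 1.0 / K)
    hmm.transmat_ = np.full((K, K), (1.0 - p_stay) / (K - 1))
    np.fill_diagonal(hmm.transmat_, p_stay)
    hmm.fit(X)                                     # EM (Baum-Welch) on the batch
    return hmm.predict_proba(X)                    # forward-backward posteriors
```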

4.2 Datasets

We generated synthetic data (Synth) with the goal of creating a data stream with complex time dependence and changes in state. The transition distribution \(p(y_t|y_{t-1})\) is defined by \(p(y_t=j|y_{t-1}=j) = 0.95\) (same label/state) and \(0.05/(L-1)\) for the other labels. We use recurrent \(D \times D\) matrices \(\{{\mathbf {W}}_j\}_{j=1}^L\), and a \(K \times D\) matrix \({\mathbf {U}}\). We obtain dataset \(\{{\mathbf {x}}_t,y_t\}_{t=1}^T\) as follows:

$$\begin{aligned} y_t&\sim p(\cdot |y_{t-1}) \\ {\mathbf {x}}_t&\leftarrow f\big ({\mathbf {W}}_{y_t}^\top {\mathbf {x}}_{t-1} + {\mathbf {U}}y_t\big ) + {\mathbf {v}} \end{aligned}$$

where \({\mathbf {v}} \in {\mathbb {R}}^D\), \(v_j \sim {\mathcal {N}}(0,1)\). Thus, \(y_t\) indexes a slightly different initialization of \(\mathbf {W}\). Specifically, we begin with a \(\mathbf {W}_0\) where \(({W_0})_{jk} \sim \mathcal {N}(0,0.1)\), and from this create \(\mathbf {W}_{y_t} = \mathbf {W}_0 * \mathbf {B}_{y_t}\), where \((\mathbf {B}_{y_t})_{jk} \sim \textsf {Bern}(0.5)\), i.e. a binary masking matrix of 50% sparsity. To initialize, we set \(\mathbf {x}_0 \sim [\mathcal {N}(0,1),\ldots ,\mathcal {N}(0,1)]\). We generate a stream of 10,000 instances, with \(L=2\) (thus \(y_t \in \{0,1\}\)), using \(\tanh \) as the nonlinearity f.
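
A sketch of this generator is given below (assumptions: the noise and weight scales are interpreted as standard deviations, D is chosen arbitrarily for illustration, and the state-dependent term \({\mathbf {U}}y_t\) is implemented by selecting a D-dimensional row per state, which is our reading of the required dimensionality):

```python
# Sketch of the Synth generator: a label-switching Markov chain driving a
# masked recurrent map with tanh nonlinearity and additive Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
T, D, L, p_stay = 10_000, 6, 2, 0.95              # D is an illustrative choice

W0 = rng.normal(0, 0.1, (D, D))
W = [W0 * rng.binomial(1, 0.5, (D, D)) for _ in range(L)]   # 50%-sparse masks
U = rng.normal(0, 1, (L, D))                      # one state-dependent bias per row

x = np.zeros((T, D))
y = np.zeros(T, dtype=int)
x[0] = rng.normal(0, 1, D)
for t in range(1, T):
    if rng.random() < p_stay:
        y[t] = y[t - 1]
    else:
        y[t] = rng.choice([j for j in range(L) if j != y[t - 1]])
    x[t] = np.tanh(W[y[t]].T @ x[t - 1] + U[y[t]]) + rng.normal(0, 1, D)
```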

The electricity dataFootnote 7 (Elec) is a commonly used benchmark dataset in data-stream classification tasks. It contains 45,312 instances, each with 6 attributes plus a binary class (i.e. \(y_t \in \{0,1\}\), \(L=2\)), where the positive class indicates that the price of electricity rises (and the negative that it falls). It is known to contain time dependence [8]. A small section is visible in Fig. 2. Results are given in Table 1.

Fig. 2

The original Elec signal \(\mathbf {x}_t\) plotted for over 60 points, the true labels \(y_t \in \{0,1\}\) are indicated at the top

Table 2 Methods and the description of the representation they create

5 Discussion and conclusions

A simple sliding window is quite effective in terms of accuracy. However, in this case the feature space is of size \(D \times W\) (84 and 36, respectively), which is considerably larger than that of the other methods, which produce a compact representation of size K not tied to any dimension of the dataset (\(K=3\) in this case). The challenge of performing well with \(K < D\) should not be understated, especially in these temporally dependent streams, where more than D attributes may be needed for an accurate prediction (considering the time window).

The ESRTD-Basic approach with K forecasting models performs relatively poorly on both datasets, since \(K=3\) is simply not enough in this context: Figure 3 illustrates how the accuracy improves rapidly with larger K. Since each model is simple and rapid to obtain, the cost of creating more models is relatively small; however, a large K negates many of the advantages in the second task, which in this case is classification. Indeed, having \(K>(D \times W)\) would practically eliminate all advantages. Hence, since we already identified redundancy among the models due to their random and weak initialization, in ESRTD-AE we initialize \(10 \cdot K\) models but reduce these to only K outputs using an auto-encoder, thus creating a compact representation (of size K). This method performs very competitively.

Fig. 3

Accuracy (blue) and running time (red, in seconds) of ESRTD-Basic (solid lines) and ESRTD-AE (dashed lines) on the Elec data wrt K (color figure online)

Indeed, compared to related work, our method ESRTD-AE leads to the best performance overall on the Electricity data, even better than the sliding window approach (although the difference is not likely to be statistically significant), with 6 times fewer attributes. HMMs are competitive on the Synthetic data, but we must recall that they perform forward–backward inference in outputting the feature space, which is outside the data-stream scope we are targeting. On the Synthetic data, the simple sliding window performs best, but at a cost of 84 input features instead of 3; a significant difference (28 times fewer). Figure 4 shows the 3D error space on the Elec data; note the approximate separation of classes.

Fig. 4

Error space from ESRTD on the Elec data with \(K=3\) models; each \(\epsilon _k\) is one of the axes. Points are coloured according to class value

One can also experiment with different classes of models (f), as seen in Table 1. Results are not markedly different overall when using a decision tree, despite the decision tree being a more powerful learner. This is positive and can be explained by the base models producing a relatively simple feature space to learn from; indeed, this was a prime motivation for this work. Importantly, we also remark that the results are not much different in terms of their overall ranking.

All methods that we selected could deal with the multi-dimensional case. Although it was not a specific part of this experimental study, we remark that ESRTD can naturally handle a mixture of categorical and continuous attributes, since the error space is still continuous (e.g. if prediction confidence is measured). This flexibility is not as easily afforded by the other approaches (for example PCA), which need more careful consideration of the input attributes.

Finally, we point out that our method produces a number of forecasting models which, aside from their integral contribution to the ESRTD method, are useful for other tasks on the data stream (namely forecasting or imputing missing values). Even though we emphasized throughout that each of these models is relatively weak on its own and responsible for only a subsection of the feature space, they may be combined into a more powerful ensemble or used in other tasks such as change detection. This could be fertile ground for future work.