1 Introduction

Binary sequence classification has attracted considerable interest in both academia and industry, particularly in failure prediction [1,2,3] and anomaly detection [4,5,6]. We observe that for a majority of these applications, the datasets appear to possess a unilateral common factor: samples from one class occur because of common underlying causes, while those from the other class may not. Take the task of predicting seizures from electroencephalograph (EEG) data as an example. EEG waves can differ when the subject is undergoing different activities [7], but when the subject is about to suffer from epilepsy they are empirically shown to reflect the underlying pathological features. (For brevity, we call the class having the common factor positive and the other class negative in the rest of the paper.) On the other hand, encouraged by recent advances in deep learning [8], researchers have successfully demonstrated its superiority in sequence classification as well [9,10,11,12,13]. In this paper, we are interested in binary classification of sequences possessing a unilateral common factor using CNNs.

Due to both implementation constraints and the nature of the data, learning a CNN for these tasks is generally associated with two problems. First, implementation design often requires chopping an original sequence into sub-sequences, which is also an important step for data augmentation when training a deep learning model. An appropriate window size for segmentation provides the learning process with a sufficient amount of training data while avoiding serious multiple-instance learning issues, wherein a large portion of the sub-sequences from the original sequence do not actually carry representative features of the corresponding class. Defining such a window size that yields sufficient high-quality data augmentation is non-trivial without domain knowledge. Second, while samples from the positive class present similar features due to the common underlying causes, those from the negative class can have various latent characteristics and may prevent the CNN from learning discriminative representations. To address the first problem, we define a random variable over a set of sample scaling parameters. Following this random variable’s distribution, we sample sub-sequences of different lengths from the original series and scale them, according to the sampled scaling parameter, to a common length. We fit the CNN to these scaled sub-sequences, and at the end of every k iterations we update the parameters of the scaling parameters’ distribution, so that it is trained jointly with the CNN. We show that as training progresses, the distribution we sample from converges and peaks on a few scales that are optimal for the task. We call this process adaptive multi-scale sampling (AMS) and explain why it works from the perspective of reinforcement learning (RL). With the knowledge of the unilateral common factor, we mitigate the second problem using activation pattern regularization (APR), an extra term in the objective function that regularizes the activation patterns of samples from the positive class only. (For example, in the seizure prediction task above, we apply APR to samples collected at the onset of epilepsy.) Concretely, when training the CNN we construct, in each mini-batch, a Gramian matrix for each positive sample that represents its activation pattern, and we minimize the variances of the matrices’ entries. To demonstrate the advantages of our proposals, we conducted extensive experiments on real-world datasets.

Our main contribution in this paper is a deep learning scheme that combines AMS and APR. To the best of our knowledge, we are the first to give tentative solutions to the aforementioned two problems in the context of training a CNN for binary sequence classification, and we demonstrate their effectiveness on real-world datasets. The rest of this paper is organized as follows. We briefly introduce related literature in Sect. 2, then describe AMS and APR in Sect. 3. Experiments demonstrating the effectiveness of our proposals are presented in Sect. 4, and we conclude our work in Sect. 5.

2 Related Work

A wealth of literature on sequence classification, ranging from heuristic methods to solutions utilizing probabilistic models, has been published [14]. Encouraged by the recent success of deep learning applications, some researchers have demonstrated the effectiveness of applying deep learning models to sequence classification as well. For instance, some approaches first encode raw time series inputs into images and then fit a CNN with 2D convolutional filters to the images, thus reducing the problem entirely to image classification so that all relevant tools and parameter tuning techniques can be exploited [9, 10]. In another flavour, [11, 12] chop the target time series input into equal-length segments and then perform convolutions on these transformed data. In these cases, 1D convolutional filters shared across data channels are applied along the time dimension, and a fusion mechanism such as probability voting is adopted to determine the label of the original time series. To remove the uniform input length constraint, [13] adopted a sequence-to-sequence model common in natural language tasks, wherein the authors squash the original input sequences of various lengths into feature sequences of a common length; these feature sequences are then fed into the trailing fully-connected layers. To help learn the feature sequences, they also introduced an attention mechanism that helps the model extract discriminative features by focusing on the most relevant parts of the original sequences during feature generation.

3 Our Proposals

3.1 Network Architecture and Notations

The network we use in our experiments is of the form \({input}\rightarrow 3{Conv}\rightarrow 2{FC}\rightarrow {output}\), where the preceding numbers indicate the numbers of layers. All layers except the output have ReLU activations. It is noteworthy that we purposefully designed the network to be simple and did not tune its architecture in our experiments, because we want to make sure all the performance advantages come from our proposals, though advanced setups such as batch normalization and skip connections are theoretically compatible.
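To make the architecture concrete, below is a minimal PyTorch sketch of such a network; the channel counts, kernel sizes, strides, and FC width are placeholders of our own choosing and not the exact per-task settings used in the experiments (see Sect. 4.1).

```python
# A minimal sketch of the network described above: 3 conv layers, 2 FC layers,
# ReLU on every layer except the output. All hyper-parameters below (channels,
# kernel sizes, strides, FC width) are illustrative placeholders.
import torch
import torch.nn as nn

class SimpleSeqCNN(nn.Module):
    def __init__(self, in_channels=1, input_len=4000, n_filters=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, n_filters, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=3, stride=2), nn.ReLU(),
        )
        with torch.no_grad():                      # infer the flattened feature size
            feat = self.conv(torch.zeros(1, in_channels, input_len))
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat.numel(), 64), nn.ReLU(),
            nn.Linear(64, 1),                      # single logit for the binary output
        )

    def forward(self, x):                          # x: (batch, channels, length)
        return self.fc(self.conv(x)).squeeze(-1)   # raw logits; apply sigmoid outside
```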

Table 1 summarizes the notations we use frequently in this paper. The objective function that we minimize is defined as follows:

$$\begin{aligned} l(\mathbf {X}, \mathbf {y}; \mathbf {\theta }, \mathbf {\phi }) = \mathop {{}\mathbb {E}}[f_\mathbf {\theta }(\mathbf {X}, \mathbf {y}) + \lambda {g_\mathbf {\theta }(\mathbf {X})} \mid \mathbf {\phi }] \end{aligned}$$
(1)

where \(f_\mathbf {\theta }(\mathbf {X}, \mathbf {y})\) is the binary cross-entropy loss and \(g_\mathbf {\theta }(\mathbf {X})\) is the term from APR, both of which are parameterized by the CNN’s parameters \(\mathbf {\theta }\). \(\lambda \) is a hyper-parameter that determines how much weight we put on APR. The expectation is taken over the training dataset, which depends on the parameter \(\mathbf {\phi }\) from AMS. If we remove both AMS and APR, the objective function reduces to \(f_\mathbf {\theta }(\mathbf {X}, \mathbf {y})\), the typical loss for training a CNN on a binary classification problem. Also notice that neither AMS nor APR adds new network parameters to the CNN, so the network’s capacity remains the same.

Table 1. Notations

3.2 Adaptive Multi-scale Sampling (AMS)

Multi-scale Sampling. In AMS, we define several sample scaling parameters \(\{s_1, s_2, \cdots \}\) and a random variable S over them. S is categorical, and we define its distribution to be \(P(S=s_i)\triangleq \frac{\exp (\phi _i/\eta )}{\sum _j{\exp (\phi _j/\eta )}}\), where \(\mathbf {\phi }=[\phi _1, \phi _2, \cdots , \phi _{|S|}]\) are its parameters and \(\eta \) is a constant that acts as the temperature of this softmax function. We initialize \(\mathbf {\phi }\) to be a vector of all ones, and during CNN training we randomly crop a mini-batch of sub-sequences of lengths \(l{\cdot }S\) by drawing samples from P(S). We scale all sub-sequences to a common length l, pre-defined as our network’s input dimension, and feed the mini-batch to the CNN. Mathematically, let us denote \(x_{1,\cdots ,lS}\) as a randomly cropped sub-sequence, \(x'_{1,\cdots ,l}\) as the sub-sequence after scaling, and \(h_S(\cdot )\) as the scaling operator; we can then adopt several scaling strategies. For example, we can take the mean of every S samples from \(x_{1,\cdots ,lS}\), so that \(h_S(x_{1,\cdots ,lS}) = \{x'_{1,\cdots ,l} \mid x'_i={mean}({x_{(i-1)S+1,\cdots ,iS}})\}\). Or, in a simpler form, we can define \(x'_i\) to be the j-th element of every S consecutive samples, which leads to \(h_S(x_{1,\cdots ,lS}) = \{x'_{1,\cdots ,l} \mid x'_i = x_{(i-1)S+j} \}\).
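The following NumPy sketch illustrates these two scaling strategies for a univariate series; the function names and the random-crop helper are ours, introduced only for illustration.

```python
# A minimal sketch of the scaling operator h_S for a univariate series: given a
# crop of length l*S, return a sequence of the common length l either by
# averaging every S samples or by keeping the j-th element (0-based here) of
# every S consecutive samples. Names and defaults are illustrative.
import numpy as np

def h_S(x, S, l, strategy="mean", j=0):
    """x: 1-D array of length l*S; returns an array of length l."""
    x = np.asarray(x, dtype=float)
    assert len(x) == l * S, "expect a crop of exactly l*S samples"
    blocks = x.reshape(l, S)            # row i holds the i-th group of S samples
    if strategy == "mean":
        return blocks.mean(axis=1)      # x'_i = mean of every S samples
    return blocks[:, j]                 # x'_i = j-th element of every S samples

def random_crop(x, S, l, rng=None):
    """Randomly crop a sub-sequence of length l*S from a longer series x."""
    rng = np.random.default_rng() if rng is None else rng
    start = rng.integers(0, len(x) - l * S + 1)
    return x[start:start + l * S]

# Example: crop a length-2l window from a toy series and mean-pool it to length l.
series = np.sin(np.linspace(0, 20, 10000))
crop = random_crop(series, S=2, l=4000)
scaled = h_S(crop, S=2, l=4000, strategy="mean")   # shape (4000,)
```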

To keep notations uncluttered, we merge some operators into \(h_S(\cdot )\) and \(\pi _\mathbf {\theta }(\cdot )\). Firstly, we merge the sub-sequencing operator into \(h_S(\cdot )\) such that if its input is longer than \(l{\cdot }S\), then sub-sequencing (random crop in training; segmentation with minimum overlapping in evaluation) is performed prior to scaling. Furthermore, because we segment the original time series, we need a fusion mechanism for predictions. E.g., if we want the prediction score of the i-th sample in a test batch, the CNN gives several scores, one for each of the sub-sequences originating from \(h_S(\mathbf {X}_i)\), and we need to merge these scores to get the prediction score for \(\mathbf {X}_i\). In this paper, we define the score for the original series as the average of its sub-sequences’ scores; we assume this fusion operation is merged into \(\pi _\mathbf {\theta }(\cdot )\) and is applied whenever necessary.

Adaptive Update. At the beginning of training, a mini-batch consists of sub-sequences of multiple scales, each of which has equal sampling weight. We train the CNN with sub-sequences of mixed scales, and at the end of every k-th training iteration, we randomly sample some time series from the entire training set, apply the scaling operator \(h_S(\cdot )\) to this set, and feed the scaled dataset to the CNN trained so far. We then calculate the first term in (1) by evaluating (2); the definition of \(f_\theta (h_S(\mathbf {X}), \mathbf {y})\) is given in (3). Minimizing (2) thus becomes jointly learning the CNN’s parameters \(\mathbf {\theta }\) and the parameters \(\mathbf {\phi }\) of P(S). In our experiments, at the end of every k CNN training iterations, we perform a one-step gradient descent on \(\mathbf {\phi }\) using (4). The gradient with respect to \(\mathbf {\phi }\) is easy to derive and is given in (5), where the matrix \(\mathbf {J}\) has entries \(\mathbf {J}_{ii}=\frac{1}{\eta }P(S_i)(1-P(S_i))\) and \(\mathbf {J}_{ij}=-\frac{1}{\eta }P(S_i)P(S_j)\) for \(i{\ne }j\) (a small sketch of this update follows the equations below). The adaptive update is just a single gradient step whose computational cost is negligible given the small number of parameters involved.

$$\begin{aligned} \mathop {{}\mathbb {E}}[f_\mathbf {\theta }(h_S(\mathbf {X}), \mathbf {y}) \mid \mathbf {\phi }] = \sum _{i}{P(S_i)\,f_\mathbf {\theta }(h_{S_i}(\mathbf {X}), \mathbf {y})} = \mathbf {P}^{\intercal }\mathbf {L} \end{aligned}$$
(2)

$$\begin{aligned} f_\mathbf {\theta }(h_S(\mathbf {X}), \mathbf {y}) = -\frac{1}{N}\sum _{i=1}^{N}{\left[ y_i\log {\pi _\mathbf {\theta }(h_S(\mathbf {X}_i))} + (1-y_i)\log {\left( 1-\pi _\mathbf {\theta }(h_S(\mathbf {X}_i))\right) }\right] } \end{aligned}$$
(3)

$$\begin{aligned} \mathbf {\phi } \leftarrow \mathbf {\phi } - \alpha {\nabla _\mathbf {\phi }}\,\mathbf {P}^{\intercal }\mathbf {L} \end{aligned}$$
(4)

$$\begin{aligned} \nabla _\mathbf {\phi }\,\mathbf {P}^{\intercal }\mathbf {L} = \mathbf {J}\mathbf {L}, \qquad \mathbf {P}_i = P(S_i), \quad \mathbf {L}_i = f_\mathbf {\theta }(h_{S_i}(\mathbf {X}), \mathbf {y}) \end{aligned}$$
(5)
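As a small illustration of the adaptive update, the sketch below takes P(S) as the temperature softmax over \(\mathbf {\phi }\), collects the per-scale losses \(f_\theta (h_{S_i}(\mathbf {X}), \mathbf {y})\) into \(\mathbf {L}\), and performs a single gradient step on \(\mathbf {\phi }\). The learning rate and the example loss values are placeholders.

```python
# One adaptive-update step: phi <- phi - alpha * J @ L, where J is the Jacobian
# of the temperature softmax P(S) with respect to phi and L holds the per-scale
# losses f_theta(h_{S_i}(X), y). alpha and the example values are placeholders.
import numpy as np

def softmax_over_phi(phi, eta):
    z = np.exp(phi / eta)
    return z / z.sum()                               # P(S_i) as defined in Sect. 3.2

def ams_update(phi, L, eta=1.0, alpha=0.1):
    P = softmax_over_phi(phi, eta)
    # J_ii = P_i (1 - P_i) / eta,  J_ij = -P_i P_j / eta for i != j
    J = (np.diag(P) - np.outer(P, P)) / eta
    grad = J @ L                                     # gradient of the expected loss w.r.t. phi
    return phi - alpha * grad                        # one-step gradient descent on phi

phi = np.ones(4)                                     # initialised to all ones
L = np.array([0.68, 0.61, 0.55, 0.52])               # hypothetical per-scale losses
phi = ams_update(phi, L)                             # scales with lower loss gain probability
```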

Interpretation from RL. RL has recently been a popular topic in the machine learning community; [15] serves as an excellent introductory material. In RL, an agent interacts with an environment by taking actions according to a policy that usually takes into account the observed state of the environment. The agent receives a reward/penalty from the environment that depends on both the environment and its actions, and its goal is to maximize/minimize the accumulated reward/penalty by adjusting its policy. Policy iteration is an RL algorithm that searches for the best policy by iterating between two operations: policy evaluation (given a policy, estimate the expected reward/penalty in each state) and policy improvement (given the estimated reward/penalty in each state, improve the policy by taking greedy actions). If we consider a snapshot of the CNN’s parameters \(\mathbf {\theta }\) as a state of the environment and P(S) as the agent’s policy from which we sample actions (that is, picking a sample scaling parameter), then AMS resembles policy iteration. Specifically, if we regard (2) as the expected penalty, in policy evaluation we estimate the value of (2) under our current setting of \(\mathbf {\phi }\), and in the policy improvement phase, we update our policy by taking a gradient step on \(\mathbf {\phi }\).

3.3 Activation Pattern Regularization (APR)

APR takes the form of an augmented term \(g_\theta (\mathbf {X})\) given by:

$$\begin{aligned} g_\theta (\mathbf {X}) = \frac{1}{n^2}\sum _{i=1}^{n^2}{{var}(\mathbf {A}^{\intercal }\mathbf {Z}, i)} \end{aligned}$$
(6)

where \(\mathbf {Z}\in \mathcal {R}^{N{\times }n^2}\) contains the flattened Gramian matrices calculated from the activations of the last convolutional layer, and \(\mathbf {A}\in \mathcal {R}^{N{\times }p}\) is a matrix that selects only the positive samples in \(\mathbf {X}\). The variance operator is applied along each column i of the matrix product \(\mathbf {A}^\intercal \mathbf {Z}\in \mathcal {R}^{p{\times }n^2}\). To be concrete, let \(z_i\in \mathcal {R}^{l'{\times }n}\) be the activation of the last convolutional layer for the i-th sample in the batch; then each row of \(\mathbf {Z}\) is given by \(\mathbf {Z}_i={flat}(z_i^{\intercal }z_i)\in \mathcal {R}^{1{\times }n^2}\). It is easy to see that \(\mathbf {Z}_i\) encodes the relationships between the filters’ activations for the i-th sample; therefore, the flattened Gramian matrices \(\mathbf {Z}\) describe the activation pattern of each sample across a batch. Because we impose regularization on only the positive samples, we construct a mask matrix \(\mathbf {A}\) whose entry is defined as \(\mathbf {A}_{i,j}=1\) if \(\mathbf {X}_i\) is the j-th positive sample in the batch and 0 otherwise, for \(i=1, \cdots , N\) and \(j=1, \cdots , p\). Since the variance operator in (6) is applied along columns, \(g_\theta (\mathbf {X})\) is a measure of the variance of activation patterns among positive samples. When this term from APR is added to the original objective function, the training process becomes a multi-task learning problem, and we impose a hyper-parameter \(\lambda \) to control its strength. Following (1), when APR is applied together with AMS, the gradient in (5) should be updated by re-defining \(\mathbf {L}\)’s entry to be \(\mathbf {L}_i=f_\theta (h_{S_i}(\mathbf {X}), \mathbf {y}) + \lambda g_\theta (h_{S_i}(\mathbf {X}))\).
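A minimal PyTorch sketch of (6) is given below; it assumes the last convolutional layer’s activations have been reshaped to (batch, \(l'\), n) and that a boolean vector marks the positive samples in the batch. Variable names are ours.

```python
# A minimal sketch of the APR term (6). `acts` holds the last conv layer's
# activations with shape (N, l', n) (time steps x filters); `is_positive` is a
# boolean mask playing the role of the selection matrix A. Names are ours.
import torch

def apr_term(acts, is_positive):
    """g_theta(X): mean variance, over positives, of the flattened Gramian entries."""
    N, l_prime, n = acts.shape
    gram = torch.bmm(acts.transpose(1, 2), acts)      # (N, n, n): z_i^T z_i per sample
    Z = gram.reshape(N, n * n)                        # flattened Gramians, one row per sample
    Z_pos = Z[is_positive]                            # keep only the positive samples (A^T Z)
    if Z_pos.shape[0] < 2:
        return acts.new_zeros(())                     # variance undefined for < 2 positives
    return Z_pos.var(dim=0, unbiased=False).mean()    # (1/n^2) * sum of per-entry variances

# Usage inside a training step (lambda_apr is the hyper-parameter lambda):
# loss = bce_loss + lambda_apr * apr_term(last_conv_acts, y.bool())
```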

4 Experiments

4.1 Experimental Setup

We conducted extensive experiments on two datasets, each of which consists of data from multiple tasks. For each task, we compare the cross-validation results of 4 methods: baseline (vanilla CNN training scheme), AMS, APR, and AMS+APR. We used the same network architecture as described in Sect. 3.1 for all tasks. Although parameters such as kernel sizes and strides differ for each task, we used the same set of parameters for all methods within each task to ensure fair comparisons. At training time we randomly sample sub-sequences from the training split (rebalancing samples to a 1:1 positive-to-negative ratio), and at test time we chop each validation segment with minimum overlap. To merge the sub-sequences’ scores for the original series, we take their average. When AMS is involved, we use the simple strategy \(h_S(x_{1,\cdots ,lS}) = \{x'_{1,\cdots ,l} \mid x'_i = x_{(i-1)S+j} \}\) with \(j=1\), and we take the expected value of the scores over all scaling parameters as the final score for the original series.
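The sketch below illustrates this test-time fusion under the setting above (minimum-overlap segmentation, score averaging, and the expectation over scaling parameters); score_fn stands in for the trained CNN’s probability output on a length-l input and is a placeholder, as is the 0-based element index j.

```python
# Test-time fusion as described above: chop the original series with minimum
# overlap, average the sub-sequence scores per scale, then take the expectation
# of the per-scale scores under P(S). `score_fn` is a placeholder for the CNN.
import numpy as np

def segments_min_overlap(x, seg_len):
    """Cover x with windows of seg_len using the smallest possible overlap."""
    assert len(x) >= seg_len, "series must be at least one window long"
    n_seg = max(1, int(np.ceil(len(x) / seg_len)))
    starts = np.linspace(0, len(x) - seg_len, n_seg).astype(int)
    return [x[s:s + seg_len] for s in starts]

def series_score(x, score_fn, S_values, P_S, l, j=0):
    per_scale = []
    for S in S_values:
        subs = segments_min_overlap(x, l * S)
        scaled = [sub[j::S][:l] for sub in subs]       # x'_i = x_{(i-1)S+j}, j fixed
        per_scale.append(np.mean([score_fn(s) for s in scaled]))
    return float(np.dot(P_S, per_scale))               # expectation over scaling parameters
```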

Table 2. NAB statistics (the scores are averaged accuracies)

4.2 Experiments on Dataset 1

Dataset Description. The Numenta Anomaly Benchmark (NAB) [16] is a dataset consisting of over 50 labeled real-world and artificial time series data files. The data include, for example, Amazon Web Services (AWS) server metrics, freeway traffic, and tweet volumes. From the real-world datasets, we choose those whose length exceeds 10K and which contain at least 2 anomalies. The resulting selection is summarized in Table 2; due to space restrictions we refer the readers to [16] for a detailed description of NAB. The default segmentation window sizes are a quarter of the sequence lengths and \(S=\{1,2,3,4\}\).

Table 3. Dataset statistics

Results and Discussions. For each dataset, we conducted 11 trials, each of which is a 2-fold cross validation, and we report the averaged accuracies in Table 2; the winning scores are in boldface, and the last column indicates whether the winning method passes the t-test. From a macro view, our proposals beat the baseline on every dataset, demonstrating the generalisability of our methods. Zooming in for a micro analysis, we notice AMS alone presents a strong improvement (10 improvements out of 12 datasets). In contrast, APR alone does not deliver satisfying results (though winning 7 out of 12, few passed the t-test); one possibility is that the default segmentation window size is inappropriate and prevents APR from finding the common factor in such a setting. This explanation is supported by the results from the combination of AMS and APR, which show significant improvements on almost all datasets.

4.3 Experiments on Dataset 2

Dataset Description. This is a dataset from a past Kaggle data analysis contest. The dataset contains EEG recordings from 5 dogs and 2 humans, and the task is to distinguish between ten-minute-long data clips covering an hour prior to a seizure (preictal segments) and ten-minute clips of interictal activity. Preictal segments (positive class) are provided covering one hour prior to seizure with a five-minute seizure horizon, and interictal segments (negative class) were chosen randomly from the full data record, with the restriction that interictal segments be as far from any seizure as can be practically achieved. Test segments without ground-truth labels are provided for final evaluation. Table 3 summarizes the statistics of the datasets. We down-sample the patients’ recordings to 500 Hz and define the segmentation window size to be 4000. We conducted 3-fold cross validation on each subject.

Results and Discussions. In this Kaggle contest, AUROC was the evaluation metric and 504 teams in total made submissions. We report the scores from 2 result-merging schemes: mean ensemble (predictions are the averaged scores from the cross-validation models) and max ensemble (predictions are the scores from the cross-validation model with the best validation score). The scores and ranks are summarized in Table 4; because this contest is over, we are able to receive a (Public, Private) score pair for each submission. For APR and AMS + APR, the best scores from the grid search on \(\lambda \) are reported. We notice first that even the baseline method achieves high ranks (top 26% for mean ensemble and top 15% for max ensemble), which is consistent with recent reports on successful applications of deep learning to general datasets beyond images and audio. Secondly, we notice the max ensemble gives higher scores than the mean ensemble, but the trend within either group is consistent. When AMS is activated, we observe obvious improvements on both public and private scores. Compared to the baseline, the combination of AMS and APR gives an average relative improvement of nearly 10%, and sent us as high as 14th place on the leaderboard. The observations here are consistent with the results from Experiment 1 and emphasize the stability of our proposals.

Table 4. Kaggle submission (scores are AUROCs)

4.4 Analyzing AMS and APR

We give some analysis of AMS and APR in this section. Because the effects and trends of our analysis are similar for both experiments, we show the statistics from Experiment 2 only, for the sake of brevity.

Figure 1 gives the distribution of scaling parameters for each subject after applying AMS. Our initial guess of the input length (4000, equivalent to a 10 s/8 s time frame for dogs/humans) is not optimal, and AMS learnt to find combinations of sub-sequence lengths better suited for the task. Although AMS puts most weight on the largest scaling parameter for dogs 1, 2, 3, and 5, it considers combinations of sub-sequence lengths for dog 4 and both human patients. This suggests AMS does not always prefer longer sequences but is indeed looking for patterns residing in sub-sequences of different lengths that are optimal for the task. Figure 2 gives the evolution of the training loss, validation accuracy, and validation f-score for both the baseline and AMS. For almost all subjects and on any of the three criteria, we find AMS learns faster than the baseline. One might attribute the faster training loss convergence to the fact that in these trials AMS kept selecting longer sequences, leading to smaller sample spaces, and a large network can easily fit a smaller sample space, leading to faster convergence. This might be partially responsible, but it is also well known that with the same network capacity and a smaller sample space, a flexible model like a CNN can easily overfit the training dataset. However, we also see faster-rising accuracy and f-score curves, and we do not observe any sign of overfitting in Fig. 2. We hence argue the faster learning speed (in terms of both training loss and validation scores) is indeed due to the involvement of AMS.

Fig. 1. Scaling parameters distribution (\(S_1=1, S_2=2, S_3=3, S_4=4\)).

Fig. 2. Learning curves of baseline and AMS (depending on the line, the Y-axes stand for binary cross-entropy loss, accuracy, or f-score; the X-axes are training iterations in hundreds; shorter lines are due to early stopping).

To analyze the difference between the baseline and APR, we select a random batch of validation data for both models and check the differences in their activation patterns. Concretely, from the models learnt in cross validation, we pick a baseline model and a series of APR models with different \(\lambda \) settings from the same fold. We feed the validation batch of preictal segments into the baseline model and the series of APR models, and we record the Gramian matrices \(\mathbf {Z}\) that encode their activation patterns. For each model, we calculate the variance of each entry of the Gramian matrices and analyze the difference \({var}(\mathbf {Z^{(baseline)}}) - {var}(\mathbf {Z^{(APR)}})\), where the variance is taken over each entry of the matrices. Figure 3 illustrates this. We have scaled the values to have unit standard deviation and have labelled the 0-level in each diagram for visual convenience. In order to find patterns, we have put the settings that achieve higher or equal (valid_accuracy, valid_f_score) scores than the baseline in green and leave the rest in black. With a few exceptions, we find from this figure that winning settings tend to have more positive \({var}(\mathbf {Z^{(baseline)}}) - {var}(\mathbf {Z^{(APR)}})\) entries (upward-pointing spikes), meaning lower activation pattern variances compared to the baseline.
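For reference, the diagnostic behind Fig. 3 can be sketched as follows (names are ours): compute the per-entry variance of the flattened Gramian matrices over the batch of preictal segments for each model, then take the difference.

```python
# Per-entry variance of the flattened Gramians over a batch of positive samples;
# the Fig. 3 diagnostic is the difference between the baseline and an APR model.
import numpy as np

def gram_entry_variance(acts):
    """acts: (N, l', n) last-conv-layer activations for the positive batch."""
    grams = np.einsum('bti,btj->bij', acts, acts)      # z_i^T z_i for each sample
    Z = grams.reshape(acts.shape[0], -1)               # flattened Gramians
    return Z.var(axis=0)                               # variance of each of the n^2 entries

# diff = gram_entry_variance(acts_baseline) - gram_entry_variance(acts_apr)
# entries with diff > 0 mean APR lowered the activation-pattern variance there.
```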

Fig. 3. Difference of positive samples’ activation variances (settings with higher or equal (valid_accuracy, valid_f_score) scores are in green). (Color figure online)

5 Conclusion

We addressed two previously unexplored problems in the context of binary classification of sequences using CNNs where the data possess a unilateral common factor: (1) determining an optimal segmentation window size that provides sufficient data augmentation while avoiding serious multiple-instance learning problems; (2) helping the CNN concentrate on learning the common representations that capture the unilateral common factor. We proposed AMS to solve the first problem; it automatically searches for a combination of sub-sequence lengths by learning a set of parameters controlling the segmentation window size jointly with the CNN parameters. We use APR to shift the CNN’s attention to positive samples by minimizing the variances of the entries of the Gramian matrices formed from the last convolutional layer’s activations. In our experiments, AMS alone is able to give performance boosts, and when APR is added the improvements are significant. Our extensive experiments on multiple real-world tasks demonstrate the effectiveness, generalisability, and stability of AMS and APR.