
1 Introduction

The load forecasting problem, which commonly arises in the context of smart grid systems, deals with predicting the future energy demands of consumers from their past load consumption. Load forecasting has been researched extensively; however, predicting future loads with high accuracy remains an open problem to date.

With the advancement of deep learning models, complex patterns in sequential input data such as time series can now be identified better than with conventional machine learning models. Recently, deep neural network (DNN) based models have proved useful in load forecasting problems. In [14], the authors use a DNN-based model for load forecasting trained in two different ways: using a pre-trained restricted Boltzmann machine (RBM), and using the rectified linear unit (ReLU) without pre-training. To better capture the temporal dependencies in historical load data, several state-of-the-art deep learning models have been proposed: recurrent neural networks (RNN) [5, 15, 16], long short-term memory (LSTM) [9, 10, 13] and convolutional neural networks (CNN) [2, 3]. In [10], LSTM-based predictive models are used for individual-house-level and aggregate-level forecasts. The authors show that the accuracy drops as the level of aggregation decreases, but do not state a reason for this. Later, in [9], the authors report that accuracy at the individual house level can be improved if appliance readings of the house are included in the training data. Rahman et al. [13] have shown that, in addition to short-term dependencies, LSTM models can capture long-term dependencies, obtaining long-term hour-ahead forecasts for building energy loads. However, RNN- and LSTM-based models have long training times and are difficult to train due to the vanishing gradient problem. The advantage of CNNs over the other two popular DNN-based models is that CNNs can be trained efficiently on a smaller training dataset without compromising performance or overfitting. Although the sliding filters in a CNN help identify patterns in historic load data for future load prediction, Borovykh et al. [3] used a dilated convolutional neural network (DCNN), for the first time in a load forecasting problem, to access a broader range of history and thus capture trend and seasonality. Inspired by the WaveNet architecture [11], the DCNN-based model in [3], called Augmented WaveNet, has a deep stack of dilated convolution layers that draws on a wide range of historic data when forecasting future values.

In this paper we study the problem of load forecasting at the building level using deep CNNs, where the forecast values follow the trend and seasonality present in the historic data. Motivated by the Augmented WaveNet model [3], which can learn from broad historic context, the proposed model, the Dilated Convolutional Dense Network (DaNSe), is designed using multiple dilated causal convolutional layers with residuals and parameterized skip connections, followed by multiple fully connected layers at the output. The dilation operation lets the output predictions draw on an extensive range of historic data. The residuals and parameterized skip connections in each layer of the proposed model speed up convergence and allow the deeper layers to be trained without overfitting. To the best of our knowledge, the only work similar to the proposed model is the SeriesNet architecture [12], an enhancement of the Augmented WaveNet model that uses parameterized skip connections from each dilated convolutional layer to the output layer. Compared to the SeriesNet architecture [12], the proposed model better captures the non-linear trend and seasonality in time series data, resulting in improved accuracy. Experiments on synthetic and real-life time series datasets show the improvement of the proposed model over the existing SeriesNet model. This paper is arranged as follows. Section 2 explains the state-of-the-art techniques, followed by the architectural details of the proposed model, DaNSe. Section 3 reports the experiments and results, followed by conclusions and future work in Sect. 4.

2 Methodology

The proposed deep learning model, the Dilated Convolutional Dense Network (DaNSe), is designed using stacked dilated causal layers with residual connections and SeLU activation, followed by fully connected layers with ReLU activation. The key components of the DaNSe model are discussed below.

Fig. 1. (a) DaNSe architecture (b) A residual block

  • Dilated Convolutional Neural Networks (DCNN): The dilated convolution operator has in the past been referred to as "convolution with a dilated filter". A dilated filter is an up-sampled convolution filter, obtained by injecting predefined gaps between the filter weights. The term causal in dilated causal networks indicates that the ordering in the time series data is maintained [11]. The l-dilated convolution of two functions f() and g() in one-dimensional space is represented as (a minimal numerical sketch of this operation is given after this list):

    $$\begin{aligned} (f*_{l}g)(t)=\sum _{\tau =-\infty }^{\infty } f(\tau )\, g(t-l\tau ) \end{aligned}$$
    (1)

    where l is the dilation factor; for \(l=1\) the operation reduces to an ordinary convolution.

  • Residual Connections: As the number of layers in a deep model is increased, a degradation in accuracy signifies that the shallower counterpart of the network is learning well but the deeper counterpart is not. In order to construct the deeper counterpart of a shallower network, the idea of skip connections or residual connections between layers was proposed in [7]. If F(x) is the underlying mapping of the model for input x, the stacked non-linear layers are used to fit another mapping, \(R(x)=F(x)-x\), which is easier to optimize. Hence the original mapping is \(F(x)=R(x)+x\). Figure 1b shows a residual block.

  • Activations: The activation functions used in the model are:

    • Rectified linear unit (ReLU): A piecewise-linear activation that outputs the input when it is positive, and zero otherwise. If x is the input, \(relu(x)=max(0,x)\).

    • Scaled Exponential Linear Unit (SeLU): SeLU pushes the neuron activations towards zero mean and unit variance [8], giving the network a self-normalizing property. The SeLU activation function is represented as:

      $$\begin{aligned} \mathrm {selu}(x)=\gamma {\left\{ \begin{array}{ll} \alpha (\mathrm{e}^x-1) &{} \text {for } x \le 0\\ x &{} \text {for } x > 0 \end{array}\right. } \end{aligned}$$
      (2)

      where \(\alpha \approx 1.6733\) and \(\gamma \approx 1.0507\) are fixed parameters, derived so that inputs with mean 0 and standard deviation 1 retain those statistics [8].

  • Fig. 1a shows the model architecture. The model stacks dilated causal layers, each with a residual connection from its input to its output; their sum is the input to the next dilated causal layer. The SeLU-activated output of each layer is parameterized by a \(1 \times 1\) convolution; these skip outputs are then summed and fed to fully connected layers, followed by a \(1\times 1\) convolution that produces the final output (a hedged sketch of this stack is given after this list). Instead of passing the sum directly into a ReLU activation as in SeriesNet [12], the fully connected layers help reduce the sparsity of the ReLU activations and give more accurate predictions. A dropout of \(80\%\) is used in the last two layers of the model to reduce overfitting. Model weights are learned by minimizing the mean absolute error (MAE) with L2 regularization, which penalizes large weights to avoid overfitting. The model uses the adaptive momentum (Adam) optimization technique, with weight updates given by:

    $$\begin{aligned} \varDelta w_t=-\eta \, v_t / \sqrt{s_t+\epsilon } \end{aligned}$$
    (3)

    where \(\varDelta w_t\) is the update for weight \(w_t\), \(\eta \) is the learning rate, \(v_t\) is the exponential moving average of the gradients of \(w_t\), \(s_t\) is the exponential moving average of the squared gradients of \(w_t\), and \(\epsilon \) is a small constant for numerical stability.
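To make the dilation in Eq. (1) concrete, the following minimal sketch (our illustration, not code from the paper) applies a causal convolution with a two-tap averaging filter at dilations 1 and 2; the series and filter weights are arbitrary toy values.

```python
import numpy as np

def dilated_causal_conv(x, w, dilation=1):
    """Causal convolution of series x with filter w, with gaps of size
    `dilation` injected between the filter taps (cf. Eq. (1))."""
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        for i, wi in enumerate(w):
            j = t - i * dilation   # only past (causal) samples contribute
            if j >= 0:
                y[t] += wi * x[j]
    return y

x = np.arange(8, dtype=float)      # toy series 0, 1, ..., 7
w = [0.5, 0.5]                     # simple two-tap averaging filter
print(dilated_causal_conv(x, w, dilation=1))   # receptive field of 2 steps
print(dilated_causal_conv(x, w, dilation=2))   # receptive field of 3 steps
```

Doubling the dilation at every layer, as in WaveNet [11], makes the receptive field grow exponentially with depth while the number of parameters grows only linearly.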
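Putting the components together, a hedged Keras sketch of a DaNSe-like stack is given below. The seven dilated causal layers with filter widths 2 and 4, the three 32-unit ReLU dense layers, the 80% dropout and the L2 factor follow Sect. 3.1; the number of convolution filters and the doubling dilation schedule are our assumptions, as the paper does not state them.

```python
from tensorflow.keras import Model, layers, regularizers

def build_danse(input_len, n_filters=32, l2=1e-3):
    """Sketch of a DaNSe-like network; n_filters and the dilation
    schedule (2**i) are illustrative assumptions."""
    inp = layers.Input(shape=(input_len, 1))
    x = layers.Conv1D(n_filters, 1)(inp)   # lift input to n_filters channels
    skips = []
    for i, width in enumerate([2, 2, 2, 4, 4, 4, 4]):  # widths per Sect. 3.1
        h = layers.Conv1D(n_filters, width, padding="causal",
                          dilation_rate=2 ** i,
                          kernel_regularizer=regularizers.l2(l2))(x)
        h = layers.Activation("selu")(h)
        skips.append(layers.Conv1D(1, 1)(h))           # parameterized skip
        x = layers.Add()([x, layers.Conv1D(n_filters, 1)(h)])  # residual sum
    s = layers.Add()(skips)                 # sum of all skip connections
    for i in range(3):                      # fully connected head
        s = layers.Dense(32, activation="relu")(s)
        if i >= 1:                          # 80% dropout, last two layers
            s = layers.Dropout(0.8)(s)
    out = layers.Conv1D(1, 1)(s)            # final 1x1 convolution
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="mae")  # Adam (Eq. 3) with MAE loss
    return model
```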

3 Experiment and Results

3.1 Experimental Setup

The datasets used in the experiments and the metrics used for measuring performance are discussed below. The reported metric values are averaged over 10 runs.

  • Data Description: The model has been tested on three different datasets: the CIF 2016 competition dataset [6], the CER-IRISH SME dataset [1] and a Turkish electricity dataset [4]. The CIF-2016 benchmark dataset comprises 72 real and synthetic monthly time series with lengths varying between 23 and 108 months. The Commission for Energy Regulation (CER) IRISH dataset comprises half-hourly power readings of 311 SMEs in Ireland during 2009-10. The Turkish electricity load data contains daily loads (in MW) for a period of nine years, from 2000 to 2008 [4], with dual seasonality: weekly and yearly.

  • Error Metrics: The error metrics used in the comparative study are SMAPE and the distribution of APE, as discussed below (a numpy sketch of both metrics is given after this list).

    1.

      Symmetric mean absolute percentage error (SMAPE): SMAPE is defined below. Its values range from 0 to 2; a lower SMAPE indicates a better match between forecasts and actuals.

      $$\begin{aligned} SMAPE=\frac{1}{N}\sum _{i=1}^{N}\bigg (\frac{1}{n}\sum _{t=1}^{n}\frac{\mid F_t-A_t \mid }{(\mid F_t \mid + \mid A_t \mid )/2}\bigg ) \end{aligned}$$
      (4)

      where \(A_t\) and \(F_t\) are the actual and forecast values respectively, n is the length of a time series and N is the number of individual time series being considered.

    2.

      Absolute percentage error (APE): In order to quantify the error distribution over individual time series, we use APE, defined below.

      $$\begin{aligned} APE=\frac{1}{n}\sum _{t=1}^{n}\frac{\mid F_t-A_t \mid }{(\mid F_t \mid + \mid A_t \mid )/2} \end{aligned}$$
      (5)

      where \(A_t\) and \(F_t\) are the actual and forecast values respectively, and n is the length of the time series.

  • Hyper-parameter selection and model training: The initial weights of the proposed model are drawn randomly from a truncated normal distribution with zero mean and 0.05 standard deviation. An L2 regularization factor of 0.001 is used. The proposed model has seven dilated causal layers with varying filter widths. The initial three layers, with a filter width of 2, are intended to capture short-duration trend or seasonal patterns in the time series data; to capture longer-duration trends, seasonal patterns and cyclic periodicities, a filter width of 4 proved adequate. We used three fully connected layers, each with 32 hidden units and ReLU activations. The proposed model has been trained for 3000 epochs.

  • Data Preprocessing: The CIF-2016 data required no pre-processing; minimal pre-processing was carried out on the other two datasets. The SME dataset had missing values, which we replaced by moving averages, and for our study we converted the half-hourly power readings of the SMEs into hourly data (a hedged sketch of these steps is given after this list). The train and test sets for the CIF-2016 data are predefined. For the SMEs we kept the last 24 h as the test set, and for the Turkish electricity data we kept the last year as the test set.
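Equations (4) and (5) translate directly into numpy; the sketch below is our illustration of the two metrics, not code from the paper.

```python
import numpy as np

def ape(forecast, actual):
    """Absolute percentage error of a single series (Eq. 5)."""
    f, a = np.asarray(forecast, float), np.asarray(actual, float)
    return np.mean(np.abs(f - a) / ((np.abs(f) + np.abs(a)) / 2.0))

def smape(forecasts, actuals):
    """SMAPE (Eq. 4): mean of the per-series APE over N series."""
    return np.mean([ape(f, a) for f, a in zip(forecasts, actuals)])

# sanity check: a perfect forecast gives zero error
print(smape([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]]))   # 0.0
```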
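The SME pre-processing steps described above could look as follows in pandas; the column name load, the moving-average window, and summing (rather than averaging) the half-hourly readings are illustrative assumptions not stated in the paper.

```python
import pandas as pd

def preprocess_sme(df):
    """df: half-hourly SME readings indexed by timestamp, with a 'load'
    column (hypothetical schema). Returns hourly train and test series."""
    # replace missing values with a centred moving average (window assumed)
    df["load"] = df["load"].fillna(
        df["load"].rolling(window=4, min_periods=1, center=True).mean())
    # convert half-hourly readings to hourly data
    hourly = df["load"].resample("1h").sum()
    # hold out the last 24 hours as the test set (Sect. 3.1)
    return hourly.iloc[:-24], hourly.iloc[-24:]
```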

3.2 Results and Analysis

As shown in Table 1, the proposed model DaNSe significantly outperforms SeriesNet on the SME and Turkish electricity data, while the performance is comparable on the CIF-2016 dataset. For the SME and Turkish electricity data, DaNSe efficiently learns the multiple seasonalities in the data, giving better results. The time required for achieving this performance gain with DaNSe is the same as for the SeriesNet architecture. DaNSe has also shown improved performance over classical single-layered CNN and LSTM models. The improvement stems from the fully connected layers, which reduce the sparsity of the ReLU activations.

Table 1. Comparison of DaNSe and SeriesNet architecture for different datasets
Fig. 2. Error distribution plot for SMEs, CIF-2016 and Turkish electricity data

We further analyze the error distribution for all three datasets in terms of APE, shown in Fig. 2. The outliers observed on the CIF-2016 data for both models are due to the time series that have less than 6 months of training data. On the SMEs, we found a high error rate for SeriesNet whenever cyclic periodicity exists in the data. The Turkish electricity data is a single nine-year series, on which DaNSe shows a significant improvement over SeriesNet.

Fig. 3. Comparison of DaNSe and SeriesNet on the SME, CIF-2016 and Turkish electricity datasets

Further, we analyze the performance on a few randomly selected time series from the CIF-2016 and SME data. In Fig. 3, the actual versus predicted consumption is shown for 4 randomly chosen time series of the CIF-2016 dataset. As shown in Fig. 3, time series 13 and 19 have a linear increasing trend, while time series 6 and 30 have a non-linear damped trend pattern. DaNSe significantly outperforms SeriesNet in all cases, particularly for the non-linear damped pattern.

The actual versus predicted consumption for 4 randomly chosen meter IDs of the SME dataset is also shown in Fig. 3. The proposed model successfully captures the daily and weekly seasonality for meter IDs 4623, 6939 and 2687. For meter ID 2242, both methods exhibit degraded performance due to cyclic behavior in the data, while for the remaining meter IDs the proposed model significantly outperforms SeriesNet.

Figure 3 shows that the proposed model has \(33\%\) higher accuracy than SeriesNet. The weekly seasonality is learned more efficiently by the proposed model than the yearly seasonal pattern.

4 Conclusions and Future Work

In this work, we propose a dilated causal convolutional network model with fully connected layers for load forecasting. The proposed model, named the DaNSe architecture, achieves \(33\%\) higher accuracy than the existing SeriesNet architecture [12]. Adding fully connected layers with a combination of SeLU and ReLU non-linear activations shows a significant improvement over the SeriesNet architecture. The lower layers of the model, with small filter widths, learn the patterns with small periodicity in the data, while the higher layers, with larger filter widths, learn the patterns with larger periodicity. The proposed model gives more accurate results than SeriesNet for short as well as long seasonal patterns.

As future improvements to the DaNSe model, we aim to explore its performance for varying forecast horizons, to apply distributed learning approaches, and to automate the hyper-parameter tuning. We also aim to explore the possibility of incorporating gated memory layers capable of learning the cyclic periodicity present in the time series data.