
1 Introduction

In order to integrate wind power into an electric grid, it is necessary to estimate how much energy will be generated in the next few hours. This task is highly complex, however, because wind power depends on wind speed, which is random in nature. It is worth mentioning that prediction errors may increase the operating costs of the electric system, since system operators would need to use peaking generators to compensate for an unexpected interruption of the resource, and they may also reduce the reliability of the system [12]. Several models have been proposed in the literature to address the problem of forecasting wind power or wind speed [7]. Among the different alternatives, machine learning models have gained popularity for achieving good results with fewer restrictions than statistical models [11]. In particular, recurrent neural networks (RNN) [15] have become popular because their architecture can process temporal sequences naturally, relating events from different time periods (i.e., with memory). However, gradient-descent-type training methods suffer from the vanishing gradient problem [1], which makes it difficult to relate current events to events in the distant past, i.e., it may hurt long-term memory. An alternative to tackle this problem is the Long Short-Term Memory (LSTM) network [4], an architecture that replaces the traditional artificial neuron (perceptron) with a memory block formed by gates and memory cells that control the flow of information. This model was evaluated in [9] for predicting wind speed from 1 to 24 steps ahead. Empirical results showed that LSTM is competitive in terms of accuracy against two other neural network methods. However, its training algorithm demands high computation time due to the complexity of its architecture. In this paper we propose an efficient alternative method to train LSTM. The proposed method divides the training process into two stages: the first stage uses ridge regression to improve the weight initialization; then the LSTM is trained to update the weights in an online fashion. The proposal is evaluated using standard metrics [10] for wind speed forecasting at three geographical points of Chile, areas where accurate forecasts must be provided in less than one hour. We consider wind speed, wind direction, ambient temperature and relative humidity as input features of the multivariate time series.

The rest of the paper is organized as follows. Section 2 describes the LSTM model. Section 3 presents the proposed approach for training the LSTM. Section 4 describes the experimental setting on which we tested the method for the different data sources. Finally, the last section is devoted to conclusions and future work.

2 Long Short-Term Memory

Long Short-Term Memory (LSTM) [4] is a class of recurrent neural network that replaces the traditional neuron (perceptron) in the hidden layer with a memory block. This block is composed of one or more memory cells and three gates that control the information flow passing through the block by means of sigmoid activation functions with range [0, 1]. Each memory cell is a self-connected unit called the “Constant Error Carousel” (CEC), whose activation is the state of the cell (see Fig. 1).

Fig. 1. LSTM architecture with 1 block and 1 cell.

All outputs of each memory block (gates, cells and block output) are connected to every input of all blocks, i.e., there is full connectivity among the hidden units. Let \(net_{in_j}(t)\), \(net_{\varphi _j}(t)\) and \(net_{out_j}(t)\) be the weighted sums of the inputs for the input, forget and output gates, described in Eqs. (1), (3) and (7), respectively, where j indexes memory blocks. Let \(y^{in_j}(t)\), \(y^{\varphi _j}(t)\) and \(y^{out_j}(t)\) be the outputs of the activation functions (\(f_{in_j}(.)\), \(f_{\varphi _j}(.)\), \(f_{out_j}(.)\), logistic functions with range [0, 1]) of each gate. Let \(net_{c_j^v}(t)\) be the input of the vth CEC associated with block j and \(s_{c_j^v}(t)\) its state at time t. Let \(y^{c_j^v}(t)\) be the output of the vth memory cell of the jth block and \(S_j\) the number of cells of block j. Then the information flow (forward pass) follows the sequence:

$$\begin{aligned} net_{in_j}(t)&= \sum _{m}w_{in_jm} \cdot y^m(t-1) + \sum _{v=1}^{S_j} w_{in_j c_j^v} \cdot s_{c_j^v}(t-1), \end{aligned}$$
(1)
$$\begin{aligned} y^{in_j}(t)&= f_{in_j}(net_{in_j}(t)), \end{aligned}$$
(2)
$$\begin{aligned} net_{\varphi _j}(t)&= \sum _{m}w_{\varphi _jm} \cdot y^m(t-1) + \sum _{v=1}^{S_j} w_{\varphi _j c_j^v} \cdot s_{c_j^v}(t-1), \end{aligned}$$
(3)
$$\begin{aligned} y^{\varphi _j}(t)&= f_{\varphi _j}(net_{\varphi _j}(t)), \end{aligned}$$
(4)
$$\begin{aligned} net_{c_j^v}(t)&= \sum _{m}w_{c_j^v m} \cdot y^m(t-1), \end{aligned}$$
(5)
$$\begin{aligned} s_{c_j^v}(t)&= y^{\varphi _j}(t) \cdot s_{c_j^v}(t-1) + y^{in_j}(t) \cdot g(net_{c_j^v}(t)), \end{aligned}$$
(6)
$$\begin{aligned} net_{out_j}(t)&= \sum _m w_{out_j m} \cdot y^m(t-1) + \sum _{v=1}^{S_j} w_{out_j c_j^v} \cdot s_{c_j^v}(t), \end{aligned}$$
(7)
$$\begin{aligned} y^{out_j}(t)&= f_{out_j}(net_{out_j}(t)), \end{aligned}$$
(8)
$$\begin{aligned} y^{c_j^v}(t)&= y^{out_j}(t) \cdot h(s_{c_j^v}(t)). \end{aligned}$$
(9)

Here \(w_{rm}\) is the weight from unit m to unit r; \(y^{m}(t-1)\) is the mth input of the respective unit at time \(t-1\); and g(.) and h(.) are hyperbolic tangent activation functions with ranges \([-2,2]\) and \([-1,1]\), respectively. For a more comprehensive study of this technique, please refer to [3].
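To make the forward pass concrete, the following is a minimal NumPy sketch of Eqs. (1)–(9) for a single memory block. It is only an illustration under our own naming conventions (the weight dictionary `W` and the choice \(g(x)=2\tanh (x)\) are assumptions), not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_block_forward(y_prev, s_prev, W):
    """One forward step of a single LSTM memory block j (Eqs. 1-9).

    y_prev : 1-D array with the outputs y^m(t-1) of all units feeding the block
    s_prev : 1-D array with the states s_{c_j^v}(t-1) of the block's S_j cells
    W      : dict of weights (our naming): 'in', 'phi', 'out' are vectors over m;
             'in_peep', 'phi_peep', 'out_peep' are vectors over the S_j cells;
             'c' is an (S_j x m) matrix for the cell inputs.
    """
    # Input gate (Eqs. 1-2): one scalar gate per block, with peephole terms
    # from the previous cell states.
    y_in = sigmoid(W["in"] @ y_prev + W["in_peep"] @ s_prev)

    # Forget gate (Eqs. 3-4).
    y_phi = sigmoid(W["phi"] @ y_prev + W["phi_peep"] @ s_prev)

    # Cell inputs and state update (Eqs. 5-6); g squashes to [-2, 2].
    g = 2.0 * np.tanh(W["c"] @ y_prev)
    s = y_phi * s_prev + y_in * g

    # Output gate (Eqs. 7-8): its peepholes read the updated states s(t).
    y_out = sigmoid(W["out"] @ y_prev + W["out_peep"] @ s)

    # Cell outputs (Eq. 9); h squashes to [-1, 1].
    y_c = y_out * np.tanh(s)
    return y_c, s
```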

The CEC alleviates the vanishing (or exploding) gradient problem [1], since the local error flowing back through the CEC remains constant (neither growing nor decaying) as long as no new input or external error signal arrives. The training procedure is based on a modification of Backpropagation Through Time [13] and a version of Real-Time Recurrent Learning [14]. The main architectural parameters are the number of blocks, the number of cells per block, and the numbers of input and output neurons. For the training process the following hyperparameters need to be defined: the activation functions, the number of iterations and the learning rate \(\alpha \in [0,1]\). This technique has shown accurate results in classification and forecasting problems. However, it is computationally expensive, and therefore its architecture does not scale well [8].

3 An Efficient Training for LSTM

The LSTM architecture demands high computation time during training to find the optimal weights, and this cost increases considerably when either the number of blocks or the number of cells grows.

To address this problem, we propose a new training method that reduces the computational cost while maintaining the level of performance. The classical LSTM initializes the weights randomly; however, this starting point may be far from optimal, and consequently the training algorithm may take more epochs to converge. To address this drawback, we propose a fast method to find a better starting point in the hypothesis space: we evaluate a number of instances and use the resulting output signals to perform a ridge regression that yields the output-layer weights. Finally, we train the LSTM in an online fashion.

Algorithm 1 describes our training method. It considers a network of three layers (input-hidden-output), where the hidden layer is composed of memory blocks and the output layer of simple perceptron units. Moreover, T is the length of the training series, \(n_{in}\) is the number of units of the input layer (the number of lags), \(n_h\) is the number of units in the hidden layer, and \(n_o\) is the number of units of the output layer. Let Y be the \(T \times n_o\) matrix containing the target outputs associated with each input vector.

Algorithm 1. Proposed training method.

The first stage (steps 1 to 3) of the algorithm finds a good starting point for the LSTM. In step 1, all network weights are initialized from a uniform distribution over \([-0.5, 0.5]\). Next, the matrix S, containing all memory cell outputs, is computed: each row of this matrix corresponds to the outputs of the units directly connected to the output layer given an input vector \(\mathbf {x}(t) = (x_1(t), \dots , x_{n_{in}}(t))\), as described in step 2. Thus, the target estimations can be written as:

$$\begin{aligned} \hat{Y} = S \cdot W_{out}, \end{aligned}$$

where \(W_{out}\) is a \((n_{in}+ \sum _{j=1}^{n_h} S_j)\times n_o\) matrix containing the output-layer weights. Then, \(W_{out}\) can be estimated by ridge regression, as shown in step 3, where \(S'\) is the transpose of matrix S and I is the identity matrix. In the second stage, the LSTM network is trained on a set of instances by incremental learning, i.e., the weights are updated after each new instance is received. Note that this approach is similar to the way extreme learning machines (ELM) [5] adjust the weights of the output layer; that approach is well known for its interpolation and universal approximation capabilities [6]. In contrast, here we use this fast method only to find a reliable starting point for the network.
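As an illustration, a minimal NumPy sketch of the first-stage initialization (step 3) is given below. The regularization constant and the function names are our own assumptions; only the closed-form structure involving \(S'\) and I is stated in the algorithm.

```python
import numpy as np

def init_output_weights(S, Y, lam=1e-3):
    """Stage 1, step 3: ridge-regression estimate of W_out.

    S   : T x (n_in + sum_j S_j) matrix; each row holds the outputs of the
          units directly connected to the output layer for one input vector
    Y   : T x n_o matrix of target outputs
    lam : ridge penalty (value not specified in the paper; treated here as
          a hyperparameter)
    """
    d = S.shape[1]
    # W_out = (S'S + lam*I)^{-1} S'Y, solved without forming the inverse.
    return np.linalg.solve(S.T @ S + lam * np.eye(d), S.T @ Y)

# Overall two-stage flow (helper names hypothetical):
#   1. initialize all weights ~ Uniform(-0.5, 0.5)
#   2. run the forward pass over the training series to collect S
#   3. W_out = init_output_weights(S, Y)
#   4. train the LSTM online, updating the weights after each new instance
```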

4 Experiments and Results

In order to assess our proposal, we use three data sets from different geographic points of Chile: Data 1, code b08 (\(22.54^{\circ }\)S, \(69.08^{\circ }\)W); Data 2, code b21 (\(22.92^{\circ }\)S, \(69.04^{\circ }\)W); and Data 3, code d02 (\(25.1^{\circ }\)S, \(69.96^{\circ }\)W). These data are provided by the Department of Geophysics of the Faculty of Physical and Mathematical Sciences of the University of Chile, commissioned by the Ministry of Energy of the Republic of Chile.

We work with hourly time series containing no missing values. The attributes considered for the study are: wind speed at 10 m height (m/s), wind direction at 10 m height (degrees), temperature at 5 m height (\({}^{\circ }\)C), and relative humidity at 5 m height. The series span from 00:00 on December 1, 2013 to 23:00 on March 31, 2015. Each feature is scaled to \([-1,1]\) using the min-max function.
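For reference, the per-feature min-max scaling to \([-1,1]\) can be sketched as follows (a standard transform; the function name is ours):

```python
import numpy as np

def minmax_scale(x, lo=-1.0, hi=1.0):
    """Scale a 1-D feature series to [lo, hi] using its own minimum and maximum."""
    xmin, xmax = x.min(), x.max()
    # Assumes xmax > xmin, i.e., the feature is not constant.
    return lo + (x - xmin) * (hi - lo) / (xmax - xmin)
```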

To evaluate the model accuracy, each available series is divided into \(R=10\) subsets using a four-month sliding window with a shift of 500 points (approximately 20 days), as depicted in Fig. 2 and sketched below.

Fig. 2. Sliding window approach to evaluate model accuracy.
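One possible reading of this splitting scheme is the following sketch; the window length in points is our approximation (four 30-day months of hourly data) and is not a value given in the paper.

```python
def sliding_windows(n_points, window=4 * 30 * 24, shift=500, n_subsets=10):
    """Return (start, end) index pairs for the R evaluation subsets."""
    return [(r * shift, r * shift + window)
            for r in range(n_subsets)
            if r * shift + window <= n_points]
```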

Then, to measure the accuracy of the model, we average over the subsets, computing three standard metrics [10] based on the error \(e_r(T+h|T) = y_r(T+h) - \hat{y}_r(T+h|T)\):

$$\begin{aligned} \text{ MAE }(h)&= \frac{1}{R} \sum _{r=1}^R |e_r(T+h|T)| \end{aligned}$$
(10)
$$\begin{aligned} \text{ MAPE }(h)&= \frac{1}{R} \sum _{r=1}^R \left| \frac{e_r(T+h|T)}{y_r(T+h)}\right| \end{aligned}$$
(11)
$$\begin{aligned} \text{ RMSE }(h)&= \sqrt{\frac{1}{R} \sum _{r=1}^R (e_r(T+h|T))^2}. \end{aligned}$$
(12)

Here \(y_r(T+h)\) is the unnormalized target at time \(T+h\), T is the index of the last point of the training series, and h is the number of steps ahead; \(\hat{y}_r(T+h|T)\) is the model's estimate of the target at time \(T+h\). Forecasts for several steps ahead were produced with the multi-stage prediction approach [2]. Table 1 shows the parameters that were tuned.
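Equations (10)–(12) translate directly into code. The sketch below assumes `e` is the length-R vector of errors \(e_r(T+h|T)\) for a fixed horizon h and `y` the corresponding unnormalized targets \(y_r(T+h)\); the function name is ours.

```python
import numpy as np

def forecast_metrics(e, y):
    """MAE, MAPE and RMSE over the R subsets for one horizon h (Eqs. 10-12)."""
    mae = np.mean(np.abs(e))
    mape = np.mean(np.abs(e / y))
    rmse = np.sqrt(np.mean(e ** 2))
    return mae, mape, rmse
```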

Table 1. LSTM parameters considered for tuning.
Table 2. Overall time for 10 runs (in minutes) and the selected values of the parameters for each model.

The results show that the proposed method achieves a better overall computational time over 10 runs, based on the best model (the one minimizing MSE) for each data set. Table 2 reports the training time of the original algorithm and of our proposal (columns two and three, respectively); the remaining columns show the parameter values selected to train each model.

Figure 3 shows that the proposed algorithm achieves better or comparable performance in terms of MAPE and RMSE. The results for MAE are omitted because they behave similarly to MAPE, only on a different scale. An important insight from these experiments is that the proposed method outperformed the original model, especially in MAPE, when forecasting several steps ahead.

Fig. 3. Data 1 (top left), Data 2 (top right), Data 3 (bottom left).

5 Conclusions and Future Work

This work presents an efficient training algorithm for LSTM networks. We observed that the proposed training method outperforms the original algorithm, reducing the computational time by 98%, 99% and 92% for the three data sets, respectively. One can also notice that, although our proposal uses a greater number of blocks, cells or lags, it remains more efficient. The results suggest that our proposal, besides being efficient, in general achieves better performance when forecasting several steps ahead. As future work, we would like to investigate how to increase the forecasting accuracy and to evaluate our algorithm against other models derived from LSTM. Another interesting issue is to explore the performance of our proposal on large datasets.