Abstract
Nowadays, the Adam algorithm has become one of the most popular optimizers for training feed-forward neural networks because it combines the best features of other gradient-based optimizers: it works well with sparse gradients and in online and non-stationary settings, and it is robust to rescaling of the gradient. These properties make Adam well suited to problems with non-stationary objectives, very noisy gradients, and large data inputs. In this work, we enhance the Adam algorithm with the Kalman filter; the resulting proposal is called KAdam. Instead of using the gradients computed from the cost function directly, we first pass them through a Kalman filter. The filtered gradients allow the algorithm to explore new (and potentially better) solutions on the cost function. The results obtained when applying our proposal and other state-of-the-art optimizers to classification problems show that KAdam is able to obtain better accuracies than its competitors in the same execution time.
1 Introduction
Gradient descent is an iterative algorithm for function optimization and by far one of the most popular methods used in the training of neural networks. The gradient descent (GD) algorithm has three variants: vanilla gradient descent (a.k.a. batch gradient descent), stochastic gradient descent (SGD), and mini-batch gradient descent, which differ in the amount of training data used to compute the gradients (see [9] for further details). However, the GD algorithm and all its variants may present slow convergence or heavy oscillations of the cost function [10]. As a result, many proposals have been made to improve the conventional gradient descent algorithms.
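As an illustration (not from the paper), the three variants can be sketched for a generic gradient function; the only difference between them is how many samples feed each gradient evaluation. The name `gd_variants` and the `grad_fn` signature are our own choices for this sketch.

```python
import numpy as np

# Illustrative sketch of the three GD variants. `grad_fn(X, y, w)` is any
# function returning the gradient of the cost over the given samples.
def gd_variants(grad_fn, X, y, w, eta=0.01, batch_size=32):
    # Batch (vanilla) GD: one update using the full training set.
    w_batch = w - eta * grad_fn(X, y, w)

    # SGD: one update per training sample.
    w_sgd = w.copy()
    for i in range(len(X)):
        w_sgd = w_sgd - eta * grad_fn(X[i:i + 1], y[i:i + 1], w_sgd)

    # Mini-batch GD: one update per batch of `batch_size` samples.
    w_mini = w.copy()
    for i in range(0, len(X), batch_size):
        w_mini = w_mini - eta * grad_fn(X[i:i + batch_size],
                                        y[i:i + batch_size], w_mini)

    return w_batch, w_sgd, w_mini
```

On the same data and learning rate, SGD and mini-batch GD make many small noisy steps per pass while batch GD makes a single exact one, which is the trade-off between convergence speed and gradient noise discussed below.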
Momentum [8] is one of these methods; it accelerates the SGD algorithm in a relevant direction, although its weakness is behaving like a blind ball rolling down the hill. AdaGrad [2] was designed to solve this blind-rolling problem by using non-constant learning rates. Unfortunately, it suffers from a radically diminishing learning rate, at which point the algorithm is no longer able to keep learning. Later, Adadelta [12] and RMSProp [11] both introduced adaptive learning rates, solving AdaGrad's problem. On the other hand, Adam [4] (described by its authors as a combination of AdaGrad and RMSProp) is one of the most popular optimizers in current neural network frameworks such as [1, 7]. The Adam algorithm is commonly used because it performs well, is straightforward to implement, works well with sparse gradients and in online and non-stationary settings, and is robust to rescaling of the gradient. These properties make Adam a strong choice for problems with non-stationary objectives, very noisy gradients, and large data inputs.
The methods mentioned above (see [13] for a detailed introduction) have shown great empirical results. Nevertheless, we propose to enhance the Adam algorithm with the Kalman filter [3], because using the estimated gradients instead of the computed ones introduces significant variations. This change may help the algorithm explore and reach better solutions on the cost function, as other works have done by adding Gaussian noise to the gradients [6]. Hence, this paper introduces the KAdam algorithm, an extension of Adam using the Kalman filter.
The structure of this paper is as follows. First, Sect. 2 describes the Adam algorithm. Then, Sect. 3 provides a brief introduction to the Kalman filter. After that, Sect. 4 describes the KAdam algorithm. Subsequently, Sect. 5 presents the experiments carried out and performance comparisons between the proposed method and other gradient-based optimizers. Finally, Sect. 6 presents the authors' conclusions and future work.
2 Adam
The first step of the Adaptive Moment Estimation (Adam) algorithm [4] is to keep exponentially decaying averages of past gradients (first moment) and past squared gradients (second moment). This is done by computing the first moment estimate \(\upsilon _t\) (the mean) and the second moment estimate \(m_t\) (the uncentered variance) with the following equations:
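For reference, the moment updates of [4], written in this paper's notation (first moment \(\upsilon_t\), second moment \(m_t\)), are:

```latex
\begin{align*}
\upsilon_t &= \beta_1\, \upsilon_{t-1} + (1 - \beta_1)\, g_t \tag{1}\\
m_t        &= \beta_2\, m_{t-1} + (1 - \beta_2)\, g_t^2 \tag{2}
\end{align*}
```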
where \(\beta _1\) and \(\beta _2\) are the decay rates for the first and second moments (which the authors of Adam suggest setting to 0.9 and 0.999, respectively), \(g_t \in \mathbb {R}^n\) is the computed gradient of the cost function, and \(g_t^2\) denotes the element-wise square of the gradient.
As \(\upsilon _t\) and \(m_t\) are initialized as zero vectors, the authors of Adam observe that they are biased towards zero, especially during the initial time steps and when the decay rates are small. Thus, they counteract these biases by computing bias-corrected first and second moment estimates.
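The bias-corrected estimates, as given in [4] and written in this paper's notation, are:

```latex
\begin{align*}
\hat{\upsilon}_t &= \frac{\upsilon_t}{1 - \beta_1^t} \tag{3}\\
\hat{m}_t        &= \frac{m_t}{1 - \beta_2^t} \tag{4}
\end{align*}
```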
Finally, the parameters update rule is given by:
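In this paper's notation, the Adam update rule from [4] reads:

```latex
\begin{align*}
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{m}_t} + \epsilon}\, \hat{\upsilon}_t \tag{5}
\end{align*}
```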
where \(\eta \) is the learning rate and \(\epsilon \) is a smoothing term (used to ensure numerical stability), which the authors of Adam suggest setting to 0.001 and a small value on the order of \(10^{-8}\), respectively.
3 Kalman Filter
The Kalman filter [3] is a recursive state estimator for linear systems. The algorithm consists of a group of equations that work in a two-step process: prediction and update. The prediction phase is described by the following equations.
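In standard form [3], the prediction equations are:

```latex
\begin{align*}
\hat{\mathbf{x}}_{k|k-1} &= \mathbf{F}_k\, \hat{\mathbf{x}}_{k-1|k-1} + \mathbf{B}_k\, \mathbf{u}_k \tag{6}\\
\mathbf{P}_{k|k-1}       &= \mathbf{F}_k\, \mathbf{P}_{k-1|k-1}\, \mathbf{F}_k^\top + \mathbf{Q}_k \tag{7}
\end{align*}
```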
These equations give a prediction of the state estimate and the error covariance based only on information from the previous time step. In Eq. (6), the Kalman filter computes an a priori state estimate \(\hat{\mathbf {x}}_{k|k-1}\), where \(\hat{\mathbf {x}}_{k-1|k-1}\) is the previous state estimate, \(\mathbf {F}_k\) is the state-transition model, and \(\mathbf {B}_k\) is the control-input model with its input vector \(\mathbf {u}_k\). In Eq. (7), the a priori error covariance \(\mathbf {P}_{k|k-1}\) is computed, where \(\mathbf {P}_{k-1|k-1}\) is the previous error covariance and \(\mathbf {Q}_{k}\) is the covariance of the process noise. On the other hand, the update phase is described by the following equations.
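In standard form [3], the update equations are:

```latex
\begin{align*}
\mathbf{K}_k &= \mathbf{P}_{k|k-1}\, \mathbf{H}_k^\top \left(\mathbf{H}_k\, \mathbf{P}_{k|k-1}\, \mathbf{H}_k^\top + \mathbf{R}_k\right)^{-1} \tag{8}\\
\hat{\mathbf{x}}_{k|k} &= \hat{\mathbf{x}}_{k|k-1} + \mathbf{K}_k \left(\mathbf{z}_k - \mathbf{H}_k\, \hat{\mathbf{x}}_{k|k-1}\right) \tag{9}\\
\mathbf{P}_{k|k} &= \left(\mathbf{I} - \mathbf{K}_k\, \mathbf{H}_k\right) \mathbf{P}_{k|k-1} \tag{10}
\end{align*}
```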
These equations give an updated estimate of the state and the error covariance, corrected using observed measurements \(\mathbf {z}_k\) of the true state at the current time step. In Eq. (8), the optimal Kalman gain matrix \(\mathbf {K}_k\) is computed, where \(\mathbf {H}_k\) is the observation matrix and \(\mathbf {R}_k\) is the covariance of the observation noise. In Eqs. (9) and (10), the Kalman filter computes the updated (a posteriori) state estimate \(\hat{\mathbf {x}}_{k|k}\) and the updated (a posteriori) error covariance \(\mathbf {P}_{k|k}\), respectively.
4 KAdam
The KAdam algorithm uses a Kalman filter to estimate the gradients of the cost function. Since the dynamics of the gradients are considered unknown, the matrices \(\mathbf {F}_k\), \(\mathbf {H}_k\), \(\mathbf {Q}_k\), and \(\mathbf {R}_k\) are set to identity matrices, and the state vector \(\hat{\mathbf {x}}_{k|k}\) is initialized as a zero vector with dimensions matching the gradient vector. Moreover, the gradients \(g_t\) of the cost function are used as the measurements \(\mathbf {z}_k\) of the true state in the Kalman filter. Thus, the estimated gradients \(\hat{g}_t\) can be written as the post-fit measurements \(\mathbf {H}_k\hat{\mathbf {x}}_{k|k}\) from the filter. The steps to calculate the estimated gradients \(\hat{g}_t\) with the Kalman filter are summarized as a function \(K(\bullet )\).
Hence, the equations to calculate the first and second moment are the following:
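Substituting the filtered gradients \(\hat{g}_t = K(g_t)\) for \(g_t\) in Eqs. (1) and (2) gives:

```latex
\begin{align*}
\upsilon_t &= \beta_1\, \upsilon_{t-1} + (1 - \beta_1)\, \hat{g}_t \tag{11}\\
m_t        &= \beta_2\, m_{t-1} + (1 - \beta_2)\, \hat{g}_t^2 \tag{12}
\end{align*}
```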
The original equations from Adam to compute the bias-correction of the moments (see Eqs. (3) and (4)) and the update rule (see Eq. (5)) were not modified.
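A minimal sketch of a KAdam step, based on our reading of Sects. 2 to 4 rather than the authors' code. With \(\mathbf{F}_k\), \(\mathbf{H}_k\), \(\mathbf{Q}_k\), and \(\mathbf{R}_k\) set to identity matrices, the Kalman filter decouples into one independent scalar filter per gradient coordinate; the class and method names are our own.

```python
import numpy as np

# Sketch of KAdam with identity Kalman matrices: each gradient coordinate
# gets its own scalar filter, so the covariance P is stored as a vector.
class KAdam:
    def __init__(self, n, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
        self.eta, self.beta1, self.beta2, self.eps = eta, beta1, beta2, eps
        self.v = np.zeros(n)   # first moment estimate (the mean)
        self.m = np.zeros(n)   # second moment estimate (uncentered variance)
        self.x = np.zeros(n)   # Kalman state estimate
        self.P = np.ones(n)    # diagonal of the error covariance
        self.t = 0

    def _kalman(self, g):
        """K(g): filter the gradient, using g as the measurement z_k."""
        P_pred = self.P + 1.0            # predict: F = I, Q = I
        K = P_pred / (P_pred + 1.0)      # gain: H = I, R = I
        self.x = self.x + K * (g - self.x)
        self.P = (1.0 - K) * P_pred
        return self.x                    # post-fit measurement H x = x

    def step(self, params, g):
        self.t += 1
        g_hat = self._kalman(g)          # filtered gradient replaces g_t
        self.v = self.beta1 * self.v + (1 - self.beta1) * g_hat
        self.m = self.beta2 * self.m + (1 - self.beta2) * g_hat ** 2
        v_hat = self.v / (1 - self.beta1 ** self.t)   # bias correction
        m_hat = self.m / (1 - self.beta2 ** self.t)
        return params - self.eta * v_hat / (np.sqrt(m_hat) + self.eps)
```

As a sanity check, iterating this step on a simple quadratic cost drives the parameters toward the minimum, with the filter smoothing the raw gradients before the Adam moments consume them.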
5 Experiments
To empirically evaluate the accuracy and efficiency of the proposal, two experiments (with two types of training each) were carried out using feed-forward neural networks on some of the most popular benchmark problems in machine learning. Each experiment compares the proposed algorithm against the following algorithms: GD, Momentum, RMSProp, and Adam. We include both stochastic and batch experiments: in stochastic training, the model parameters are adapted for every pattern in the training set, while in batch training the full training set is used to compute each parameter update. The comparison criterion is the reduction of the mean squared error (MSE) cost through the training and test phases.
In the experiments, each neural network was configured with the same architecture (selected experimentally) and the same weight initialization. The hyper-parameter settings used in the experiments are listed in Table 1, except for the learning rate, which is \(\eta =0.01\) in all experiments.
5.1 Experiment: Moons
The experiment deals with the classification of two interleaving half circles, using a dataset with 12,000 samples (10,000 for training and 2,000 for test) generated by a function from the scikit-learn Python package [7]. The architecture of the neural networks was fixed to (10, 1) layers, with a \(\tanh (\bullet )\) activation in the hidden layer and a sigmoid in the output layer.
In the stochastic training, the parameters of the neural network are adapted with each pattern. In Fig. 1, we show the error function during the training phase. Notice that GD and Momentum behave differently from RMSProp, Adam, and KAdam due to the second-moment dynamics. The second moment allows these algorithms to speed up in the early stage of training, as shown in the left image, where KAdam has the fastest descent. On the other hand, these algorithms show noisy behavior in the long run, where stochastically some low cost can be achieved.
In Table 2, we show the results of this experiment, where Adam has the best performance. Notice that, in the long run, all the algorithms obtain close results.
In the batch training, the algorithms performed the weight updates using all samples from the dataset in each iteration. Figure 2 shows that RMSProp, Adam, and KAdam do not exhibit the stochastic behavior in batch training. Moreover, the proposed method showed an improvement over Adam and the other methods presented. Table 3 shows the results of the experiment.
5.2 Experiment: MNIST
This experiment deals with the MNIST classification problem. Before training, the entire dataset was embedded into a 2D space (see Fig. 3) using a t-SNE [5] implementation.
The architecture of the neural networks was fixed to (10, 10) layers, with a \(\tanh (\bullet )\) activation in the hidden layer and a sigmoid in the output layer.
In Table 4, we show the results, where KAdam has the best result in the training phase.
In Fig. 4, we show the cost function during stochastic training on the MNIST dataset. The KAdam algorithm has performance comparable with RMSProp and Adam. The MNIST dataset is noisier than the moons dataset; therefore, the gradient-based algorithms in stochastic training tend to oscillate around the local minimum.
We also present the batch training experiment for the MNIST dataset. In Fig. 5 and Table 5, we present the results of this experiment, where Adam and KAdam are in close competition and outperform the other algorithms.
6 Conclusion
In this work, we presented a proposal to improve the performance of the Adam optimizer. As we have shown, when the Kalman filter is used, the estimated gradients follow the original ones while adding variations relevant enough to allow exploring new, and probably better, solutions in the cost function.
We presented empirical results on two classical datasets, moons and MNIST, with both stochastic and batch training. We have shown that our approach performs well in both the training and testing phases. We believe this algorithm opens the door to new developments in the research of better optimization algorithms for artificial neural networks. In future work, we will explore more deeply the impact of varying the Kalman parameters used to estimate the gradients.
References
Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. CoRR abs/1605.08695 (2016)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Kalman, R.E.: A new approach to linear filtering and prediction problems. Trans. ASME-J. Basic Eng. 82(Series D), 35–45 (1960)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015. Conference Track Proceedings, 7–9 May 2015, San Diego, CA, USA (2015)
van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
Neelakantan, A., et al.: Adding gradient noise improves learning for very deep networks. CoRR abs/1511.06807 (2015)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Qian, N.: On the momentum term in gradient descent learning algorithms. Neural Netw. 12(1), 145–151 (1999). https://doi.org/10.1016/S0893-6080(98)00116-6
Ruder, S.: An overview of gradient descent optimization algorithms. CoRR abs/1609.04747 (2016)
Sutton, R.S.: Two problems with backpropagation and other steepest-descent learning procedures for networks. In: Proceedings of the Eighth Annual Conference of the Cognitive Science Society (1986)
Tieleman, T., Hinton, G.: Lecture 6.5 - RMSProp. Technical report, COURSERA: Neural Networks for Machine Learning (2012)
Zeiler, M.D.: ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701 (2012)
Zhang, J.: Gradient descent based optimization algorithms for deep learning models training. CoRR abs/1903.03614 (2019)
Cite this paper
Camacho, J.D., Villaseñor, C., Alanis, A.Y., Lopez-Franco, C., Arana-Daniel, N. (2019). KAdam: Using the Kalman Filter to Improve Adam algorithm. In: Nyström, I., Hernández Heredia, Y., Milián Núñez, V. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2019. Lecture Notes in Computer Science(), vol 11896. Springer, Cham. https://doi.org/10.1007/978-3-030-33904-3_40