Elsevier

Neural Networks

Volume 141, September 2021, Pages 11-29

Automatic, dynamic, and nearly optimal learning rate specification via local quadratic approximation

https://doi.org/10.1016/j.neunet.2021.03.025

Abstract

In deep learning tasks, the update step size determined by the learning rate at each iteration plays a critical role in gradient-based optimization. However, determining the appropriate learning rate in practice typically relies on subjective judgment. In this work, we propose a novel optimization method based on local quadratic approximation (LQA). In each update step, we locally approximate the loss function along the gradient direction by using a standard quadratic function of the learning rate. Subsequently, we propose an approximation step to obtain a nearly optimal learning rate in a computationally efficient manner. The proposed LQA method has three important features. First, the learning rate is automatically determined in each update step. Second, it is dynamically adjusted according to the current loss function value and parameter estimates. Third, with the gradient direction fixed, the proposed method attains a nearly maximum reduction in the loss function. Extensive experiments were conducted to demonstrate the effectiveness of the proposed LQA method.

Introduction

In recent years, the development of deep learning has led to remarkable progress in computer vision (He et al., 2016, Huang et al., 2017, Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012, Xiong et al., 2016), natural language processing (Bahdanau et al., 2014, Goldberg and Levy, 2014), and numerous other fields. To facilitate different learning tasks, researchers have developed a wide range of neural network frameworks and learning paradigms, including deep convolutional neural networks (Krizhevsky et al., 2012, LeCun et al., 1989), recurrent neural networks (Graves, Mohamed, & Hinton, 2013), graph convolutional networks (Kipf & Welling, 2016), and deep reinforcement learning (Mnih et al., 2013, Mnih et al., 2015). Although the network structures may be markedly different, the training methods are typically similar; in particular, gradient descent methods are often employed.

Among various gradient descent methods, the stochastic gradient descent (SGD) method (Robbins & Monro, 1951) plays a critical role. In the standard SGD method, the first-order gradient of a randomly selected sample is calculated. Then, parameter estimates are adjusted using the negative of this gradient multiplied by a step size. Many generalized methods based on the SGD method have been proposed (Andrychowicz et al., 2016, Duchi et al., 2011, Kingma and Ba, 2015, Rumelhart et al., 1986, Tieleman and Hinton, 2012). Most of these extensions specify improved update rules to adjust the direction or the step size. However, Andrychowicz et al. (2016) pointed out that rule-based methods might perform well in some specific cases but poorly in others. Consequently, an optimizer with an automatically adjusted update rule is preferable.
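In symbols (our notation, not taken from the paper), with θ_t the current parameter estimate, g_t the stochastic gradient computed on the sampled data at step t, and δ_t the learning rate, the standard SGD update described above is

```latex
\theta_{t+1} = \theta_t - \delta_t \, g_t .
```

The extensions cited above differ chiefly in how they modify the direction g_t or the step size δ_t.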

An update rule contains two important components: the update direction and step size. The learning rate determines the step size, which plays a significant role in optimization. Empirical experience suggests that a relatively large learning rate might be preferable in the early stages of the optimization. Otherwise, the algorithm might converge very slowly. In contrast, a relatively small learning rate should be used in the later stages. Otherwise, the objective function cannot be fully optimized. This inspired us to design an automatic method to search for an optimal learning rate in each step during optimization.

To this end, we propose a novel optimization method based on local quadratic approximation (LQA). It tunes the learning rate automatically and dynamically, yielding a nearly optimal step size in each update step. Intuitively, given a search direction, one must consider what constitutes an optimal step size, that is, the step size leading to the greatest reduction in the global loss. For this purpose, the proposed method can be decomposed into two important steps: the expansion step and the approximation step. First, in the expansion step, we perform a Taylor expansion of the loss function around the current parameter estimates. The objective function can then be locally approximated by a quadratic function of the learning rate. The learning rate is thus treated as a parameter to be optimized, which leads to a nearly optimal determination of the learning rate in a particular step.
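To make the expansion step concrete (in our own notation, not the paper's): let θ_t denote the current parameter estimate, g_t the gradient of the loss L at θ_t, and H_t the Hessian. A second-order Taylor expansion of the loss along the negative gradient direction gives

```latex
L(\theta_t - \delta g_t)
  \;\approx\; L(\theta_t)
  - \delta \, g_t^\top g_t
  + \frac{\delta^2}{2} \, g_t^\top H_t \, g_t ,
```

which is a quadratic function of the learning rate δ. Treating δ as the quantity to optimize and minimizing the right-hand side yields δ* = (g_t^⊤ g_t) / (g_t^⊤ H_t g_t). Evaluating H_t directly would be prohibitively expensive in high dimensions, which is what the approximation step addresses.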

Second, to avoid calculating the high-dimensional Hessian matrix, we propose a novel approximation step. Given a fixed gradient direction, the loss function can be approximated by a standard univariate quadratic function with the learning rate as the only input variable. Thus, there are only two unknown coefficients: the linear term coefficient and quadratic term coefficient. To estimate the two unknown coefficients, one can, for example, select two different but reasonably small learning rates. Then, the corresponding objective function can be evaluated. This step produces two equations that can be solved to estimate the unknown coefficients, from which the optimal learning rate can be obtained.
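The two-probe fit described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name, probe step sizes, and the toy loss are our own choices, and the fitted quadratic f(δ) = aδ + bδ² measures the change in loss relative to the current estimate.

```python
import numpy as np

def lqa_learning_rate(loss_fn, theta, grad, d1=0.01, d2=0.02):
    """Estimate a nearly optimal step size along -grad.

    Approximates f(d) = loss(theta - d * grad) - loss(theta)
    by a quadratic a*d + b*d**2, whose two coefficients are
    fitted from two small probe step sizes d1 and d2, then
    returns the minimizer -a / (2*b).
    """
    f0 = loss_fn(theta)
    f1 = loss_fn(theta - d1 * grad) - f0
    f2 = loss_fn(theta - d2 * grad) - f0
    # Two equations in two unknowns:
    #   a*d1 + b*d1**2 = f1
    #   a*d2 + b*d2**2 = f2
    A = np.array([[d1, d1 ** 2], [d2, d2 ** 2]])
    a, b = np.linalg.solve(A, np.array([f1, f2]))
    return -a / (2.0 * b)

# Toy check: for the quadratic loss L(t) = 0.5 * ||t||^2 the
# gradient at theta is theta itself, and the exact line-search
# optimum along -theta is a step size of 1.
theta = np.array([1.0, 2.0, 3.0])
loss = lambda t: 0.5 * float(t @ t)
print(lqa_learning_rate(loss, theta, theta))  # close to 1.0
```

Because the fitted model is exactly quadratic, each estimate costs only two extra forward evaluations of the loss and one 2-by-2 linear solve, with no Hessian-vector products.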

Our contributions: The proposed method contains three important features.

  • (1) The algorithm is automatic. In other words, it leads to an optimization method that requires minimal subjective judgment.

  • (2) The method is dynamic in the sense that the learning rate used in each update step is different. It is dynamically adjusted according to the current status of the loss function and parameter estimates.

  • (3) The learning rate derived from the proposed method is nearly optimal. In any given update step, with the gradient direction fixed, the learning rate determined by the proposed method can lead to a nearly maximum reduction in the loss function.

The remainder of this article is organized as follows. In Section 2, we review related work on gradient-based optimizers. Section 3 presents the proposed algorithm in detail. In Section 4, we verify the performance of the proposed method through empirical studies on open datasets. Concluding remarks are given in Section 5.

Section snippets

Related work

To optimize a loss function, two important components must be specified: the update direction and the step size. Ideally, the optimal update direction should be the gradient direction computed for the loss function based on the whole dataset. For convenience, we refer to this as the global gradient. Since calculating the global gradient is computationally expensive, the SGD method (Robbins & Monro, 1951) estimates a gradient based on a stochastic subsample in each iteration, which we refer to

Methodology

In this section, we introduce the notations used in this paper and describe the general formulation of the SGD method. We then propose an algorithm based on LQA to dynamically search for an optimal learning rate. This results in a new variant of the SGD method.

Experiments

In this section, we empirically evaluate the proposed method based on different models and compare it with various optimizers. To support practical applications that rely on the training of neural networks, we have developed implementations of the LQA method based on PyTorch and TensorFlow, which are available at https://github.com/rockfc196/LQA. This provides an alternative method for users to accelerate the training of their models. In the following part, we consider LQA as a variant of the

Conclusions

In this work, we propose LQA, a novel approach for determining the nearly optimal learning rate for automatic optimization. Our method has three important features. First, the learning rate is automatically estimated in each update step. Second, it is dynamically adjusted during the entire training process. Third, given the gradient direction, the learning rate leads to a nearly maximum reduction in the loss function. Experiments on openly available datasets demonstrated its effectiveness.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Danyang Huang’s research was partially supported by the National Natural Science Foundation of China (No. 12071477, 11701560) and the Consulting Research Project of the Chinese Academy of Engineering (2020-XY-30). Bo Zhang’s research was partially supported by the National Natural Science Foundation of China (No. 71873137). Hansheng Wang’s research was partially supported by the National Natural Science Foundation of China (No. 11831008, 11525101, 71532001). It was also supported in part by the Consulting

References (43)

  • Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks.
  • Agarwal, N., et al. (2017). Second-order stochastic optimization for machine learning in linear time. Journal of Machine Learning Research.
  • Andrychowicz, M., et al. Learning to learn by gradient descent by gradient descent.
  • Bahdanau, D., et al. (2014). Neural machine translation by jointly learning to align and translate.
  • Baydin, A. G., et al. (2017). Online learning rate adaptation with hypergradient descent.
  • Bergou, E., et al. (2020). A subsampling line-search method with second-order results.
  • Chong, E. K. P., et al. (2013). An introduction to optimization.
  • Cotter, A., et al. Better mini-batch algorithms via accelerated gradient methods.
  • Duchi, J., et al. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.
  • Gargiani, M., et al. (2020). On the promise of the stochastic generalized Gauss-Newton method for training DNNs.
  • Ge, R., et al. The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares.
  • Goldberg, Y., et al. (2014). Word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method.
  • Graves, A., et al. Speech recognition with deep recurrent neural networks.
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE...
  • Hinton, G., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine.
  • Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In...
  • Kingma, D. P., et al. Adam: A method for stochastic optimization.
  • Kipf, T. N., et al. (2016). Semi-supervised classification with graph convolutional networks.
  • Krizhevsky, A., et al. (2009). Learning multiple layers of features from tiny images. Tech. rep.
  • Krizhevsky, A., et al. Imagenet classification with deep convolutional neural networks.
  • Lan, G. (2012). An optimal method for stochastic composite optimization. Mathematical Programming.