LLR: Learning learning rates by LSTM for training neural networks
Introduction
In recent years, neural networks have achieved great success on various tasks. They have been successfully applied to speech recognition [1], [2], [3], [4], object detection [5], [6], [7], visual tracking [8], [9], and image annotation [10], [11], [12], among others. However, training neural networks still faces many challenges. The adjustment of hyperparameters has always been one of the most important problems, because these parameters are usually hand-designed through extensive experimentation and cannot be learned by conventional methods.
Classic hyperparameters of neural networks include the learning rate, weight initialization, the number of layers, the number of neurons in each layer, and regularization. Among these, the learning rate determines how far to move in the direction of steepest descent of the loss. For new models and datasets, however, choosing a suitable learning rate for training remains a challenging issue. Researchers typically need a large number of experiments to determine an appropriate learning rate, and even an experimentally determined learning rate has many disadvantages, which directly affect training speed and model accuracy.
Fig. 1 shows the effect of different learning rates on neural network training. If we choose a small learning rate, although it can guarantee that we do not miss any local minimum, it also means that the network takes a long time to reach convergence, especially when training is trapped near a saddle point. On the contrary, if we use a large learning rate, the loss fluctuates strongly during training, and the network may even fail to converge in the end.
In traditional optimization methods, one first chooses an initial value for the learning rate and then lets it decrease monotonically during training according to a hand-designed schedule. Models such as DenseNet and ResNet both set the learning rate in this way. Recently, new methods for adjusting the learning rate have been proposed, such as cyclically varying the learning rate within a fixed range, or re-initializing it when it drops below a certain threshold. Compared with the traditional way of setting the learning rate, these methods can improve the performance of the network. The reason may be that increasing the learning rate at certain moments of the training process has a positive impact on the training of the network. In addition, adaptive learning rates are now widely used, because it is reasonable to set a different learning rate for each parameter and to adapt these learning rates automatically throughout the learning process. The algorithm proposed in [13] is an early heuristic method for adapting the learning rates of model parameters during training. It is based on a very simple idea: if the partial derivative of the loss with respect to a given model parameter keeps the same sign, the learning rate for that parameter should be increased; if the sign changes, the learning rate should be reduced.
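For reference, the sketch below illustrates the kinds of hand-designed schedules mentioned above: monotonic step decay, a cyclical (triangular) schedule, and a cosine schedule with warm restarts. The specific constants (base_lr, step_size, period, and so on) are illustrative assumptions, not values used in this paper.

```python
import math

def step_decay(t, base_lr=0.1, drop=0.1, every=30):
    """Monotonic schedule: multiply the learning rate by `drop` every `every` epochs."""
    return base_lr * (drop ** (t // every))

def triangular_clr(t, min_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Cyclical learning rate: oscillate linearly between min_lr and max_lr."""
    cycle = math.floor(1 + t / (2 * step_size))
    x = abs(t / step_size - 2 * cycle + 1)
    return min_lr + (max_lr - min_lr) * max(0.0, 1 - x)

def cosine_warm_restarts(t, base_lr=1e-2, period=50):
    """SGDR-style schedule: cosine annealing that restarts every `period` epochs."""
    t_cur = t % period
    return 0.5 * base_lr * (1 + math.cos(math.pi * t_cur / period))
```

All three schedules still depend on hand-chosen constants, which is exactly the drawback discussed next.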
The above-mentioned learning rate adjustment strategies share a common disadvantage: they require manually setting hyperparameters based on experience, including the decay profile of the learning rate, the upper and lower bounds of its variation, and the cycle length of its changes, which slows down the network training process. Therefore, this paper proposes a new learning rate adjustment strategy: we let a neural network, which we call the learning rate optimizer, learn how to set the learning rate. When training a neural network, our method continuously adjusts the learning rate based on changes of the network parameters. In other words, we let the neural network decide how far the parameters should move along the gradient in each iteration.
Our method can be regarded as a meta-learning method, and we use an LSTM as the learning rate optimizer. The LSTM network processes time series very well because of its outstanding ability to handle long-term dependence problems [13], [14], so we use it to learn the learning rate. Its memory, especially long-term memory, allows it to set the current learning rate more effectively based on historical information.
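As a rough illustration of this idea (not the authors' exact implementation), the sketch below builds a two-layer LSTM with 20 hidden units, matching the configuration reported in the experiments, that maps simple per-step training statistics to a positive learning rate. The choice of input features (current loss and gradient norm), the softplus output, and the helper `step_with_learned_lr` are assumptions for illustration; meta-training of the optimizer through unrolled optimizee steps, as in [26], is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMLearningRateOptimizer(nn.Module):
    """Two-layer LSTM (20 hidden units) that maps per-step training statistics
    to a positive learning rate. Input features are an illustrative assumption."""

    def __init__(self, input_size=2, hidden_size=20, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, features, state=None):
        # features: (seq_len=1, batch=1, input_size)
        out, state = self.lstm(features, state)
        lr = F.softplus(self.head(out))  # constrain the predicted rate to be positive
        return lr.squeeze(), state


def step_with_learned_lr(params, loss, lr_opt, state=None):
    """One optimizee update: predict a learning rate from (loss, gradient norm),
    then take a plain gradient step with that rate."""
    grads = torch.autograd.grad(loss, params)
    grad_norm = torch.stack([g.norm() for g in grads]).sum()
    feats = torch.tensor([[[loss.item(), grad_norm.item()]]])
    with torch.no_grad():
        lr, state = lr_opt(feats, state)
        for p, g in zip(params, grads):
            p -= lr * g
    return state
```

In actual meta-training, the optimizee's loss would be backpropagated through the unrolled updates into the LSTM's parameters; the `no_grad` block above only shows how the learned optimizer would be used at inference time.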
We demonstrate our method on different datasets and architectures. Experimental results show that it converges efficiently and attains a smaller loss with the same number of iterations.
The contributions of this paper are:
1. We propose a new strategy for tuning learning rates that uses a Long Short-Term Memory (LSTM) network to learn learning rates for training neural networks. Our method avoids the need to determine appropriate learning rates through a large number of experiments.
2. With the same network architecture and dataset, our method requires fewer iterations to achieve the same loss. In addition, it can be combined with most optimization methods to achieve better performance.
3. We demonstrate the learning rate optimizer with a quadratic function, a fully connected network, a simple CNN, and DenseNets on a synthetic dataset, MNIST, and CIFAR-10. The experimental results prove its effectiveness.
Section snippets
Related works
For the training of neural networks, the choice of learning rate has a great impact on the training results, because the learning rate determines how much of the gradient is applied to the current weights and thus how quickly they move in the direction that reduces the loss. Making the right choice in the right direction means we can train the model in less time.
Methods
To achieve convergence with gradient-based optimization methods, the learning rate should change along with the directions of the gradients. Using a fixed learning rate during training greatly slows down convergence, so it is important to choose better learning rates.
How can we know which learning rate is best in each iteration? For the gradient descent method, the selected learning rate should make the value of the loss function decrease as much as possible.
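The toy example below makes this criterion concrete: among a grid of candidate step sizes, the best learning rate for the current step is the one that most reduces the loss along the negative gradient. The quadratic loss and the candidate grid are purely illustrative and are not taken from the paper.

```python
import numpy as np

def loss(theta):
    # Simple quadratic loss used only for illustration.
    return 0.5 * np.sum(theta ** 2)

def grad(theta):
    return theta  # gradient of the quadratic above

def best_learning_rate(theta, candidates=np.linspace(0.01, 2.0, 200)):
    """Pick the candidate step size that most reduces the loss along -grad."""
    g = grad(theta)
    losses = [loss(theta - eta * g) for eta in candidates]
    return candidates[int(np.argmin(losses))]

theta = np.array([3.0, -1.5])
print(best_learning_rate(theta))  # approximately 1.0, the exact minimizer along -g
```

Such an exhaustive search is far too expensive to run at every iteration of a real network, which is why a learning rate optimizer that predicts the rate is attractive.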
Experiments
In this section, we evaluate the efficiency of our method. Our experiments are based on the problem in [26]. We use a two-layer LSTM with 20 hidden units as the learning rate optimizer. We apply this learning rate optimizer to different optimization methods and compare them with different learning rate adjustment strategies. A comparison between the learning rate optimizer and the LSTM optimizer is also described below. Every learning rate optimizer is trained for 100 epochs with the same number of steps.
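One possible way to combine the learning rate optimizer with a standard optimization method, sketched under the assumption that the `LSTMLearningRateOptimizer` from the earlier sketch is available, is to overwrite the base optimizer's learning rate with the LSTM's prediction before each step; SGD is used here only as an example of a base method.

```python
import torch

def train_with_learned_lr(model, data_loader, loss_fn, lr_opt, epochs=1):
    """Wrap a standard optimizer and set its learning rate from the LSTM
    prediction before every parameter update (illustrative wiring only)."""
    base = torch.optim.SGD(model.parameters(), lr=0.01)  # initial lr is a placeholder
    state = None
    for _ in range(epochs):
        for x, y in data_loader:
            base.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            grad_norm = sum(p.grad.norm() for p in model.parameters() if p.grad is not None)
            feats = torch.tensor([[[loss.item(), float(grad_norm)]]])
            with torch.no_grad():
                lr, state = lr_opt(feats, state)
            for group in base.param_groups:
                group["lr"] = float(lr)  # inject the learned rate
            base.step()
```

Other base methods (momentum SGD, Adam, and so on) could be substituted for SGD in the same way.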
Conclusion
This paper proposes a learning rate adjustment strategy based on LSTM. We consider the correlation between the learning rates of successive iterations, making full use of the advantages of LSTM in processing sequential data to provide a learning rate for each step of neural network training, so that each parameter update reduces the loss as much as possible. Our method is verified on different datasets and network structures. The experimental results show that the LSTM-based learning rate adjustment strategy is effective.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (61772124).
References (42)
- et al., Design of fractional order PID controller for automatic regulator voltage system based on multi-objective extremal optimization, Neurocomputing (2015)
- Some methods of speeding up the convergence of iteration methods, Zh. Vychisl. Mat. Mat. Fiz. (1964)
- et al., Deep neural networks for acoustic modeling in speech recognition, IEEE Signal Process. Mag. (2012)
- H. Sak, A. Senior, K. Rao, and F. Beaufays, Fast and accurate recurrent neural network acoustic models for speech...
- et al., Towards end-to-end speech recognition with recurrent neural networks
- W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, Achieving human parity in...
- et al., Spatial pyramid pooling in deep convolutional networks for visual recognition, in ECCV (2014)
- et al., You only look once: Unified, real-time object detection
- et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR (2014)
- et al., Visual tracking with fully convolutional networks, 2015 IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, December 7–13, 2015 (2015)
- Fully-convolutional siamese networks for object tracking, in ECCV workshop
- Deep visual-semantic alignments for generating image descriptions, IEEE Trans. Pattern Anal. Mach. Intell.
- Show, attend and tell: Neural image caption generation with visual attention, Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, France, 6–11 July 2015
- Deep captioning with multimodal recurrent neural networks (m-RNN), CoRR
- A two-layer nonlinear combination method for short-term wind speed prediction based on ELM, ENN, and LSTM, IEEE Internet Things J.
- A nonlinear hybrid wind speed forecasting model using LSTM network, hysteretic ELM and differential evolution algorithm, Energy Convers. Manag.
- Cyclical learning rates for training neural networks, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE
- SGDR: Stochastic gradient descent with warm restarts
- Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming, Mach. Learn.
- Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res.
Changyong Yu received his B.E., M.E., and Ph.D. in computer science from Northeastern University, China, in 2004, 2006, and 2009, respectively. Currently he is an associate professor in the School of Information Science and Engineering, Northeastern University, China. He is a member of IEEE, ACM, and CCF. His major research interests include data mining and bioinformatics.
Xin Qi is currently pursuing the Master degree at the School of Computer and Communication Engineering, Northeastern University at Qinhuangdao, China. Her major research interest is machine learning.
Haitao Ma received his Ph.D. in computer application technology from Harbin Institute of Technology, China. Currently he is a lecturer in the School of Computer and Communication Engineering, Northeastern University at Qinhuangdao, China. His major research interests include image recognition and bioinformatics.
Xin He is currently pursuing the Master degree at the School of Computer and Communication Engineering, Northeastern University at Qinhuangdao, China. His major research interest is image recognition.
Cuirong Wang received her Ph.D. in computer science from Northeastern University, China. Currently she is a professor in the School of Computer and Communication Engineering, Northeastern University at Qinhuangdao, China. Her major research interests include computer network virtualization, cloud computing, and big data analysis.
Yuhai Zhao received his B.E., M.E., and Ph.D. in computer science from Northeastern University, China, in 1999, 2004, and 2007, respectively. Currently he is an associate professor in the School of Information Science and Engineering, Northeastern University, China. He is a member of IEEE, ACM, and CCF. His major research interests include data mining and bioinformatics.