Abstract
Many machine learning models are trained by running stochastic gradient descent, and a key element for the efficiency of such algorithms is the choice of the learning rate schedule. While several authors have used Bayesian optimisation to find good learning rate schedules, adapting the schedule dynamically in a data-driven way remains an open question. This is of high practical importance to users who need to train a single, expensive model. To tackle this problem, we introduce an original probabilistic model for optimiser traces, based on latent Gaussian processes and an auto-regressive formulation, that flexibly adjusts to the abrupt changes of behaviour induced by new learning rate values. As illustrated, this model is well suited to a range of problems: the on-line adaptation of the learning rate for a cold-started run, the tuning of the schedule for a set of similar tasks (in a classical BO setup), and warm-starting the schedule for a new task.
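The building block underlying the abstract is Gaussian-process regression over optimiser traces. As a minimal, illustrative sketch of that building block only (not the paper's full latent-GP auto-regressive model), the following uses GPflow to fit a GP to a single synthetic loss trace and extrapolate it with uncertainty; the synthetic trace, the Matérn-3/2 kernel, and all variable names here are illustrative assumptions, not the authors' code.

```python
import numpy as np
import gpflow

# Hypothetical optimiser trace: epochs vs. noisy validation loss
# (a synthetic stand-in for a real SGD run).
epochs = np.arange(1, 31, dtype=np.float64)[:, None]                    # shape (30, 1)
loss = 2.0 * np.exp(-0.15 * epochs) + 0.05 * np.random.randn(*epochs.shape)

# GP regression on the trace; a Matérn-3/2 kernel is a common default
# for rough learning-curve-like data.
model = gpflow.models.GPR(data=(epochs, loss), kernel=gpflow.kernels.Matern32())
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

# Extrapolate the trace a few epochs ahead with predictive uncertainty,
# the kind of quantity a Bayesian-optimisation layer would act on.
future = np.arange(31, 41, dtype=np.float64)[:, None]
mean, var = model.predict_f(future)
print(mean.numpy().ravel(), var.numpy().ravel())
```

In this sketch the predictive mean and variance over future epochs would feed an acquisition rule for choosing the next learning rate; the paper's model additionally handles the abrupt changes of behaviour that a new learning rate induces, which a single stationary GP like this one does not.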
Notes
- 1.
In the following, without loss of generality, we use the convention that \(\mathcal{L}\) is to be maximised.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Picheny, V., Dutordoir, V., Artemev, A., Durrande, N. (2021). Automatic Tuning of Stochastic Gradient Descent with Bayesian Optimisation. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science, vol 12459. Springer, Cham. https://doi.org/10.1007/978-3-030-67664-3_26
DOI: https://doi.org/10.1007/978-3-030-67664-3_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67663-6
Online ISBN: 978-3-030-67664-3