
Automatic Tuning of Stochastic Gradient Descent with Bayesian Optimisation

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2020)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12459)

Abstract

Many machine learning models require a training procedure based on stochastic gradient descent. A key element in the efficiency of these algorithms is the choice of the learning rate schedule. While finding good learning rate schedules using Bayesian optimisation has been tackled by several authors, adapting the schedule dynamically in a data-driven way remains an open question. This is of high practical importance to users who need to train a single, expensive model. To tackle this problem, we introduce an original probabilistic model for optimiser traces, based on latent Gaussian processes and an autoregressive formulation, that flexibly adjusts to abrupt changes in behaviour induced by new learning rate values. As illustrated, this model is well suited to a range of problems: the on-line adaptation of the learning rate for a cold-started run, tuning the schedule for a set of similar tasks (in a classical BO setup), and warm-starting it for a new task.
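As a rough illustration of the autoregressive idea sketched above (not the authors' model, which uses latent Gaussian processes with variational inference), the snippet below regresses the loss at each SGD step on the previous loss and the learning rate applied, using a plain GPflow GP regression. The toy trace, variable names, and kernel choice are assumptions made purely for the example.

    # Minimal sketch, assuming GPflow 2 is installed; illustrative only.
    import numpy as np
    import gpflow

    # Toy optimiser trace: losses observed along an SGD run and the
    # learning rates applied between consecutive observations.
    losses = np.array([2.30, 1.95, 1.70, 1.62, 1.40, 1.35, 1.28])
    lrs = np.array([0.10, 0.10, 0.10, 0.01, 0.01, 0.01])

    # Autoregressive inputs: (previous loss, log10 learning rate) -> next loss.
    X = np.column_stack([losses[:-1], np.log10(lrs)])
    Y = losses[1:].reshape(-1, 1)

    # Standard GP regression over the trace inputs.
    model = gpflow.models.GPR(
        (X, Y),
        kernel=gpflow.kernels.Matern52(lengthscales=[1.0, 1.0]),
    )
    gpflow.optimizers.Scipy().minimize(
        model.training_loss, model.trainable_variables
    )

    # Predict the next loss for candidate learning rates, given the current loss;
    # the predictive mean and variance could then feed a BO acquisition function.
    candidates = np.column_stack(
        [np.full(3, losses[-1]), np.log10([0.1, 0.01, 0.001])]
    )
    mean, var = model.predict_f(candidates)
    print(mean.numpy().ravel(), var.numpy().ravel())

In this simplified setting, choosing the next learning rate amounts to comparing the predictive distributions over the next loss value for each candidate; the paper's model plays an analogous role for full traces with abrupt regime changes.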


Notes

  1. In the following, without loss of generality we use the convention that \(\mathcal{L}\) should be maximised.


Author information


Corresponding author

Correspondence to Victor Picheny.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Picheny, V., Dutordoir, V., Artemev, A., Durrande, N. (2021). Automatic Tuning of Stochastic Gradient Descent with Bayesian Optimisation. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science, vol. 12459. Springer, Cham. https://doi.org/10.1007/978-3-030-67664-3_26


  • DOI: https://doi.org/10.1007/978-3-030-67664-3_26

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67663-6

  • Online ISBN: 978-3-030-67664-3

  • eBook Packages: Computer Science, Computer Science (R0)
