
Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13683))


Abstract

Learning to optimize (L2O) has gained increasing attention since it offers a promising path to automating and accelerating the optimization of complicated problems. Unlike manually crafted classical optimizers, L2O parameterizes and learns optimization rules in a data-driven fashion. However, a primary barrier, scalability, persists for this paradigm: typical L2O models create massive memory overhead due to their unrolled computational graphs, which prevents L2O from being applied to large-scale tasks. To overcome this core challenge, we propose a new scalable learning to optimize (SL2O) framework which (i) first constrains the network updates to a tiny subspace and (ii) then learns optimization rules on top of it. Thanks to the substantially reduced number of trainable parameters, learning optimizers for large-scale networks on a single GPU becomes feasible for the first time, removing the scalability roadblock that has kept L2O from training large models. Comprehensive experiments on various network architectures (ResNets, VGGs, ViTs) and datasets (CIFAR, ImageNet, E2E) across vision and language tasks consistently validate that SL2O achieves significantly faster convergence and competitive performance compared to analytical optimizers. For example, our approach converges 3.41 to 4.60 times faster on CIFAR-10/100 with ResNet-18, and 1.24 times faster on ViTs, at nearly no performance loss. Code is available at https://github.com/VITA-Group/Scalable-L2O.

X. Chen and T. Chen—Equal Contribution.
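The two ingredients described in the abstract, restricting parameter updates to a tiny subspace and learning an update rule on those few coordinates, can be illustrated with a short sketch. The snippet below is not the authors' implementation (see the repository linked above): the random projection basis, the coordinate-wise LSTM rule, and all names are illustrative assumptions, and the meta-training loop that would actually fit the learned optimizer is omitted.

```python
import torch
import torch.nn as nn

# Minimal sketch: (i) constrain updates to a d-dimensional subspace of the
# D model parameters, (ii) let a small learned rule (here a coordinate-wise
# LSTM) propose updates on the d subspace coordinates. The subspace basis
# (a random projection) and the LSTM rule are assumptions for illustration.

def flat_params(model):
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def set_flat_params(model, flat):
    idx = 0
    for p in model.parameters():
        n = p.numel()
        p.data.copy_(flat[idx:idx + n].view_as(p))
        idx += n

class SubspaceLearnedOptimizer(nn.Module):
    """Maps the projected gradient (d values) to a proposed subspace update."""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden)
        self.head = nn.Linear(hidden, 1)
        self.state = None  # hidden state carried across optimization steps

    def forward(self, grad_sub):
        x = grad_sub.reshape(1, -1, 1)             # (seq=1, batch=d, feature=1)
        out, self.state = self.lstm(x, self.state)
        return self.head(out).reshape(-1)          # update for the d coordinates

# Usage sketch on a toy regression model.
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))
theta0 = flat_params(model)                        # frozen anchor point in R^D
D, d = theta0.numel(), 16
P = torch.randn(D, d) / d ** 0.5                   # assumed random subspace basis
z = torch.zeros(d)                                 # the only coordinates being trained
learned_opt = SubspaceLearnedOptimizer()

for step in range(5):
    set_flat_params(model, theta0 + P @ z)         # theta = theta0 + P z
    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = (model(x) - y).pow(2).mean()
    grads = torch.autograd.grad(loss, tuple(model.parameters()))
    g_sub = P.t() @ torch.cat([g.reshape(-1) for g in grads])  # project gradient
    with torch.no_grad():
        z = z + learned_opt(g_sub)                 # learned rule, not a hand-crafted one
```

In such a setup, a learned rule of this kind would be meta-trained by unrolling steps like the loop above and back-propagating through them; operating on only d coordinates rather than all D parameters is where the memory savings described in the abstract would come from.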


Notes

  1. We also include GPT-2 results in the supplementary.


Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Xuxi Chen.

Editor information

Editors and Affiliations

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1570 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Chen, X., Chen, T., Cheng, Y., Chen, W., Awadallah, A., Wang, Z. (2022). Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13683. Springer, Cham. https://doi.org/10.1007/978-3-031-20050-2_23


  • DOI: https://doi.org/10.1007/978-3-031-20050-2_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20049-6

  • Online ISBN: 978-3-031-20050-2

  • eBook Packages: Computer Science, Computer Science (R0)
