
Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13683))


Abstract

Learning to optimize (L2O) has gained increasing attention since it offers a promising path to automating and accelerating the optimization of complicated problems. Unlike manually crafted classical optimizers, L2O parameterizes and learns optimization rules in a data-driven fashion. However, a primary barrier, scalability, persists for this paradigm: typical L2O models create massive memory overhead due to their unrolled computational graphs, which prevents L2O from being applied to large-scale tasks. To overcome this core challenge, we propose a new scalable learning to optimize (SL2O) framework which (i) first constrains the network updates to a tiny subspace and (ii) then learns optimization rules on top of it. Thanks to the substantially reduced number of trainable parameters, learning optimizers for large-scale networks on a single GPU becomes feasible for the first time, removing the scalability roadblock that has kept L2O from training large models. Comprehensive experiments on various network architectures (ResNets, VGGs, ViTs) and datasets (CIFAR, ImageNet, E2E) across vision and language tasks consistently validate that SL2O achieves significantly faster convergence and competitive performance compared to analytical optimizers. For example, our approach converges 3.41 to 4.60 times faster on CIFAR-10/100 with ResNet-18, and 1.24 times faster on ViTs, at nearly no performance loss. Code is available at https://github.com/VITA-Group/Scalable-L2O.

X. Chen and T. Chen—Equal Contribution.
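The two ingredients described in the abstract, restricting parameter updates to a tiny subspace and learning an update rule on those few coordinates, can be illustrated with a short sketch. The snippet below is not the authors' implementation (see the repository linked above): the random projection basis, the coordinate-wise LSTM rule, and all names are illustrative assumptions, and the meta-training loop that would actually fit the learned optimizer is omitted.

```python
import torch
import torch.nn as nn

# Minimal sketch: (i) constrain updates to a d-dimensional subspace of the
# D model parameters, (ii) let a small learned rule (here a coordinate-wise
# LSTM) propose updates on the d subspace coordinates. The subspace basis
# (a random projection) and the LSTM rule are assumptions for illustration.

def flat_params(model):
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def set_flat_params(model, flat):
    idx = 0
    for p in model.parameters():
        n = p.numel()
        p.data.copy_(flat[idx:idx + n].view_as(p))
        idx += n

class SubspaceLearnedOptimizer(nn.Module):
    """Maps the projected gradient (d values) to a proposed subspace update."""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden)
        self.head = nn.Linear(hidden, 1)
        self.state = None  # hidden state carried across optimization steps

    def forward(self, grad_sub):
        x = grad_sub.reshape(1, -1, 1)             # (seq=1, batch=d, feature=1)
        out, self.state = self.lstm(x, self.state)
        return self.head(out).reshape(-1)          # update for the d coordinates

# Usage sketch on a toy regression model.
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))
theta0 = flat_params(model)                        # frozen anchor point in R^D
D, d = theta0.numel(), 16
P = torch.randn(D, d) / d ** 0.5                   # assumed random subspace basis
z = torch.zeros(d)                                 # the only coordinates being trained
learned_opt = SubspaceLearnedOptimizer()

for step in range(5):
    set_flat_params(model, theta0 + P @ z)         # theta = theta0 + P z
    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = (model(x) - y).pow(2).mean()
    grads = torch.autograd.grad(loss, tuple(model.parameters()))
    g_sub = P.t() @ torch.cat([g.reshape(-1) for g in grads])  # project gradient
    with torch.no_grad():
        z = z + learned_opt(g_sub)                 # learned rule, not a hand-crafted one
```

In such a setup, a learned rule of this kind would be meta-trained by unrolling steps like the loop above and back-propagating through them; operating on only d coordinates rather than all D parameters is where the memory savings described in the abstract would come from.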


Notes

  1. We also include GPT-2 results in the supplementary.


Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Xuxi Chen.

Editor information

Editors and Affiliations

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1570 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Chen, X., Chen, T., Cheng, Y., Chen, W., Awadallah, A., Wang, Z. (2022). Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13683. Springer, Cham. https://doi.org/10.1007/978-3-031-20050-2_23


  • DOI: https://doi.org/10.1007/978-3-031-20050-2_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20049-6

  • Online ISBN: 978-3-031-20050-2

  • eBook Packages: Computer Science, Computer Science (R0)
