Abstract
Learning to optimize (L2O) has gained increasing attention, as it offers a promising path toward automating and accelerating the optimization of complicated problems. Unlike manually crafted classical optimizers, L2O parameterizes and learns optimization rules in a data-driven fashion. However, a primary barrier, scalability, persists for this paradigm: typical L2O models incur massive memory overhead due to their unrolled computational graphs, which precludes applying L2O to large-scale tasks. To overcome this core challenge, we propose a new scalable learning to optimize (SL2O) framework which (i) constrains network updates to a tiny subspace and (ii) then learns update rules on top of it. Thanks to the substantially reduced number of trainable parameters, learning optimizers for large-scale networks becomes feasible on a single GPU for the first time, removing the scalability roadblock that has kept L2O from training large models. Comprehensive experiments on various network architectures (e.g., ResNets, VGGs, ViTs) and datasets (e.g., CIFAR, ImageNet, E2E) across vision and language tasks consistently validate that SL2O achieves significantly faster convergence and competitive performance compared to analytical optimizers. For example, our approach converges 3.41\(\sim \)4.60 times faster on CIFAR-10/100 with ResNet-18, and 1.24 times faster on ViTs, with nearly no performance loss. Code is available at https://github.com/VITA-Group/Scalable-L2O.
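To make the two-step recipe above concrete, the toy sketch below restricts a large parameter vector to a \(k\)-dimensional random subspace, \(\theta = \theta_0 + Pz\), and meta-learns a coordinate-wise LSTM rule that updates the tiny coordinate vector \(z\) by backpropagating through a short unroll. This is an illustrative reconstruction under stated assumptions (quadratic toy objective, random projection basis \(P\), LSTM rule, and all hyperparameters), not the paper's released implementation.

```python
# Illustrative toy sketch (assumptions, not the paper's released code):
# (i) constrain updates to a tiny random subspace, theta = theta_0 + P z with z in R^k, k << n;
# (ii) meta-learn the rule that updates z by backpropagating through a short unroll.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, k, hidden, unroll = 1000, 8, 16, 20     # full dim, subspace dim, rule width, unroll length

# A stand-in "large" objective: f(theta) = 0.5 * ||A theta - b||^2 (assumed for illustration).
A = torch.randn(n, n) / n ** 0.5
b = torch.randn(n)
def loss_fn(theta):
    return 0.5 * ((A @ theta - b) ** 2).sum()

theta0 = torch.zeros(n)                    # frozen base parameters
P = torch.randn(n, k) / n ** 0.5           # fixed random basis spanning the tiny subspace

class LearnedRule(nn.Module):
    """Coordinate-wise learned optimizer: maps each gradient entry of z to an update of z."""
    def __init__(self, hidden):
        super().__init__()
        self.cell = nn.LSTMCell(1, hidden)
        self.head = nn.Linear(hidden, 1)
    def forward(self, grad, state):
        h, c = self.cell(grad.unsqueeze(-1), state)   # one LSTM step per subspace coordinate
        return 0.1 * self.head(h).squeeze(-1), (h, c)

rule = LearnedRule(hidden)
meta_opt = torch.optim.Adam(rule.parameters(), lr=1e-3)

for meta_step in range(200):               # meta-train the rule on unrolled inner trajectories
    z = torch.zeros(k, requires_grad=True)               # only k numbers are being optimized
    state = (torch.zeros(k, hidden), torch.zeros(k, hidden))
    meta_loss = 0.0
    for t in range(unroll):
        inner_loss = loss_fn(theta0 + P @ z)             # evaluate the objective at theta_0 + P z
        grad, = torch.autograd.grad(inner_loss, z, create_graph=True)
        step, state = rule(grad, state)
        z = z + step                                     # learned update applied in the subspace
        meta_loss = meta_loss + inner_loss
    meta_opt.zero_grad()
    meta_loss.backward()                   # meta-gradient flows through the unroll to the rule
    meta_opt.step()
```

The design point emphasized above shows up directly here: only the \(k\) subspace coordinates and the small rule network are trainable, so the state carried across the unrolled steps stays tiny no matter how large the underlying model is.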
X. Chen and T. Chen—Equal Contribution.
Notes
1. We also include GPT-2 results in the supplementary.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, X., Chen, T., Cheng, Y., Chen, W., Awadallah, A., Wang, Z. (2022). Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13683. Springer, Cham. https://doi.org/10.1007/978-3-031-20050-2_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20049-6
Online ISBN: 978-3-031-20050-2