Abstract
Modern machine learning models are often over-parameterized and, as a result, can interpolate the training data. In this setting, we study the convergence properties of a sampling-without-replacement variant of stochastic gradient descent (SGD) known as random reshuffling (RR). Unlike SGD, which samples the data with replacement at every iteration, RR chooses a random permutation of the data at the beginning of each epoch, and each iteration within the epoch uses the next sample from that permutation. For under-parameterized models, it has been shown that RR can converge faster than SGD under certain assumptions. However, previous works do not show that RR outperforms SGD in over-parameterized settings except in some highly restrictive scenarios. For the class of Polyak-Łojasiewicz (PL) functions, we show that RR can outperform SGD in over-parameterized settings when either of the following holds: (i) the number of samples (n) is less than the product of the condition number (\(\kappa\)) and the parameter (\(\alpha\)) of a weak growth condition (WGC), or (ii) n is less than the parameter (\(\rho\)) of a strong growth condition (SGC).
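To make the sampling difference concrete, the following is a minimal sketch (not taken from the paper) of one epoch of with-replacement SGD versus RR on a generic finite-sum objective \(f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)\); the helper grad_fi, the least-squares example, and the step-size choice are illustrative assumptions.

import numpy as np

def sgd_epoch(w, grad_fi, n, step_size, rng):
    # With-replacement sampling: each of the n steps draws an independent index.
    for _ in range(n):
        i = rng.integers(n)
        w = w - step_size * grad_fi(w, i)
    return w

def rr_epoch(w, grad_fi, n, step_size, rng):
    # Random reshuffling: draw one permutation per epoch and sweep through it,
    # so every sample is used exactly once per epoch.
    for i in rng.permutation(n):
        w = w - step_size * grad_fi(w, i)
    return w

# Illustrative over-parameterized least-squares instance (d > n, so interpolation
# holds generically): f_i(w) = 0.5 * (x_i^T w - y_i)^2.
rng = np.random.default_rng(0)
n, d = 20, 50
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
grad_fi = lambda w, i: (X[i] @ w - y[i]) * X[i]
L_max = (np.linalg.norm(X, axis=1) ** 2).max()   # per-sample smoothness constant
w = np.zeros(d)
for epoch in range(100):
    w = rr_epoch(w, grad_fi, n, step_size=1.0 / L_max, rng=rng)

The two routines differ only in the index sequence they generate; the paper's analysis quantifies when the without-replacement sequence provably converges faster under interpolation.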
Notes
- 1.
- 2. Since the first version of this work was released, Koloskova et al. [14] have shown that SGD with arbitrary data orderings, including RR, converges at least as fast as SGD in a general nonconvex setting, irrespective of the number of epochs. However, only sublinear rates are possible in this general setting.
- 3. In this setting we could solve the problem by simply applying gradient descent to any individual function and ignoring all other training examples.
- 4. The usual "weak strong convexity" assumption is that, for each \(f_i\), the strong convexity inequality holds at the projection onto the set of minimizers of \(f_i\), which is the case for least squares. Ma and Zhou instead use the projection onto the intersection of this set with the set of minimizers of \(f\), which does not hold in general for least squares (a sketch of the two inequalities appears after this list).
- 5. By taking \(B=0\) in Mishchenko et al.'s Theorem 4.
- 6.
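To make footnote 4 concrete, here is a sketch of the two inequalities being contrasted; the notation (\(\mathcal{X}_i^*\) for the minimizers of \(f_i\), \(\mathcal{X}^*\) for the minimizers of \(f\), and \(\Pi_S\) for Euclidean projection onto a set \(S\)) is introduced only for illustration and is not taken from Ma and Zhou's paper. The version that holds for least squares requires, for each \(i\),
\[
f_i\big(\Pi_{\mathcal{X}_i^*}(x)\big) \;\ge\; f_i(x) + \big\langle \nabla f_i(x),\, \Pi_{\mathcal{X}_i^*}(x) - x \big\rangle + \tfrac{\mu}{2}\,\big\| \Pi_{\mathcal{X}_i^*}(x) - x \big\|^2 ,
\]
whereas Ma and Zhou require the same inequality with \(\Pi_{\mathcal{X}_i^*}\) replaced by \(\Pi_{\mathcal{X}_i^* \cap \mathcal{X}^*}\), the projection onto \(\mathcal{X}_i^* \cap \mathcal{X}^*\); this stronger requirement can fail for least squares.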
References
Ahn, K., Yun, C., Sra, S.: SGD with shuffling: optimal rates without component convexity and large epoch requirements. Adv. Neural. Inf. Process. Syst. 33, 17526–17535 (2020)
Bassily, R., Belkin, M., Ma, S.: On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564 (2018)
Bottou, L.: Curiously fast convergence of some stochastic gradient descent algorithms. In: Proceedings of the Symposium on Learning and Data Science, Paris, vol. 8, pp. 2624–2633 (2009)
Bottou, L.: Stochastic gradient descent tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 421–436. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_25
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
Cevher, V., Vũ, B.C.: On the linear convergence of the stochastic gradient method with constant step-size. Optim. Lett. 13(5), 1177–1187 (2019)
Cha, J., Lee, J., Yun, C.: Tighter lower bounds for shuffling SGD: random permutations and beyond. arXiv preprint arXiv:2303.07160 (2023)
Craven, B.D., Glover, B.M.: Invex functions and duality. J. Aust. Math. Soc. 39(1), 1–20 (1985)
Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 (2018)
Gower, R.M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., Richtárik, P.: SGD: general analysis and improved rates. In: International Conference on Machine Learning, pp. 5200–5209. PMLR (2019)
Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: Why random reshuffling beats stochastic gradient descent. Math. Program. 186(1), 49–84 (2021)
Haochen, J., Sra, S.: Random shuffling beats SGD after finite epochs. In: International Conference on Machine Learning, pp. 2624–2633. PMLR (2019)
Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz Condition. In: Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds.) ECML PKDD 2016. LNCS (LNAI), vol. 9851, pp. 795–811. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46128-1_50
Koloskova, A., Doikov, N., Stich, S.U., Jaggi, M.: Shuffle SGD is always better than SGD: improved analysis of SGD with arbitrary data orders. arXiv preprint arXiv:2305.19259 (2023)
Lai, Z., Lim, L.H.: Recht-Ré noncommutative arithmetic-geometric mean conjecture is false. In: International Conference on Machine Learning, pp. 5608–5617. PMLR (2020)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Li, X., Milzarek, A., Qiu, J.: Convergence of random reshuffling under the Kurdyka-Łojasiewicz inequality. arXiv preprint arXiv:2110.04926 (2021)
Liu, C., Zhu, L., Belkin, M.: Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Appl. Comput. Harmon. Anal. 59, 85–116 (2022)
Loizou, N., Vaswani, S., Laradji, I.H., Lacoste-Julien, S.: Stochastic Polyak step-size for SGD: an adaptive learning rate for fast convergence. In: International Conference on Artificial Intelligence and Statistics, pp. 1306–1314. PMLR (2021)
Łojasiewicz, S.: A topological property of real analytic subsets. Coll. du CNRS, Les équations aux dérivées partielles 117(87–89), 2 (1963)
Lu, Y., Guo, W., De Sa, C.: GraB: finding provably better data permutations than random reshuffling (2023)
Ma, S., Zhou, Y.: Understanding the impact of model incoherence on convergence of incremental SGD with random reshuffle. In: International Conference on Machine Learning, pp. 6565–6574. PMLR (2020)
Mishchenko, K., Khaled, A., Richtárik, P.: Random reshuffling: simple analysis with vast improvements. Adv. Neural. Inf. Process. Syst. 33, 17309–17320 (2020)
Mishkin, A.: Interpolation, growth conditions, and stochastic gradient descent. Ph.D. thesis, University of British Columbia (2020)
Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Adv. Neural. Inf. Process. Syst. 24 (2011)
Nagaraj, D., Jain, P., Netrapalli, P.: SGD without replacement: sharper rates for general smooth convex functions. In: International Conference on Machine Learning, pp. 4703–4711. PMLR (2019)
Needell, D., Ward, R., Srebro, N.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Adv. Neural. Inf. Process. Syst. 27 (2014)
Nguyen, L.M., Tran-Dinh, Q., Phan, D.T., Nguyen, P.H., Van Dijk, M.: A unified convergence analysis for shuffling-type gradient methods. J. Mach. Learn. Res. 22(1), 9397–9440 (2021)
Oymak, S., Soltanolkotabi, M.: Overparameterized nonlinear learning: gradient descent takes the shortest path? In: International Conference on Machine Learning, pp. 4951–4960. PMLR (2019)
Polyak, B.T.: Gradient methods for the minimisation of functionals. USSR Comput. Math. Math. Phys. 3(4), 864–878 (1963)
Polyak, B., Tsypkin, Y.Z.: Pseudogradient adaptation and training algorithms. Autom. Remote. Control. 34, 45–67 (1973)
Rajput, S., Gupta, A., Papailiopoulos, D.: Closing the convergence gap of SGD without replacement. In: International Conference on Machine Learning, pp. 7964–7973. PMLR (2020)
Recht, B., Ré, C.: Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. In: Conference on Learning Theory, pp. 11.1–11.24. JMLR Workshop and Conference Proceedings (2012)
Safran, I., Shamir, O.: How good is SGD with random shuffling? In: Conference on Learning Theory, pp. 3250–3284. PMLR (2020)
Safran, I., Shamir, O.: Random shuffling beats SGD only after many epochs on ill-conditioned problems. Adv. Neural. Inf. Process. Syst. 34, 15151–15161 (2021)
Schmidt, M., Roux, N.L.: Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370 (2013)
Soltanolkotabi, M., Javanmard, A., Lee, J.D.: Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Trans. Inf. Theory 65(2), 742–769 (2018)
Vaswani, S., Bach, F., Schmidt, M.: Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1195–1204. PMLR (2019)
Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., Lacoste-Julien, S.: Painless stochastic gradient: interpolation, line-search, and convergence rates. Adv. Neural. Inf. Process. Syst. 32 (2019)
Acknowledgments
This work was partially supported by the Canada CIFAR AI Chair Program and the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants RGPIN-2021-03677 and RGPIN-2022-03669.
Ethics declarations
Ethical Statement
The contribution is the theoretical analysis of an existing algorithm, so it does not have direct societal or ethical implications.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fan, C., Thrampoulidis, C., Schmidt, M. (2023). Fast Convergence of Random Reshuffling Under Over-Parameterization and the Polyak-Łojasiewicz Condition. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol. 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_18
DOI: https://doi.org/10.1007/978-3-031-43421-1_18
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43420-4
Online ISBN: 978-3-031-43421-1
eBook Packages: Computer Science, Computer Science (R0)