
Fast Convergence of Random Reshuffling Under Over-Parameterization and the Polyak-Łojasiewicz Condition

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14172)

Abstract

Modern machine learning models are often over-parameterized and, as a result, can interpolate the training data. In this setting, we study the convergence properties of a sampling-without-replacement variant of stochastic gradient descent (SGD) known as random reshuffling (RR). Unlike SGD, which samples data with replacement at every iteration, RR chooses a random permutation of the data at the beginning of each epoch, and each iteration within the epoch processes the next sample from that permutation. For under-parameterized models, it has been shown that RR can converge faster than SGD under certain assumptions. However, previous works do not show that RR outperforms SGD in over-parameterized settings except in some highly restrictive scenarios. For the class of Polyak-Łojasiewicz (PL) functions, we show that RR can outperform SGD in over-parameterized settings when either of the following holds: (i) the number of samples (\(n\)) is less than the product of the condition number (\(\kappa \)) and the parameter (\(\alpha \)) of a weak growth condition (WGC), or (ii) \(n\) is less than the parameter (\(\rho \)) of a strong growth condition (SGC).
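
To make the sampling difference between SGD and RR concrete, below is a minimal sketch (not the authors' code) of one epoch of each method on a generic finite-sum objective, followed by a toy over-parameterized least-squares instance. The function names (sgd_epoch, rr_epoch, grad_fi), the step size, and the problem dimensions are illustrative choices, not taken from the paper.

```python
import numpy as np

def sgd_epoch(w, grad_fi, n, lr, rng):
    """One 'epoch' of with-replacement SGD: n steps, each on a uniformly sampled index."""
    for _ in range(n):
        i = rng.integers(n)              # sample with replacement
        w = w - lr * grad_fi(w, i)
    return w

def rr_epoch(w, grad_fi, n, lr, rng):
    """One epoch of random reshuffling (RR): one permutation per epoch, swept in order."""
    for i in rng.permutation(n):         # sample without replacement
        w = w - lr * grad_fi(w, i)
    return w

# Toy interpolation problem: least squares with n < d, so some w fits every example
# exactly (the over-parameterized setting studied in the paper).
rng = np.random.default_rng(0)
n, d = 20, 50
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)                     # labels generated by a planted model
grad_fi = lambda w, i: (A[i] @ w - b[i]) * A[i]    # gradient of f_i(w) = 0.5*(a_i^T w - b_i)^2

w = np.zeros(d)
for _ in range(50):
    w = rr_epoch(w, grad_fi, n, lr=0.01, rng=rng)  # swap in sgd_epoch to compare
print("mean squared residual:", np.mean((A @ w - b) ** 2))
```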


Notes

  1.

    Following previous conventions in the literature [7, 26, 35], we take \(\kappa = \Theta (\frac{1}{\mu })\) for comparisons. (Standard statements of the PL condition and the growth conditions referenced in the abstract are collected after these notes.)

  2.

    Since the first version of this work was released, Koloskova et al. [14] have shown that SGD with arbitrary data orderings, including RR, converges at least as fast as with-replacement SGD in a general nonconvex setting, irrespective of the number of epochs. However, only sublinear rates are possible in this general setting.

  3.

    In this setting we could solve the problem by simply applying gradient descent to any individual function and ignoring all other training examples.

  4.

    The usual “weak strong convexity” assumption is that, for each \(f_i\), the strong-convexity inequality holds with respect to the projection onto the set of minimizers of \(f_i\); this holds for least squares. Ma and Zhou [22] instead use the projection onto the intersection of this set with the set of minimizers of \(f\), for which the inequality does not hold in general for least squares.

  5.

    By taking \(B=0\) in Mishchenko et al.’s Theorem 4 [23].

  6.

    We include the \(\alpha \le \rho \) result in Appendix A since it is shown under stronger assumptions [38] or stated differently [23] in prior work.
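
For reference, standard statements of the conditions referred to in the abstract and in notes 1 and 4 are given below. These are the forms commonly used in the interpolation literature (e.g., [13, 36, 38]); the exact constants in the paper may differ, and the notation here is ours. Let \(f = \frac{1}{n}\sum _{i=1}^{n} f_i\) be \(L\)-smooth, and let \(\mathbb {E}_i\) denote expectation over an index \(i\) drawn uniformly from \(\{1,\dots ,n\}\). The PL condition with constant \(\mu > 0\) (giving condition number \(\kappa = L/\mu \)) is

\[ \tfrac{1}{2}\,\Vert \nabla f(x)\Vert ^2 \;\ge \; \mu \,\bigl (f(x) - f^*\bigr ). \]

The strong growth condition with parameter \(\rho \) and the weak growth condition with parameter \(\alpha \) are, respectively,

\[ \mathbb {E}_i\,\Vert \nabla f_i(x)\Vert ^2 \;\le \; \rho \,\Vert \nabla f(x)\Vert ^2 \qquad \text {and}\qquad \mathbb {E}_i\,\Vert \nabla f_i(x)\Vert ^2 \;\le \; 2\alpha L\,\bigl (f(x) - f^*\bigr ). \]

Finally, the weak strong convexity inequality of note 4 states that, for the point \(x_p\) obtained by projecting \(x\) onto the chosen solution set,

\[ f_i(x_p) \;\ge \; f_i(x) + \langle \nabla f_i(x),\, x_p - x\rangle + \tfrac{\mu }{2}\,\Vert x_p - x\Vert ^2. \]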

References

  1. Ahn, K., Yun, C., Sra, S.: SGD with shuffling: optimal rates without component convexity and large epoch requirements. Adv. Neural. Inf. Process. Syst. 33, 17526–17535 (2020)

  2. Bassily, R., Belkin, M., Ma, S.: On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564 (2018)

  3. Bottou, L.: Curiously fast convergence of some stochastic gradient descent algorithms. In: Proceedings of the Symposium on Learning and Data Science, Paris, vol. 8, pp. 2624–2633 (2009)

  4. Bottou, L.: Stochastic gradient descent tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 421–436. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_25

  5. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  6. Cevher, V., Vũ, B.C.: On the linear convergence of the stochastic gradient method with constant step-size. Optim. Lett. 13(5), 1177–1187 (2019)

  7. Cha, J., Lee, J., Yun, C.: Tighter lower bounds for shuffling SGD: random permutations and beyond. arXiv preprint arXiv:2303.07160 (2023)

  8. Craven, B.D., Glover, B.M.: Invex functions and duality. J. Aust. Math. Soc. 39(1), 1–20 (1985)

  9. Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 (2018)

  10. Gower, R.M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., Richtárik, P.: SGD: general analysis and improved rates. In: International Conference on Machine Learning, pp. 5200–5209. PMLR (2019)

  11. Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: Why random reshuffling beats stochastic gradient descent. Math. Program. 186(1), 49–84 (2021)

  12. Haochen, J., Sra, S.: Random shuffling beats SGD after finite epochs. In: International Conference on Machine Learning, pp. 2624–2633. PMLR (2019)

  13. Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz Condition. In: Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds.) ECML PKDD 2016. LNCS (LNAI), vol. 9851, pp. 795–811. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46128-1_50

  14. Koloskova, A., Doikov, N., Stich, S.U., Jaggi, M.: Shuffle SGD is always better than SGD: improved analysis of SGD with arbitrary data orders. arXiv preprint arXiv:2305.19259 (2023)

  15. Lai, Z., Lim, L.H.: Recht-Ré noncommutative arithmetic-geometric mean conjecture is false. In: International Conference on Machine Learning, pp. 5608–5617. PMLR (2020)

  16. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

  17. Li, X., Milzarek, A., Qiu, J.: Convergence of random reshuffling under the Kurdyka-Łojasiewicz inequality. arXiv preprint arXiv:2110.04926 (2021)

  18. Liu, C., Zhu, L., Belkin, M.: Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Appl. Comput. Harmon. Anal. 59, 85–116 (2022)

  19. Loizou, N., Vaswani, S., Laradji, I.H., Lacoste-Julien, S.: Stochastic polyak step-size for SGD: an adaptive learning rate for fast convergence. In: International Conference on Artificial Intelligence and Statistics, pp. 1306–1314. PMLR (2021)

  20. Łojasiewicz, S.: A topological property of real analytic subsets. In: Colloques du CNRS, Les équations aux dérivées partielles, vol. 117, pp. 87–89 (1963)

  21. Lu, Y., Guo, W., De Sa, C.: GraB: finding provably better data permutations than random reshuffling (2023)

  22. Ma, S., Zhou, Y.: Understanding the impact of model incoherence on convergence of incremental SGD with random reshuffle. In: International Conference on Machine Learning, pp. 6565–6574. PMLR (2020)

  23. Mishchenko, K., Khaled, A., Richtárik, P.: Random reshuffling: simple analysis with vast improvements. Adv. Neural. Inf. Process. Syst. 33, 17309–17320 (2020)

  24. Mishkin, A.: Interpolation, growth conditions, and stochastic gradient descent. Ph.D. thesis, University of British Columbia (2020)

  25. Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Adv. Neural. Inf. Process. Syst. 24 (2011)

  26. Nagaraj, D., Jain, P., Netrapalli, P.: SGD without replacement: sharper rates for general smooth convex functions. In: International Conference on Machine Learning, pp. 4703–4711. PMLR (2019)

  27. Needell, D., Ward, R., Srebro, N.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Adv. Neural. Inf. Process. Syst. 27 (2014)

  28. Nguyen, L.M., Tran-Dinh, Q., Phan, D.T., Nguyen, P.H., Van Dijk, M.: A unified convergence analysis for shuffling-type gradient methods. J. Mach. Learn. Res. 22(1), 9397–9440 (2021)

  29. Oymak, S., Soltanolkotabi, M.: Overparameterized nonlinear learning: gradient descent takes the shortest path? In: International Conference on Machine Learning, pp. 4951–4960. PMLR (2019)

  30. Polyak, B.T.: Gradient methods for the minimisation of functionals. USSR Comput. Math. Math. Phys. 3(4), 864–878 (1963)

  31. Polyak, B., Tsypkin, Y.Z.: Pseudogradient adaptation and training algorithms. Autom. Remote. Control. 34, 45–67 (1973)

  32. Rajput, S., Gupta, A., Papailiopoulos, D.: Closing the convergence gap of SGD without replacement. In: International Conference on Machine Learning, pp. 7964–7973. PMLR (2020)

  33. Recht, B., Ré, C.: Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. In: Conference on Learning Theory, pp. 11.1–11.24. JMLR Workshop and Conference Proceedings (2012)

  34. Safran, I., Shamir, O.: How good is SGD with random shuffling? In: Conference on Learning Theory, pp. 3250–3284. PMLR (2020)

  35. Safran, I., Shamir, O.: Random shuffling beats SGD only after many epochs on ill-conditioned problems. Adv. Neural. Inf. Process. Syst. 34, 15151–15161 (2021)

  36. Schmidt, M., Roux, N.L.: Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370 (2013)

  37. Soltanolkotabi, M., Javanmard, A., Lee, J.D.: Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Trans. Inf. Theory 65(2), 742–769 (2018)

  38. Vaswani, S., Bach, F., Schmidt, M.: Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1195–1204. PMLR (2019)

  39. Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., Lacoste-Julien, S.: Painless stochastic gradient: interpolation, line-search, and convergence rates. Adv. Neural. Inf. Process. Syst. 32 (2019)

Acknowledgments

This work was partially supported by the Canada CIFAR AI Chair Program, and the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants RGPIN-2021-03677 and RGPIN-2022-03669.

Author information


Corresponding author

Correspondence to Chen Fan.


Ethics declarations

Ethical Statement

The contribution is the theoretical analysis of an existing algorithm, so it does not have direct societal or ethical implications.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Fan, C., Thrampoulidis, C., Schmidt, M. (2023). Fast Convergence of Random Reshuffling Under Over-Parameterization and the Polyak-Łojasiewicz Condition. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol. 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_18

  • DOI: https://doi.org/10.1007/978-3-031-43421-1_18

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43420-4

  • Online ISBN: 978-3-031-43421-1

  • eBook Packages: Computer Science (R0)
