Abstract
Modern machine learning models are often over-parameterized and, as a result, can interpolate the training data. In this setting, we study the convergence properties of a sampling-without-replacement variant of stochastic gradient descent (SGD) known as random reshuffling (RR). Unlike SGD, which samples the data with replacement at every iteration, RR chooses a random permutation of the data at the beginning of each epoch, and each iteration within the epoch uses the next sample from that permutation. For under-parameterized models, it has been shown that RR can converge faster than SGD under certain assumptions. However, previous works do not show that RR outperforms SGD in over-parameterized settings except in some highly restrictive scenarios. For the class of Polyak-Łojasiewicz (PL) functions, we show that RR can outperform SGD in over-parameterized settings when either of the following holds: (i) the number of samples (n) is less than the product of the condition number (\(\kappa\)) and the parameter (\(\alpha\)) of a weak growth condition (WGC), or (ii) n is less than the parameter (\(\rho\)) of a strong growth condition (SGC).
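To make the sampling difference concrete, the following is a minimal sketch (not taken from the paper) of one epoch of with-replacement SGD versus RR on a generic finite-sum objective \(f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)\); the helper grad_fi, the least-squares example, and the step-size choice are illustrative assumptions.

import numpy as np

def sgd_epoch(w, grad_fi, n, step_size, rng):
    # With-replacement sampling: each of the n steps draws an independent index.
    for _ in range(n):
        i = rng.integers(n)
        w = w - step_size * grad_fi(w, i)
    return w

def rr_epoch(w, grad_fi, n, step_size, rng):
    # Random reshuffling: draw one permutation per epoch and sweep through it,
    # so every sample is used exactly once per epoch.
    for i in rng.permutation(n):
        w = w - step_size * grad_fi(w, i)
    return w

# Illustrative over-parameterized least-squares instance (d > n, so interpolation
# holds generically): f_i(w) = 0.5 * (x_i^T w - y_i)^2.
rng = np.random.default_rng(0)
n, d = 20, 50
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
grad_fi = lambda w, i: (X[i] @ w - y[i]) * X[i]
L_max = (np.linalg.norm(X, axis=1) ** 2).max()   # per-sample smoothness constant
w = np.zeros(d)
for epoch in range(100):
    w = rr_epoch(w, grad_fi, n, step_size=1.0 / L_max, rng=rng)

The two routines differ only in the index sequence they generate; the paper's analysis quantifies when the without-replacement sequence provably converges faster under interpolation.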
Notes
- 1.
- 2. Since the first version of this work was released, Koloskova et al. [14] have shown that SGD with arbitrary data orderings, including RR, converges at least as fast as SGD in a general nonconvex setting, irrespective of the number of epochs. However, only sublinear rates are possible in this general setting.
- 3. In this setting we could solve the problem by simply applying gradient descent to any individual function and ignoring all other training examples.
- 4. The usual "weak strong convexity" assumption is that, for each \(f_i\), the strong convexity inequality holds at the projection onto the set of minimizers of \(f_i\), which is the case for least squares. Ma and Zhou instead use the projection onto the intersection of this set with the set of minimizers of \(f\), which does not hold in general for least squares (a sketch of the two inequalities appears after this list).
- 5. By taking \(B=0\) in Mishchenko et al.'s Theorem 4.
- 6.
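To make footnote 4 concrete, here is a sketch of the two inequalities being contrasted; the notation (\(\mathcal{X}_i^*\) for the minimizers of \(f_i\), \(\mathcal{X}^*\) for the minimizers of \(f\), and \(\Pi_S\) for Euclidean projection onto a set \(S\)) is introduced only for illustration and is not taken from Ma and Zhou's paper. The version that holds for least squares requires, for each \(i\),
\[
f_i\big(\Pi_{\mathcal{X}_i^*}(x)\big) \;\ge\; f_i(x) + \big\langle \nabla f_i(x),\, \Pi_{\mathcal{X}_i^*}(x) - x \big\rangle + \tfrac{\mu}{2}\,\big\| \Pi_{\mathcal{X}_i^*}(x) - x \big\|^2 ,
\]
whereas Ma and Zhou require the same inequality with \(\Pi_{\mathcal{X}_i^*}\) replaced by \(\Pi_{\mathcal{X}_i^* \cap \mathcal{X}^*}\), the projection onto \(\mathcal{X}_i^* \cap \mathcal{X}^*\); this stronger requirement can fail for least squares.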
References
Ahn, K., Yun, C., Sra, S.: SGD with shuffling: optimal rates without component convexity and large epoch requirements. Adv. Neural. Inf. Process. Syst. 33, 17526–17535 (2020)
Bassily, R., Belkin, M., Ma, S.: On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564 (2018)
Bottou, L.: Curiously fast convergence of some stochastic gradient descent algorithms. In: Proceedings of the Symposium on Learning and Data Science, Paris, vol. 8, pp. 2624–2633 (2009)
Bottou, L.: Stochastic gradient descent tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 421–436. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_25
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
Cevher, V., Vũ, B.C.: On the linear convergence of the stochastic gradient method with constant step-size. Optim. Lett. 13(5), 1177–1187 (2019)
Cha, J., Lee, J., Yun, C.: Tighter lower bounds for shuffling SGD: random permutations and beyond. arXiv preprint arXiv:2303.07160 (2023)
Craven, B.D., Glover, B.M.: Invex functions and duality. J. Aust. Math. Soc. 39(1), 1–20 (1985)
Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 (2018)
Gower, R.M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., Richtárik, P.: SGD: general analysis and improved rates. In: International Conference on Machine Learning, pp. 5200–5209. PMLR (2019)
Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: Why random reshuffling beats stochastic gradient descent. Math. Program. 186(1), 49–84 (2021)
Haochen, J., Sra, S.: Random shuffling beats SGD after finite epochs. In: International Conference on Machine Learning, pp. 2624–2633. PMLR (2019)
Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz Condition. In: Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds.) ECML PKDD 2016. LNCS (LNAI), vol. 9851, pp. 795–811. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46128-1_50
Koloskova, A., Doikov, N., Stich, S.U., Jaggi, M.: Shuffle SGD is always better than SGD: improved analysis of SGD with arbitrary data orders. arXiv preprint arXiv:2305.19259 (2023)
Lai, Z., Lim, L.H.: Recht-Ré noncommutative arithmetic-geometric mean conjecture is false. In: International Conference on Machine Learning, pp. 5608–5617. PMLR (2020)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Li, X., Milzarek, A., Qiu, J.: Convergence of random reshuffling under the Kurdyka-Łojasiewicz inequality. arXiv preprint arXiv:2110.04926 (2021)
Liu, C., Zhu, L., Belkin, M.: Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Appl. Comput. Harmon. Anal. 59, 85–116 (2022)
Loizou, N., Vaswani, S., Laradji, I.H., Lacoste-Julien, S.: Stochastic Polyak step-size for SGD: an adaptive learning rate for fast convergence. In: International Conference on Artificial Intelligence and Statistics, pp. 1306–1314. PMLR (2021)
Łojasiewicz, S.: A topological property of real analytic subsets. Coll. du CNRS, Les équations aux dérivées partielles 117(87–89), 2 (1963)
Lu, Y., Guo, W., De Sa, C.: GraB: finding provably better data permutations than random reshuffling (2023)
Ma, S., Zhou, Y.: Understanding the impact of model incoherence on convergence of incremental SGD with random reshuffle. In: International Conference on Machine Learning, pp. 6565–6574. PMLR (2020)
Mishchenko, K., Khaled, A., Richtárik, P.: Random reshuffling: simple analysis with vast improvements. Adv. Neural. Inf. Process. Syst. 33, 17309–17320 (2020)
Mishkin, A.: Interpolation, growth conditions, and stochastic gradient descent. Ph.D. thesis, University of British Columbia (2020)
Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Adv. Neural. Inf. Process. Syst. 24 (2011)
Nagaraj, D., Jain, P., Netrapalli, P.: SGD without replacement: sharper rates for general smooth convex functions. In: International Conference on Machine Learning, pp. 4703–4711. PMLR (2019)
Needell, D., Ward, R., Srebro, N.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Adv. Neural. Inf. Process. Syst. 27 (2014)
Nguyen, L.M., Tran-Dinh, Q., Phan, D.T., Nguyen, P.H., Van Dijk, M.: A unified convergence analysis for shuffling-type gradient methods. J. Mach. Learn. Res. 22(1), 9397–9440 (2021)
Oymak, S., Soltanolkotabi, M.: Overparameterized nonlinear learning: gradient descent takes the shortest path? In: International Conference on Machine Learning, pp. 4951–4960. PMLR (2019)
Polyak, B.T.: Gradient methods for the minimisation of functionals. USSR Comput. Math. Math. Phys. 3(4), 864–878 (1963)
Polyak, B., Tsypkin, Y.Z.: Pseudogradient adaptation and training algorithms. Autom. Remote. Control. 34, 45–67 (1973)
Rajput, S., Gupta, A., Papailiopoulos, D.: Closing the convergence gap of SGD without replacement. In: International Conference on Machine Learning, pp. 7964–7973. PMLR (2020)
Recht, B., Ré, C.: Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. In: Conference on Learning Theory, pp. 11.1–11.24. JMLR Workshop and Conference Proceedings (2012)
Safran, I., Shamir, O.: How good is SGD with random shuffling? In: Conference on Learning Theory, pp. 3250–3284. PMLR (2020)
Safran, I., Shamir, O.: Random shuffling beats SGD only after many epochs on ill-conditioned problems. Adv. Neural. Inf. Process. Syst. 34, 15151–15161 (2021)
Schmidt, M., Roux, N.L.: Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370 (2013)
Soltanolkotabi, M., Javanmard, A., Lee, J.D.: Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Trans. Inf. Theory 65(2), 742–769 (2018)
Vaswani, S., Bach, F., Schmidt, M.: Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1195–1204. PMLR (2019)
Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., Lacoste-Julien, S.: Painless stochastic gradient: interpolation, line-search, and convergence rates. Adv. Neural. Inf. Process. Syst. 32 (2019)
Acknowledgments
This work was partially supported by the Canada CIFAR AI Chair Program and the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants RGPIN-2021-03677 and RGPIN-2022-03669.
Ethics declarations
Ethical Statement
The contribution is the theoretical analysis of an existing algorithm, so it does not have direct societal or ethical implications.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fan, C., Thrampoulidis, C., Schmidt, M. (2023). Fast Convergence of Random Reshuffling Under Over-Parameterization and the Polyak-Łojasiewicz Condition. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol. 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_18
DOI: https://doi.org/10.1007/978-3-031-43421-1_18
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43420-4
Online ISBN: 978-3-031-43421-1
eBook Packages: Computer Science, Computer Science (R0)