Abstract
In the field of machine learning, many large-scale optimization problems can be decomposed into the sum of two functions: a smooth function and a nonsmooth function with a simple proximal mapping. In light of this, our paper introduces a novel variant of the proximal stochastic quasi-Newton algorithm, grounded in three key components: (i) an adaptive sampling method that dynamically increases the sample size during the iteration process, thereby avoiding an overly rapid growth in sample size while mitigating the noise introduced by the stochastic approximation; (ii) a stochastic line search that ensures a sufficient decrease in the expected value of the objective function; and (iii) a stable update scheme for the stochastic modified limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm. For a general objective function, we prove that the limit points of the generated sequence are almost surely stationary points, and we analyze the convergence rate together with the number of gradient computations required. For a strongly convex objective function, we establish a global linear convergence rate and likewise examine the number of required gradient computations. Finally, numerical experiments demonstrate the robustness of the proposed method across various hyperparameter settings and establish its competitiveness with state-of-the-art methods.
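As a point of reference, the problem class and the basic step described in the abstract can be written as follows; the symbols below (\(F\), \(f_{i}\), \(h\), \(S_{k}\), \(B_{k}\), \(\alpha _{k}\)) are chosen only for this illustration and are not taken from the paper's own notation.
\[
\min_{x\in \mathbb{R}^{d}} F(x) = f(x) + h(x), \qquad f(x)=\frac{1}{n}\sum_{i=1}^{n} f_{i}(x),
\]
where each \(f_{i}\) is smooth and \(h\) is nonsmooth with an inexpensive proximal mapping. A proximal stochastic quasi-Newton step then approximately solves a subproblem of the form
\[
x_{k+1} \approx \mathop{\textrm{argmin}}_{x}\; \nabla f_{S_{k}}(x_{k})^{\top }(x-x_{k}) + \frac{1}{2\alpha _{k}}(x-x_{k})^{\top } B_{k}(x-x_{k}) + h(x),
\]
with \(\nabla f_{S_{k}}\) a subsampled gradient over the sample set \(S_{k}\), \(B_{k}\) a limited-memory quasi-Newton matrix, and \(\alpha _{k}\) a step size determined by the stochastic line search.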
Availability of Data and Materials
The datasets analysed during the current study are available via the links given in the paper.
References
Beck, A.: First-order methods in optimization. SIAM (2017)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)
Beiser, F., Keith, B., Urbainczyk, S., Wohlmuth, B.: Adaptive sampling strategies for risk-averse stochastic optimization with constraints. IMA J. Numer. Anal. 43(6), 3729–3765 (2023)
Berahas, A.S., Bollapragada, R., Nocedal, J.: An investigation of Newton-sketch and subsampled Newton methods. Optimiz. Methods Softw. 35(4), 661–680 (2020)
Berahas, A.S., Nocedal, J., Takáč, M.: A multi-batch L-BFGS method for machine learning. Adv. Neural Inf. Proc. Syst. 29, 16 (2016)
Bollapragada, R., Byrd, R.H., Nocedal, J.: Adaptive sampling strategies for stochastic optimization. SIAM J. Opt. 28(4), 3312–3343 (2018)
Bollapragada, R., Byrd, R.H., Nocedal, J.: Exact and inexact subsampled Newton methods for optimization. IMA J. Numer. Anal. 39(2), 545–578 (2019)
Bollapragada, R., Nocedal, J., Mudigere, D., Shi, H.J., Tang, P.T.P.: A progressive batching L-BFGS method for machine learning. In: International Conference on Machine Learning, pp. 620–629. PMLR (2018)
Bonettini, S., Loris, I., Porta, F., Prato, M.: Variable metric inexact line-search-based methods for nonsmooth optimization. SIAM J. Opt. 26(2), 891–921 (2016)
Botev, A., Ritter, H., Barber, D.: Practical Gauss-Newton optimisation for deep learning. In: International Conference on Machine Learning, pp. 557–565. PMLR (2017)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization methods for machine learning. Math. Progr. 134(1), 127–155 (2012)
Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Opt. 26(2), 1008–1031 (2016)
Byrd, R.H., Nocedal, J., Oztoprak, F.: An inexact successive quadratic approximation method for l-1 regularized optimization. Math. Progr. 157(2), 375–396 (2016)
Byrd, R.H., Nocedal, J., Schnabel, R.B.: Representations of quasi-Newton matrices and their use in limited memory methods. Math. Progr. 63(1), 129–156 (1994)
Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer (2011)
Defazio, A., Domke, J., et al.: Finito: A faster, permutable incremental gradient method for big data problems. In: International Conference on Machine Learning, pp. 1125–1133. PMLR (2014)
Di Serafino, D., Krejić, N., Krklec Jerinkić, N., Viola, M.: LSOS: Line-search second-order stochastic optimization methods for nonconvex finite sums. Math. Comput. 92(341), 1273–1299 (2023)
Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
Franchini, G., Porta, F., Ruggiero, V., Trombini, I.: A line search based proximal stochastic gradient algorithm with dynamical variance reduction. J. Sci. Comput. 94(1), 23 (2023)
Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: a generic algorithmic framework. SIAM J. Opt. 22(4), 1469–1492 (2012)
Goldman, R.: Curvature formulas for implicit curves and surfaces. Comput. Aided Geomet. Des. 22(7), 632–658 (2005)
Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, vol. 2, p. 103. Springer, New York (2009)
Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Adv. Neural Inf. Proc. Syst. 29, 16 (2016)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Proc. Syst. 26, 16 (2013)
Kanzow, C., Lechner, T.: Globalized inexact proximal Newton-type methods for nonconvex composite functions. Computat. Opt. Appl. 78(2), 377–410 (2021)
Lee, Cp., Wright, S.J.: Inexact successive quadratic approximation for regularized optimization. Computat. Opt. Appl. 72, 641–674 (2019)
Lee, J.D., Sun, Y., Saunders, M.A.: Proximal Newton-type methods for minimizing composite functions. SIAM J. Opt. 24(3), 1420–1443 (2014)
Li, D.H., Fukushima, M.: A modified BFGS method and its global convergence in nonconvex minimization. J. Computat. Appl. Math. 129(1–2), 15–35 (2001)
Li, D.H., Fukushima, M.: On the global convergence of the BFGS method for nonconvex unconstrained optimization problems. SIAM J. Opt. 11(4), 1054–1064 (2001)
Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Progr. 45(1–3), 503–528 (1989)
Mannel, F., Aggrawal, H.O., Modersitzki, J.: A structured L-BFGS method and its application to inverse problems. Inverse Problems (2023)
Miller, I., Miller, M., Freund, J.E.: John E. Freund's Mathematical Statistics with Applications, 8th edn. Pearson Education Limited (2014)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer Science & Business Media, Cham (2003)
Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: A novel method for machine learning problems using stochastic recursive gradient. In: International Conference on Machine Learning, pp. 2613–2621. PMLR (2017)
Nocedal, J.: Theory of algorithms for unconstrained optimization. Acta Numer. 1, 199–242 (1992)
Pham, N.H., Nguyen, L.M., Phan, D.T., Tran-Dinh, Q.: ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res. 21(1), 4455–4502 (2020)
Pilanci, M., Wainwright, M.J.: Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM J. Opt. 27(1), 205–245 (2017)
Robbins, H., Siegmund, D.: A convergence theorem for non negative almost supermartingales and some applications. In: Optimizing methods in statistics, pp. 233–257. Elsevier (1971)
Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for finite training sets. Adv. Neural Inf. Process. Syst. 25, 12 (2012)
Saratchandran, H., Chng, S.F., Ramasinghe, S., MacDonald, L., Lucey, S.: Curvature-aware training for coordinate networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13328–13338 (2023)
Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Progr. 162, 83–112 (2017)
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)
Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(l_{1}\) regularized loss minimization. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 929–936 (2009)
Shi, J., Yin, W., Osher, S., Sajda, P.: A fast hybrid algorithm for large-scale \(l_{1}\)-regularized logistic regression. J. Mach. Learn. Res. 11, 713–741 (2010)
Vapnik, V.: The Nature of Statistical Learning Theory. Springer Science & Business Media, Cham (1999)
Wang, J., Zhang, T.: Utilizing second order information in minibatch stochastic variance reduced proximal iterations. J. Mach. Learn. Res. 20(1), 1578–1633 (2019)
Wang, X., Ma, S., Goldfarb, D., Liu, W.: Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Opt. 27(2), 927–956 (2017)
Wang, X., Wang, X., Yuan, Yx.: Stochastic proximal quasi-Newton methods for non-convex composite optimization. Opt. Methods Softw. 34(5), 922–948 (2019)
Wang, X., Zhang, H.: Inexact proximal stochastic second-order methods for nonconvex composite optimization. Opt. Methods Softw. 35(4), 808–835 (2020)
Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006)
Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Opt. 24(4), 2057–2075 (2014)
Xie, Y., Bollapragada, R., Byrd, R., Nocedal, J.: Constrained and composite optimization via adaptive sampling methods. IMA J. Numer. Anal. 44(2), 680–709 (2024)
Xu, P., Roosta, F., Mahoney, M.W.: Second-order optimization for non-convex machine learning: An empirical study. In: Proceedings of the 2020 SIAM International Conference on Data Mining, pp. 199–207. SIAM (2020)
Xu, P., Yang, J., Roosta, F., Ré, C., Mahoney, M.W.: Sub-sampled Newton methods with non-uniform sampling. Adv. Neural Inf. Proc. Syst. 29 (2016)
Yang, M., Milzarek, A., Wen, Z., Zhang, T.: A stochastic extra-step quasi-Newton method for nonsmooth nonconvex optimization. Math. Progr., pp. 1–47 (2021)
Acknowledgements
The authors thank the anonymous referees for their careful reading and useful remarks and suggestions that improved the quality of the paper.
Funding
This work was partially supported by the National Natural Science Foundation of China (No. 11971078) and Graduate Research and Innovation Foundation of Chongqing, China (Grant No. CYB23009).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors have not disclosed any conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
1.1 Appendix A: Proof of Theorem 3.1
For the sake of discussion and without loss of generality, we assume that k is greater than the storage size m. Let the eigenvalues of matrix \(\bar{B}_{k}\) be denoted by \(\lambda _{1}, \lambda _{2}, \cdots , \lambda _{d}\), with \(\lambda _{\max }\) being the maximum eigenvalue and \(\lambda _{\min }\) being the minimum eigenvalue.
Lemma 7.1
Suppose that \(\bar{y}_{i}\) is updated through formula (3.16). Then, for all \(i \in \{0, 1,\cdots ,k\}\),
holds.
Proof
Since \(\Vert s_{i}\Vert \Vert \bar{y}_{i}\Vert \ge s_{i}^{\top }\bar{y}_{i}\) and, by (3.17), \(s_{i}^{\top }\bar{y}_{i} \ge c\Vert s_{i}\Vert ^2\), for all \(i \in \{0, 1,\cdots ,k\}\) we obtain
holds. Therefore, we get
From (3.16), (3.17) and the \(L\)-Lipschitz continuity of \(\nabla f_{i}\), the following inequality holds:
Furthermore, we can obtain
Due to (7.2) and (7.3), we can conclude (7.1). \(\square \)
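Since the displays (7.1)–(7.3) are not reproduced above, the following is a minimal sketch of the two-sided bound that this type of argument yields for Li–Fukushima-style modified BFGS pairs; the constant \(C\) is a placeholder for whatever Lipschitz-dependent bound \(\Vert \bar{y}_{i}\Vert \le C\Vert s_{i}\Vert \) follows from (3.16) and (3.17), and is an assumption of this sketch rather than the paper's exact constant.
\[
\frac{\Vert \bar{y}_{i}\Vert ^{2}}{s_{i}^{\top }\bar{y}_{i}} \ge \frac{\Vert \bar{y}_{i}\Vert ^{2}}{\Vert s_{i}\Vert \,\Vert \bar{y}_{i}\Vert } = \frac{\Vert \bar{y}_{i}\Vert }{\Vert s_{i}\Vert } \ge c
\qquad \text{and}\qquad
\frac{\Vert \bar{y}_{i}\Vert ^{2}}{s_{i}^{\top }\bar{y}_{i}} \le \frac{\Vert \bar{y}_{i}\Vert ^{2}}{c\Vert s_{i}\Vert ^{2}} \le \frac{C^{2}}{c}.
\]
Here the last inequality of the first chain uses \(\Vert s_{i}\Vert \Vert \bar{y}_{i}\Vert \ge s_{i}^{\top }\bar{y}_{i} \ge c\Vert s_{i}\Vert ^{2}\), i.e., \(\Vert \bar{y}_{i}\Vert \ge c\Vert s_{i}\Vert \).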
Lemma 7.2
The update formula for the matrix \(\bar{B}_{k}\) is given by (3.27), with \(\bar{B}_{k-m} = \frac{\bar{y}_{k-m-1}^{\top }\bar{y}_{k-m-1}}{s_{k-m-1}^{\top }\bar{y}_{k-m-1}} I\) as in (3.25). Then \(\bar{B}_{k}\) satisfies the following inequality:
where \(\lambda _{\max }\) is the maximum eigenvalue of \(\bar{B}_{k}\), \(d\) is the dimension, and \(m\) is the history storage size.
Proof
Since \(\bar{B}_{k}\) is positive definite (Remark 3.2) and \(\textrm{tr}(\bar{B}_{k})\) is the sum of all eigenvalues of matrix \(\bar{B}_{k}\), we can deduce that
where \(d\) is the dimension and \(m\) is the history storage size. \(\square \)
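For orientation, the trace argument behind this bound is the standard one for limited-memory BFGS matrices; a sketch, writing the \(m\)-fold recursion (3.27) with intermediate matrices \(\bar{B}^{(l)}\) and letting \(M\) denote an upper bound on \(\Vert \bar{y}_{j}\Vert ^{2}/(s_{j}^{\top }\bar{y}_{j})\) of the kind provided by Lemma 7.1 (both symbols are introduced only for this sketch), reads:
\[
\textrm{tr}\bigl(\bar{B}^{(l+1)}\bigr) = \textrm{tr}\bigl(\bar{B}^{(l)}\bigr) - \frac{\Vert \bar{B}^{(l)} s_{j}\Vert ^{2}}{s_{j}^{\top }\bar{B}^{(l)} s_{j}} + \frac{\Vert \bar{y}_{j}\Vert ^{2}}{s_{j}^{\top }\bar{y}_{j}}
\le \textrm{tr}\bigl(\bar{B}^{(l)}\bigr) + M, \qquad j = k-m+l,
\]
where the negative (nonnegative-valued) curvature term is dropped. After the \(m\) updates, and with \(\textrm{tr}(\bar{B}_{k-m}) \le dM\) from the scalar initialization (3.25),
\[
\lambda _{\max } \le \textrm{tr}(\bar{B}_{k}) \le (d+m)M .
\]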
Lemma 7.3
The update formula for the matrix \(\bar{B}_{k}\) is given by (3.27), with \(\bar{B}_{k-m} = \frac{\bar{y}_{k-m-1}^{\top }\bar{y}_{k-m-1}}{s_{k-m-1}^{\top }\bar{y}_{k-m-1}} I\) as in (3.25). Then \(\bar{B}_{k}\) satisfies the following inequality:
where \(\lambda _{\min }\) and \(\lambda _{\max }\) are, respectively, the minimum and maximum eigenvalues of \(\bar{B}_{k}\), \(d\) is the dimension, and \(m\) is the history storage size.
Proof
Because \(\bar{B}_{k}\) is a symmetric positive definite matrix (Remark 3.2) and \(\det (\bar{B}_{k})\) is the product of all eigenvalues of \(\bar{B}_{k}\), we can conclude that
Next, we establish a lower bound on \(\det (\bar{B}_{k})\). From (3.27) and (3.23), we obtain the following inequality:
where the third equality follows from the formula \(\det (I+u_1u_2^{\top }+u_3u_4^{\top })=(1+u_1^{\top }u_2)(1+u_3^{\top }u_4)-(u_1^{\top }u_4)(u_2^{\top }u_3)\). Therefore, combining this with inequality (7.4), we obtain
where \(d\) is the dimension and \(m\) is the history storage size. \(\square \)
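Likewise, the determinant recursion used here is the classical one for BFGS updates; a sketch with the same auxiliary notation \(\bar{B}^{(l)}\) as above, using \(s_{j}^{\top }\bar{y}_{j}\ge c\Vert s_{j}\Vert ^{2}\) from (3.17) and the trace bound \(\lambda _{\max }(\bar{B}^{(l)})\le (d+m)M\):
\[
\det \bigl(\bar{B}^{(l+1)}\bigr) = \det \bigl(\bar{B}^{(l)}\bigr)\,\frac{s_{j}^{\top }\bar{y}_{j}}{s_{j}^{\top }\bar{B}^{(l)} s_{j}}
\ge \det \bigl(\bar{B}^{(l)}\bigr)\,\frac{c\Vert s_{j}\Vert ^{2}}{(d+m)M\,\Vert s_{j}\Vert ^{2}}
= \det \bigl(\bar{B}^{(l)}\bigr)\,\frac{c}{(d+m)M},
\]
and since \(\det (\bar{B}_{k}) \le \lambda _{\min }\,\lambda _{\max }^{\,d-1}\), a lower bound on the determinant translates into a lower bound on \(\lambda _{\min }\).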
Corollary
Combining Lemmas 7.2 and 7.3, if we define
then
holds. Combining Remark 3.2 with Eq. (7.5), Theorem 3.1 is established.
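To make the conclusion explicit (the bounds \(\mu _{1},\mu _{2}\) below are placeholders, since the display (7.5) defining them is not reproduced above), the combined statement is typically of the form
\[
0 < \mu _{1}\, I \preceq \bar{B}_{k} \preceq \mu _{2}\, I \quad \text{for all } k,
\]
i.e., the eigenvalues of every \(\bar{B}_{k}\) are uniformly bounded away from zero and from above, which is the uniform positive definiteness property used in the convergence analysis.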
1.2 Appendix B: Subproblem Solution
The FISTA algorithm used in this paper follows [2, Section 4] with essentially no modifications.
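For completeness, the accelerated scheme of [2, Section 4] applied to the smooth part \(q\) of the subproblem takes the following form; the symbols \(q\), \(L_{q}\) (a Lipschitz constant of \(\nabla q\)) and \(u_{j}\) are notational choices made only for this sketch.
\[
x_{j} = \textrm{prox}_{\frac{1}{L_{q}}h}\Bigl(u_{j} - \tfrac{1}{L_{q}}\nabla q(u_{j})\Bigr),\qquad
t_{j+1} = \frac{1+\sqrt{1+4t_{j}^{2}}}{2},\qquad
u_{j+1} = x_{j} + \frac{t_{j}-1}{t_{j+1}}\bigl(x_{j}-x_{j-1}\bigr),
\]
with \(u_{1}=x_{0}\) and \(t_{1}=1\) in the cold-start version of [2].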
It should be noted that in our implementation we did not set a separate step size, and moreover, \(t_{j}\) does not restart from 1 each time. Because we have not altered the standard L-BFGS update format, the compact representation proposed in [15, Sections 2 and 3] allows us to rewrite Eq. (3.27) as follows:
where \(Y_{k} = [\bar{y}_{k-m}, \bar{y}_{k-m+1}, \cdots , \bar{y}_{k-1}]\), \(S_{k} = [s_{k-m}, s_{k-m+1}, \cdots , s_{k-1}]\), \(D_{k}\) is the \(m\times m\) diagonal matrix \(D_{k} = \text {diag} [s_{k-m}^{\top } \bar{y}_{k-m}, s_{k-m+1}^{\top } \bar{y}_{k-m+1},\cdots , s_{k-1}^{\top } \bar{y}_{k-1}]\), \(L_{k}\) is the \(m\times m\) matrix
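Since the displayed definition of \(L_{k}\) and the rewritten form of Eq. (3.27) are not reproduced above, we record for orientation the standard compact representation of [15] with an initial matrix \(\gamma _{k} I\) (the scalar \(\gamma _{k}\) stands in for the initialization (3.25) and is a placeholder of this sketch):
\[
(L_{k})_{i,j} =
\begin{cases}
s_{k-m-1+i}^{\top }\,\bar{y}_{k-m-1+j}, & i > j,\\
0, & \text{otherwise},
\end{cases}
\qquad
\bar{B}_{k} = \gamma _{k} I -
\begin{bmatrix} \gamma _{k} S_{k} & Y_{k} \end{bmatrix}
\begin{bmatrix} \gamma _{k} S_{k}^{\top }S_{k} & L_{k}\\ L_{k}^{\top } & -D_{k} \end{bmatrix}^{-1}
\begin{bmatrix} \gamma _{k} S_{k}^{\top }\\ Y_{k}^{\top } \end{bmatrix}.
\]
In this form, products with \(\bar{B}_{k}\) reduce to operations with \(d\times m\) and \(m\times m\) matrices, which is what keeps each FISTA iteration on the subproblem inexpensive.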
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, M., Li, S. A Proximal Stochastic Quasi-Newton Algorithm with Dynamical Sampling and Stochastic Line Search. J Sci Comput 102, 23 (2025). https://doi.org/10.1007/s10915-024-02748-2
Keywords
- Proximal stochastic methods
- Stochastic quasi-Newton methods
- Variance reduction
- Line search
- Machine learning