A sparsity preserving stochastic gradient methods for sparse regression

Computational Optimization and Applications

Abstract

We propose a new stochastic first-order algorithm for solving sparse regression problems. In each iteration, our algorithm utilizes a stochastic oracle of the subgradient of the objective function. Our algorithm is based on a stochastic version of the estimate sequence technique introduced by Nesterov (Introductory lectures on convex optimization: a basic course, Kluwer, Amsterdam, 2003). The convergence rate of our algorithm depends continuously on the noise level of the gradient. In particular, in the limiting case of noiseless gradient, the convergence rate of our algorithm is the same as that of optimal deterministic gradient algorithms. We also establish some large deviation properties of our algorithm. Unlike existing stochastic gradient methods with optimal convergence rates, our algorithm has the advantage of readily enforcing sparsity at all iterations, which is a critical property for applications of sparse regressions.
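The following sketch illustrates, in code form, the sparsity-preserving mechanism highlighted above. It is not the paper's Algorithm 1 (an accelerated, estimate-sequence based scheme); it is only a generic proximal stochastic gradient loop for the lasso, included to show how a soft-thresholding step applied after each noisy gradient step keeps every iterate sparse. The mini-batch oracle, step size, and the names prox_sgd_lasso and soft_threshold are illustrative assumptions.

import numpy as np

# Illustrative sketch (not the paper's Algorithm 1): generic proximal
# stochastic gradient descent for min_x (1/(2n))||Ax - b||^2 + lam*||x||_1.

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1: shrinks each coordinate toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_sgd_lasso(A, b, lam, n_iter=1000, step=0.01, batch=10, seed=0):
    # Each iteration queries a noisy (mini-batch) gradient of the smooth part
    # and then applies soft-thresholding, so every iterate is sparse:
    # coordinates below the threshold are exactly zero.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_iter):
        idx = rng.choice(n, size=batch, replace=False)   # stochastic oracle
        g = A[idx].T @ (A[idx] @ x - b[idx]) / batch     # noisy gradient estimate
        x = soft_threshold(x - step * g, step * lam)     # sparsity-enforcing step
    return x

# Small synthetic sparse-regression example.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 50))
x_true = np.zeros(50)
x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(200)
x_hat = prox_sgd_lasso(A, b, lam=0.1)
print("nonzero coefficients:", np.count_nonzero(x_hat))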

Notes

  1. Although Lemma 5 in [6] is stated for \(\eta_{i}=\eta\sqrt{i+1}\), its conclusion and proof remain valid whenever \(\eta_{i}\) is positive and nondecreasing in i.

References

  1. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  2. Chung, K.L.: On a stochastic approximation method. Ann. Math. Stat. 25, 463–483 (1954)

  3. Cotter, A., Shamir, O., Srebro, N., Sridharan, K.: Better mini-batch algorithms via accelerated gradient methods. In: Advances in Neural Information Processing Systems (NIPS) (2011)

  4. Dekel, O., Gilad-Bachrach, R., Shamir, O., Xiao, L.: Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res. 13, 165–202 (2012)

  5. Duchi, J., Singer, Y.: Efficient online and batch learning using forward-backward splitting. J. Mach. Learn. Res. 10, 2873–2898 (2009)

  6. Duchi, J.C., Bartlett, P.L., Wainwright, M.J.: Randomized smoothing for stochastic optimization. SIAM J. Optim. 22(2), 674–701 (2012)

  7. Ermoliev, Y.: Stochastic quasigradient methods and their application to system optimization. Stochastics 9, 1–36 (1983)

  8. Gaivoronski, A.A.: Nonstationary stochastic programming problems. Kybernetika 4, 89–92 (1978)

  9. Hu, C., Kwok, J.T., Pan, W.: Accelerated gradient methods for stochastic optimization and online learning. In: Advances in Neural Information Processing Systems (NIPS) (2009)

  10. Kiefer, J., Wolfowitz, J.: Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23, 462–466 (1952)

  11. Kushner, H.J., Yin, G.G.: Stochastic Approximation Algorithms and Applications. Springer, New York (2003)

  12. Lan, G.: An optimal method for stochastic composite optimization. Math. Program. 133(1), 365–397 (2012)

  13. Lan, G., Ghadimi, S.: Optimal stochastic approximation algorithm for strongly convex stochastic composite optimization, I: a generic algorithmic framework. SIAM J. Optim. 22(4), 1469–1492 (2012)

  14. Lan, G., Ghadimi, S.: Optimal stochastic approximation algorithm for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM J. Optim. 23(4), 2061–2069 (2013)

  15. Langford, J., Li, L., Zhang, T.: Sparse online learning via truncated gradient. J. Mach. Learn. Res. 10, 2873–2908 (2009)

  16. Lee, S., Wright, S.J.: Manifold identification in dual averaging methods for regularized stochastic online learning. J. Mach. Learn. Res. 13, 1705–1744 (2012)

  17. Liu, J., Ji, S., Ye, J.: Multi-task feature learning via efficient ℓ2,1-norm minimization. In: The Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI) (2009)

  18. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

  19. Nemirovski, A., Yudin, D.: On Cezari's convergence of the steepest descent method for approximating saddle point of convex-concave functions. Soviet Mathematics Doklady 19 (1978)

  20. Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)

  21. Nesterov, Y.E.: Gradient methods for minimizing composite objective function. Technical report, CORE (2007)

  22. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Amsterdam (2003)

  23. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)

  24. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120, 221–259 (2009)

  25. Pflug, G.C.: Optimization of Stochastic Models, the Interface Between Simulation and Optimization. Kluwer, Boston (1996)

  26. Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30, 838–855 (1992)

  27. Polyak, B.T.: New stochastic approximation type procedures. Automat. Telemekh. 7, 98–107 (1990)

  28. Pong, T.K., Tseng, P., Ji, S., Ye, J.: Trace norm regularization: reformulations, algorithms, and multi-task learning. SIAM J. Optim. 20(6), 3465–3489 (2010)

  29. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  30. Ruszczynski, A., Syski, W.: A method of aggregate stochastic subgradients with on-line stepsize rules for convex stochastic programming problems. Math. Program. Stud. 28, 113–131 (1986)

  31. Sacks, J.: Asymptotic distribution of stochastic approximation. Ann. Math. Stat. 29, 373–409 (1958)

  32. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288 (1996)

  33. Toh, K.-C., Yun, S.: An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pac. J. Optim. 6, 615–640 (2010)

  34. Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization. Technical report, University of Washington (2008)

  35. Xiao, L.: Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 11, 2543–2596 (2010)

  36. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. B 68, 49–67 (2006)

Author information

Correspondence to Javier Peña.

Appendix: Some technical proofs

This section presents the proofs of Lemma 2, Lemma 4, and Proposition 1.

A.1 Proof of Lemma 2

According to the definition of \(\widehat{x}\), there is an \(\eta\in \partial h(\widehat{x})\) such that

$$ \bigl\langle G(\bar{x})+\bar{L}(\widehat{x}-\bar {x})+\eta,x- \widehat{x} \bigr\rangle \geq0\quad \text{for all}\ x\in\mathbb{R}^n. $$
(56)

By the convexity of h(x), we have

$$ h(x)\geq h(\widehat{x})+ \langle\eta,x-\widehat{x} \rangle. $$
(57)

It then follows from (2), (56) and (57) that

$$\begin{aligned} &\phi(x)-\frac{\mu}{2}\|x-\bar{x}\|^2 \\ &\quad =f(x)-\frac{\mu}{2}\|x-\bar{x}\|^2+h(x) \\ &\quad \geq f(\bar{x})+ \bigl\langle \nabla f(\bar{x}),x-\bar{x} \bigr\rangle +h(\widehat{x})+ \langle\eta,x-\widehat{x} \rangle\quad \bigl(\text{by (2) and (57)}\bigr) \\ &\quad \geq f(\bar{x})+ \bigl\langle \nabla f(\bar{x}),x-\bar{x} \bigr\rangle +h(\widehat{x})- \bigl\langle G(\bar{x})+\bar{L}(\widehat {x}-\bar{x}),x- \widehat{x} \bigr\rangle \quad\bigl(\text{by (56)}\bigr) \\ &\quad \geq f(\widehat{x})-\frac{L}{2}\|\widehat{x}-\bar{x} \|^2-M\| \widehat{x}-\bar{x}\|- \bigl\langle \nabla f(\bar {x}),\widehat {x}-\bar{x} \bigr\rangle + \bigl\langle \nabla f(\bar{x}),x-\bar {x} \bigr\rangle +h(\widehat{x}) \\ &\qquad - \bigl\langle G(\bar{x},\xi)+\bar{L}(\widehat{x}-\bar{x}),x- \widehat{x} \bigr\rangle \quad\bigl(\text{by (2)}\bigr) \\ &\quad =\phi(\widehat{x})-\frac{L}{2}\|\widehat{x}-\bar{x} \|^2-M\| \widehat{x}-\bar{x}\|+ \langle \varDelta ,\widehat{x}-x \rangle- \bigl\langle \bar{L}(\widehat{x}-\bar{x}),x-\widehat{x} \bigr\rangle \\ &\qquad(\text{by the definition of $\varDelta $}) \\ &\quad =\phi(\widehat{x})-\frac{L}{2\bar{L}^2}\|g\|^2-M\| \widehat{x}-\bar{x}\|+ \langle \varDelta ,\widehat{x}-x \rangle+ \langle g,\bar{x}- \widehat{x} \rangle+ \langle g,x-\bar{x} \rangle\\ &\qquad(\text{by the definition of $g$}) \\ &\quad =\phi(\widehat{x})+\frac{2\bar{L}-L}{2\bar{L}^2}\|g\|^2+ \langle g,x-\bar{x} \rangle+ \langle \varDelta ,\widehat{x}-x \rangle-M\| \widehat{x}-\bar{x} \|. \end{aligned}$$

Notice that the inequalities above hold for any realization of the stochastic gradient G(x,ξ). Hence, (16) holds almost surely.  □

A.2 Proof of Lemma 4

By the choice of \(V_{0}(x)\), (20) holds for k=0 with \(V_{0}^{*}=\phi(x_{0})\), \(\epsilon_{0}=0\) and \(\delta_{0}=0\). Suppose we have \(V_{k}(x)=V_{k}^{*}+\frac{\gamma_{k}}{2}\|x-z_{k}\|^{2}+ \langle\epsilon _{k},x \rangle\). According to the updating equation (19), \(V_{k+1}(x)\) is also a quadratic function whose coefficient on \(\|x\|^2\) is \(\frac{(1-\alpha_{k})\gamma_{k}+\alpha_{k}\mu}{2}\). Therefore, \(V_{k+1}(x)\) can be represented as \(V_{k+1}(x)=V_{k+1}^{*}+\frac{\gamma_{k+1}}{2}\|x-z_{k+1}\|^{2}+ \langle\epsilon_{k+1},x \rangle\) with \(\gamma_{k+1}=(1-\alpha_{k})\gamma_{k}+\alpha_{k}\mu\) and some \(V_{k+1}^{*}\), \(z_{k+1}\) and \(\epsilon_{k+1}\).

It follows from (19) that

$$\begin{aligned} V_{k+1}(x) &=(1-\alpha_k)V_k^*+ \frac{(1-\alpha_k)\gamma_k}{2}\|x-z_k\| ^2+(1-\alpha_k) \langle\epsilon_k,x \rangle \\ &\quad +\alpha_k \biggl[\phi(x_{k+1})+ \langle g_k,x-y_k \rangle+\frac{2L_k-L}{2L_k^2}\|g_k \|^2+\frac{\mu}{2}\|x-y_k\| ^2 \\ &\quad + \langle \varDelta _k,x_{k+1}-x \rangle-M\|x_{k+1}-y_k \| \biggr] \\ &=\underbrace{\frac{(1-\alpha_k)\gamma_k}{2}\|x-z_k\|^2+ \frac {\alpha_k\mu}{2}\|x-y_k\|^2+\alpha_k \langle g_k,x-y_k \rangle}_{T_1} \\ &\quad +(1- \alpha_k) \langle\epsilon_k,x \rangle- \alpha_k \langle \varDelta _k,x \rangle+(1-\alpha_k)V_k^* \\ &\quad +\alpha_k \biggl[\phi(x_{k+1})+\frac {2L_k-L}{2L_k^2}\|g_k \|^2+ \langle \varDelta _k,x_{k+1} \rangle-M \|x_{k+1}-y_k\| \biggr] \end{aligned}$$
(58)

We take the term \(T_{1}=\frac{(1-\alpha_{k})\gamma_{k}}{2}\|x-z_{k}\| ^{2}+\frac{\alpha_{k}\mu}{2}\|x-y_{k}\|^{2}+\alpha_{k} \langle g_{k},x-y_{k} \rangle\) out of \(V_{k+1}(x)\) and consider it separately. Defining \(\gamma_{k+1}\) and \(z_{k+1}\) according to (21) and (23), we get

$$\begin{aligned} T_1 &=\frac{(1-\alpha_k)\gamma_k}{2}\|x-z_k \|^2+\frac{\alpha_k\mu }{2}\|x-y_k\|^2+ \alpha_k \langle g_k,x-y_k \rangle \\ &=\frac{\gamma_{k+1}}{2}\|x\|^2- \bigl\langle x,(1-\alpha_k) \gamma _kz_k+\alpha_k\mu y_k- \alpha_kg_k \bigr\rangle \\ &\quad +\frac{(1-\alpha_k)\gamma_k}{2} \|z_k\|^2+\frac{\alpha_k\mu}{2}\| y_k \|^2-\alpha_k \langle g_k,y_k \rangle \\ &=\frac{\gamma_{k+1}}{2}\|x-z_{k+1}\|^2-\frac{\gamma _{k+1}}{2}\biggl\| \frac{(1-\alpha_k)\gamma_kz_k+\alpha_k\mu y_k-\alpha_kg_k}{\gamma_{k+1}}\biggr\| ^2 \\ &\quad +\frac{(1-\alpha_k)\gamma_k}{2}\|z_k \|^2+\frac{\alpha_k\mu}{2}\| y_k\|^2- \alpha_k \langle g_k,y_k \rangle \\ &=\frac{\gamma_{k+1}}{2}\|x-z_{k+1}\|^2+\frac{\alpha_k(1-\alpha _k)\gamma_k\mu}{2\gamma_{k+1}} \|y_k-z_k\|^2-\frac{\alpha _k^2}{2\gamma_{k+1}} \|g_k\|^2 \\ &\quad +\frac{\alpha_k(1-\alpha_k)\gamma_k}{\gamma_{k+1}} \langle g_k, z_k-y_k \rangle. \end{aligned}$$
(59)
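For the reader's convenience, we note that the third equality above completes the square in x. It is an instance of the elementary identity, valid for any \(v\in\mathbb{R}^n\) and \(\gamma>0\),

$$ \frac{\gamma}{2}\|x\|^2- \langle x,v \rangle=\frac{\gamma}{2}\biggl\|x-\frac{v}{\gamma}\biggr\|^2-\frac{\|v\|^2}{2\gamma}, $$

applied with \(\gamma=\gamma_{k+1}\) and \(v=(1-\alpha_k)\gamma_kz_k+\alpha_k\mu y_k-\alpha_kg_k\), together with the definition of \(z_{k+1}\) in (23).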

(The last step follows from a straightforward, though somewhat tedious, algebraic expansion.) Therefore, replacing \(T_{1}\) in (58) with (59), we can rewrite \(V_{k+1}(x)\) as

$$\begin{aligned} V_{k+1}(x) =&\frac{\gamma_{k+1}}{2}\|x-z_{k+1}\|^2+ \underbrace{(1-\alpha_k) \langle\epsilon_k,x \rangle- \alpha_k \langle \varDelta _k,x \rangle}_{ \langle\epsilon_{k+1},x \rangle} \\ &{}+(1-\alpha_k)V_k^*+\alpha_k \phi(x_{k+1})+\biggl(\alpha_k\frac {2L_k-L}{2L_k^2}- \frac{\alpha_k^2}{2\gamma_{k+1}}\biggr)\|g_k\|^2 \\ &{}+\frac{\alpha_k(1-\alpha_k)\gamma_k}{\gamma_{k+1}}\biggl(\frac {\mu }{2}\|y_k-z_k \|^2+ \langle g_k,z_k-y_k \rangle \biggr) \\ &{}+\alpha_k \langle \varDelta _k,x_{k+1} \rangle-\alpha_kM\| x_{k+1}-y_k\| \end{aligned}$$

In order to represent \(V_{k+1}(x)\) in the form (20), it suffices to define \(\epsilon_{k+1}\) and \(V_{k+1}^{*}\) as in (22) and (24). □

A.3 Proof of Proposition 1

We prove (13) by induction. According to Lemma 4, \(V_{k}(x)\) is given by (20). Hence, we just need to prove \(V_{k}^{*}\geq\phi(x_{k})-B_{k}\). Since \(V_{0}^{*}=\phi(x_{0})\) and \(B_{0}=0\), (13) is true when k=0. Suppose (13) is true for k. We now show that it is also true for k+1.

Dropping the non-negative term \(\frac{\mu}{2}\|y_{k}-z_{k}\|^{2}\) in (24) and applying the assumption \(V_{k}^{*}\geq\phi(x_{k})-B_{k}\), we can bound \(V_{k+1}^{*}\) from below as

$$\begin{aligned} V_{k+1}^* \geq&(1-\alpha_k)\phi(x_k)-(1- \alpha_k)B_k+\alpha_k\phi(x_{k+1})+ \biggl(\alpha_k\frac{2L_k-L}{2L_k^2}-\frac{\alpha_k^2}{2\gamma _{k+1}}\biggr) \|g_k\|^2 \\ & {}+\frac{\alpha_k(1-\alpha_k)\gamma_k}{\gamma_{k+1}} \langle g_k,z_k-y_k \rangle+\alpha_k \langle \varDelta _k,x_{k+1} \rangle-\alpha_kM\|x_{k+1}-y_k\|. \end{aligned}$$
(60)

According to Lemma 2 (with \(\widehat{x} = x_{k+1}, \; \bar{x} = y_{k}\)) and (17), we have

$$\begin{aligned} \phi(x_k) \geq& \phi(x_{k+1})+ \langle g_k,x_k-y_k \rangle+\frac {2L_k-L}{2L_k^2} \|g_k\|^2+\frac{\mu}{2}\|x_k-y_k \|^2 \\ &{}+ \langle \varDelta _k,x_{k+1}-x_k \rangle-M\|x_{k+1}-y_k\| \\ \geq& \phi(x_{k+1})+ \langle g_k,x_k-y_k \rangle+\frac {2L_k-L}{2L_k^2}\|g_k\|^2\\ &{}+ \langle \varDelta _k,x_{k+1}-x_k \rangle-M \|x_{k+1}-y_k\|. \end{aligned}$$

Applying this to \(\phi(x_{k})\) in (60), it follows that

$$\begin{aligned} V_{k+1}^* \geq&\phi(x_{k+1})+ \biggl( \frac{2L_k-L}{2L_k^2}-\frac{\alpha _k^2}{2\gamma_{k+1}} \biggr)\|g_k \|^2-(1-\alpha_k)B_k\\ &{} +(1-\alpha_k) \biggl\langle g_k,\underbrace{\frac{\alpha_k\gamma _k}{\gamma_{k+1}}(z_k-y_k)+x_k-y_k}_{T_2} \biggr\rangle \\ &{}+ \bigl\langle \varDelta _k,x_{k+1}-(1- \alpha_k)x_k \bigr\rangle -M\| x_{k+1}-y_k \|. \end{aligned}$$

Due to (26), the term \(T_{2}\) above is zero. Hence, \(V_{k+1}^{*}\) can be bounded from below as

$$\begin{aligned} V_{k+1}^* \geq&\phi(x_{k+1})+ \biggl(\frac{2L_k-L}{2L_k^2}- \frac{\alpha _k^2}{2\gamma_{k+1}} \biggr)\|g_k\|^2-(1- \alpha_k)B_k \\ &{} + \langle \varDelta _k,x_{k+1}-y_k \rangle-M \|x_{k+1}-y_k\| + \bigl\langle \varDelta _k,y_k-(1- \alpha_k)x_k \bigr\rangle \\ \geq&\phi(x_{k+1})+ \biggl(\frac{2L_k-L}{2L_k^2}-\frac{\alpha _k^2}{2\gamma_{k+1}} \biggr)\|g_k\|^2-(1-\alpha_k)B_k \\ &{} -(\|\varDelta _k\|+M)\|x_{k+1}-y_k\|+ \bigl\langle \varDelta _k,y_k-(1-\alpha_k)x_k \bigr\rangle \\ &{}\quad\text{(Cauchy-Schwarz and (26))} \\ \geq&\phi(x_{k+1})-\frac{(\|\varDelta _k\|+M)^2}{4L_k^2A_k}-(1-\alpha _k)B_k+ \bigl\langle \varDelta _k,y_k-(1- \alpha_k)x_k \bigr\rangle \\ =&\phi(x_{k+1})-B_{k+1}. \end{aligned}$$

Here, the last inequality follows from the definition of \(A_{k}\) given by (28) and the following inequality

$$ A_k\|g_k\|^2-\|g_k\| \frac{\|\varDelta _k\|+M}{L_k}\geq-\frac{(\|\varDelta _k\|+M)^2}{4L_k^2A_k}. $$
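Indeed, writing \(t=\|g_k\|\) and \(b=(\|\varDelta _k\|+M)/L_k\), and assuming \(A_k>0\) (which we take to be ensured by the choice of \(A_k\) in (28)), the left-hand side is a convex quadratic in t whose minimum value yields the bound:

$$ A_kt^2-bt=A_k\biggl(t-\frac{b}{2A_k}\biggr)^2-\frac{b^2}{4A_k}\geq-\frac{b^2}{4A_k}\quad\text{for all}\ t\in\mathbb{R}. $$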

Hence, (13) holds for k+1. □


About this article

Cite this article

Lin, Q., Chen, X. & Peña, J. A sparsity preserving stochastic gradient methods for sparse regression. Comput Optim Appl 58, 455–482 (2014). https://doi.org/10.1007/s10589-013-9633-9
