Abstract
We propose a new stochastic first-order algorithm for solving sparse regression problems. In each iteration, our algorithm utilizes a stochastic oracle of the subgradient of the objective function. Our algorithm is based on a stochastic version of the estimate sequence technique introduced by Nesterov (Introductory lectures on convex optimization: a basic course, Kluwer, Amsterdam, 2003). The convergence rate of our algorithm depends continuously on the noise level of the gradient. In particular, in the limiting case of noiseless gradient, the convergence rate of our algorithm is the same as that of optimal deterministic gradient algorithms. We also establish some large deviation properties of our algorithm. Unlike existing stochastic gradient methods with optimal convergence rates, our algorithm has the advantage of readily enforcing sparsity at all iterations, which is a critical property for applications of sparse regressions.
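The sparsity-enforcing property emphasized above typically comes from applying a proximal (soft-thresholding) step at every iteration, as in lasso-type problems. The sketch below is illustrative only: it shows a generic stochastic proximal-gradient update on an ℓ1-regularized toy problem, not the estimate-sequence algorithm of this paper, and all function names and constants are our own choices.

```python
import numpy as np

def soft_threshold(v, tau):
    # Prox operator of tau*||.||_1: sets entries with |v_i| <= tau exactly
    # to zero, which is what keeps every iterate sparse.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_step(x, g, step, lam):
    # One stochastic proximal-gradient update for min_x f(x) + lam*||x||_1,
    # where g is a noisy (sub)gradient of f at x.
    return soft_threshold(x - step * g, step * lam)

# Toy problem: f(x) = 0.5*||x - b||^2 with additive gradient noise.
rng = np.random.default_rng(0)
b = np.array([3.0, 0.05, -2.0, 0.0])
lam, step = 0.5, 0.5
x = np.zeros_like(b)
for _ in range(200):
    g = (x - b) + 0.01 * rng.standard_normal(b.shape)  # stochastic oracle
    x = prox_step(x, g, step, lam)
# The minimizer here is soft_threshold(b, lam) = [2.5, 0, -1.5, 0]; every
# iterate x is sparse because the prox zeroes small coordinates at each step.
```

Note that sparsity holds at all iterates, not merely in the limit; averaging-based stochastic methods lose exactly this property, which is the motivation stated in the abstract.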
Notes
Although Lemma 5 in [6] is stated for \(\eta_{i}=\eta\sqrt{i+1}\), its conclusion and proof remain valid whenever η i is positive and nondecreasing in i.
References
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Chung, K.L.: On a stochastic approximation method. Ann. Math. Stat. 25, 463–483 (1954)
Cotter, A., Shamir, O., Srebro, N., Sridharan, K.: Better mini-batch algorithms via accelerated gradient methods. In: Advances in Neural Information Processing Systems (NIPS) (2011)
Dekel, O., Gilad-Bachrach, R., Shamir, O., Xiao, L.: Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res. 13, 165–202 (2012)
Duchi, J., Singer, Y.: Efficient online and batch learning using forward-backward splitting. J. Mach. Learn. Res. 10, 2873–2898 (2009)
Duchi, J.C., Bartlett, P.L., Wainwright, M.J.: Randomized smoothing for stochastic optimization. SIAM J. Optim. 22(2), 674–701 (2012)
Ermoliev, Y.: Stochastic quasigradient methods and their application to system optimization. Stochastics 9, 1–36 (1983)
Gaivoronski, A.A.: Nonstationary stochastic programming problems. Kybernetika 4, 89–92 (1978)
Hu, C., Kwok, J.T., Pan, W.: Accelerated gradient methods for stochastic optimization and online learning. In: Advances in Neural Information Processing Systems (NIPS) (2009)
Kiefer, J., Wolfowitz, J.: Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23, 462–466 (1952)
Kushner, H.J., Yin, G.G.: Stochastic Approximation Algorithms and Applications. Springer, New York (2003)
Lan, G.: An optimal method for stochastic composite optimization. Math. Program. 133(1), 365–397 (2012)
Lan, G., Ghadimi, S.: Optimal stochastic approximation algorithm for strongly convex stochastic composite optimization, I: a generic algorithmic framework. SIAM J. Optim. 22(4), 1469–1492 (2012)
Lan, G., Ghadimi, S.: Optimal stochastic approximation algorithm for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM J. Optim. 23(4), 2061–2069 (2013)
Langford, J., Li, L., Zhang, T.: Sparse online learning via truncated gradient. J. Mach. Learn. Res. 10, 2873–2908 (2009)
Lee, S., Wright, S.J.: Manifold identification in dual averaging methods for regularized stochastic online learning. J. Mach. Learn. Res. 13, 1705–1744 (2012)
Liu, J., Ji, S., Ye, J.: Multi-task feature learning via efficient ℓ1/ℓ2-norm minimization. In: The Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI) (2009)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Nemirovski, A., Yudin, D.: On Cesari's convergence of the steepest descent method for approximating saddle points of convex-concave functions. Sov. Math. Dokl. 19 (1978)
Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
Nesterov, Y.E.: Gradient methods for minimizing composite objective function. Technical report, CORE (2007)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Amsterdam (2003)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120, 221–259 (2009)
Pflug, G.C.: Optimization of Stochastic Models, the Interface Between Simulation and Optimization. Kluwer, Boston (1996)
Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30, 838–855 (1992)
Polyak, B.T.: New stochastic approximation type procedures. Automat. Telemekh. 7, 98–107 (1990)
Pong, T.K., Tseng, P., Ji, S., Ye, J.: Trace norm regularization: reformulations, algorithms, and multi-task learning. SIAM J. Optim. 20(6), 3465–3489 (2010)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Ruszczynski, A., Syski, W.: A method of aggregate stochastic subgradients with on-line stepsize rules for convex stochastic programming problems. Math. Program. Stud. 28, 113–131 (1986)
Sacks, J.: Asymptotic distribution of stochastic approximation. Ann. Math. Stat. 29, 373–409 (1958)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288 (1996)
Toh, K.-C., Yun, S.: An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pac. J. Optim. 6, 615–640 (2010)
Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization. Technical report, University of Washington (2008)
Xiao, L.: Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 11, 2543–2596 (2010)
Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. B 68, 49–67 (2006)
Appendix: Some technical proofs
This section presents the proofs of Lemma 2, Lemma 4, and Proposition 1.
1.1 A.1 Proof of Lemma 2
According to the definition of \(\widehat{x}\), there is an \(\eta\in \partial h(\widehat{x})\) such that
By the convexity of h(x), we have
It then follows from (2), (56) and (57) that
Notice that the inequalities above hold for any realization of stochastic gradient G(x,ξ). Hence, (16) holds almost surely. □
1.2 A.2 Proof of Lemma 4
By the choice of V 0(x), (20) holds for k=0 with \(V_{0}^{*}=\phi(x_{0})\), ϵ 0=0 and δ 0=0. Suppose we have \(V_{k}(x)=V_{k}^{*}+\frac{\gamma_{k}}{2}\|x-z_{k}\|^{2}+ \langle\epsilon _{k},x \rangle\). According to the updating equation (19), V k+1(x) is also a quadratic function whose coefficient on ∥x∥2 is \(\frac{(1-\alpha_{k})\gamma_{k}+\alpha_{k}\mu}{2}\). Therefore, V k+1(x) can be represented as \(V_{k+1}(x)=V_{k+1}^{*}+\frac{\gamma_{k+1}}{2}\|x-z_{k+1}\|^{2}+ \langle\epsilon_{k+1},x \rangle\) with γ k+1=(1−α k )γ k +α k μ and some \(V_{k+1}^{*}\), z k+1 and ϵ k+1.
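The structural claim in this step, that the update keeps V quadratic with curvature γ k+1=(1−α k )γ k +α k μ, can be sanity-checked numerically in one dimension. The snippet below uses a simplified stand-in for the recursion (19): a convex combination of V k with a linearization of the objective plus a strong-convexity term. All numeric values are arbitrary test choices, not quantities from the paper.

```python
import numpy as np

# 1-D stand-in for the estimate-sequence update (19):
# V_{k+1}(x) = (1-a)*V_k(x) + a*(linear model at y_k + (mu/2)*(x-y_k)^2).
gamma_k, mu, a = 2.0, 0.5, 0.3        # arbitrary test values
z_k, y_k, eps_k, Vstar = 1.0, -0.5, 0.2, 0.7
g_k, phi_y = 0.4, 1.1                 # (sub)gradient and value at y_k

def V_k(x):
    # Quadratic form of V_k assumed in the induction hypothesis (20).
    return Vstar + 0.5 * gamma_k * (x - z_k) ** 2 + eps_k * x

def V_k1(x):
    linear = phi_y + g_k * (x - y_k)  # linearization of the objective at y_k
    return (1 - a) * V_k(x) + a * (linear + 0.5 * mu * (x - y_k) ** 2)

# Fit an exact quadratic through three points of V_{k+1}; its curvature is
# twice the leading coefficient and should equal (1-a)*gamma_k + a*mu.
pts = [-1.0, 0.0, 1.0]
coeffs = np.polyfit(pts, [V_k1(t) for t in pts], 2)
gamma_k1 = 2 * coeffs[0]
```

Since three points determine a quadratic exactly, the fitted curvature matches the claimed γ k+1 up to floating-point error.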
It follows from (19) that
We take the term \(T_{1}=\frac{(1-\alpha_{k})\gamma_{k}}{2}\|x-z_{k}\| ^{2}+\frac{\alpha_{k}\mu}{2}\|x-y_{k}\|^{2}+\alpha_{k} \langle g_{k},x-y_{k} \rangle\) out of V k+1(x) and consider it separately. Defining γ k+1 and z k+1 according to (21) and (23), we get
(The last step follows from a straightforward, though somewhat tedious, algebraic expansion.) Therefore, replacing T 1 in (58) with (59), we can rewrite V k+1(x) as
In order to represent V k+1(x) by the formulation (20), it suffices to define ϵ k+1 and \(V_{k+1}^{*}\) as (22) and (24). □
1.3 A.3 Proof of Proposition 1
We prove (13) by induction. According to Lemma 4, V k (x) is given by (20). Hence, it suffices to prove \(V_{k}^{*}\geq\phi(x_{k})-B_{k}\). Since \(V_{0}^{*}=\phi(x_{0})\) and B 0=0, (13) holds for k=0. Suppose (13) holds for k; we show it also holds for k+1.
Dropping the non-negative term \(\frac{\mu}{2}\|y_{k}-z_{k}\|^{2}\) in (24) and applying the assumption \(V_{k}^{*}\geq\phi(x_{k})-B_{k}\), we can bound \(V_{k+1}^{*}\) from below as
According to Lemma 2 (with \(\widehat{x} = x_{k+1}, \; \bar{x} = y_{k}\)) and (17), we have
Applying this to ϕ(x k ) in (60), it follows that
Due to (26), the term T 2 above is zero. Hence, \(V_{k+1}^{*}\) can be bounded from below as
Here, the last inequality is from the definition of A k given by (28) and the following inequality
Hence, (13) holds for k+1. □
Cite this article
Lin, Q., Chen, X. & Peña, J. A sparsity preserving stochastic gradient methods for sparse regression. Comput Optim Appl 58, 455–482 (2014). https://doi.org/10.1007/s10589-013-9633-9