
On proximal gradient method for the convex problems regularized with the group reproducing kernel norm


Abstract

We consider a class of nonsmooth convex optimization problems in which the objective function is the composition of a strongly convex differentiable function with a linear mapping, regularized by the group reproducing kernel norm. This class of problems arises naturally in applications such as group Lasso, a popular technique for variable selection. An effective approach to solving such problems is the proximal gradient method. In this paper we derive efficient algorithms for this class of convex problems, and we analyze the convergence of the algorithm and of its subalgorithm.
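
For concreteness, the following minimal sketch shows one proximal gradient step for the group-Lasso instance of this problem class, assuming \(h(y)=\frac{1}{2}\Vert y-b\Vert ^2\) and \(B_J=I\) so that the proximal map reduces to block soft-thresholding; the function names and the toy data are illustrative only and are not taken from the paper.

    import numpy as np

    def prox_group_norm(v, groups, weights, step):
        # Block soft-thresholding: prox of step * sum_J w_J * ||v_J||  (assumes B_J = I).
        out = np.zeros_like(v)
        for J, w in zip(groups, weights):
            nrm = np.linalg.norm(v[J])
            if nrm > step * w:
                out[J] = (1.0 - step * w / nrm) * v[J]
        return out

    def prox_gradient_step(x, A, b, groups, weights, step):
        # One proximal gradient step for  min_x 0.5*||Ax - b||^2 + sum_J w_J*||x_J||.
        grad = A.T @ (A @ x - b)                  # gradient of the smooth part h(Ax)
        return prox_group_norm(x - step * grad, groups, weights, step)

    # Toy usage on a small random instance.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 6))
    b = rng.standard_normal(20)
    groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
    step = 1.0 / np.linalg.norm(A, 2) ** 2        # 1/L with L = largest eigenvalue of A^T A
    x = np.zeros(6)
    for _ in range(300):
        x = prox_gradient_step(x, A, b, groups, [1.0, 1.0], step)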


References

  1. Bach, F.: Consistency of the group Lasso and multiple kernel learning. J. Mach. Learn. Res. 9, 1179–1225 (2008)

  2. Bakin, S.: Adaptive Regression and Model Selection in Data Mining Problems. PhD Thesis. Australian National University, Canberra (1999)

  3. Boikanyo, O.A., Morosanu, G.: Four parameter proximal point algorithms. Nonlinear Anal. 74, 544–555 (2011)

  4. Boikanyo, O.A., Morosanu, G.: Inexact Halpern-type proximal point algorithm. J. Glob. Optim. 51, 11–26 (2011). doi: 10.1007/s10898-010-9616-7

  5. Cartis, C., Gould, N.I.M., Toint, Ph.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. Ser. A 127, 245–295 (2011)

  6. Combettes, P.L., Pesquet, J.C.: Proximal Splitting Methods in Signal Processing. arXiv:0912.3522v4 [math.OC] 18 May (2010)

  7. Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul. 4, 1168–1200 (2005)

  8. Eckstein, J., Bertsekas, D.P.: On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55, 293–318 (1992)

  9. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1359 (2001)

  10. Friedman, J., Hastie, T., Tibshirani, R.: A Note on the Group Lasso and a Sparse Group Lasso. arXiv:1001.0736v1 [math.ST] 5 Jan (2010)

  11. Kelley, C.T.: Iterative Methods for Linear and Nonlinear Equations. Society for Industrial and Applied Mathematics, Philadelphia (1995)

  12. Kim, D., Sra, S., Dhillon, I.: A scalable trust-region algorithm with application to mixed-norm regression. In: International Conference on Machine Learning (ICML), p. 1 (2010)

  13. Liu, J., Ji, S., Ye, J.: SLEP: Sparse Learning with Efficient Projections. Arizona State University (2009)

  14. Luo, Z.Q., Tseng, P.: On the linear convergence of descent methods for convex essentially smooth minimization. SIAM J. Control Optim. 30(2), 408–425 (1992)

  15. Ma, S., Song, X., Huang, J.: Supervised group Lasso with applications to microarray data analysis. BMC Bioinform. 8(1), 60 (2007)

  16. Meier, L., Van De Geer, S., Buhlmann, P.: The group Lasso for logistic regression. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 70(1), 53–71 (2008)

  17. Nesterov, Y.: Introductory Lectures on Convex Optimization. Kluwer, Boston (2004)

  18. Calamai, P.H., Moré, J.J.: Projected gradient methods for linearly constrained problems. Math. Program. 39, 93–116 (1987)

  19. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)

  20. Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer, New York (1998)

  21. Roth, V., Fischer, B.: The group-Lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In: Proceedings of the 25th International Conference on Machine Learning, pp. 848–855, ACM (2008)

  22. Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58, 267–288 (1996)

  23. Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. 117, 387–423 (2009)

  24. Tseng, P.: Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. 125(2), 263–295 (2010)

  25. Van den Berg, E., Schmidt, M., Friedlander, M., Murphy, K.: Group Sparsity Via Linear-Time Projection. Technical Report TR-2008-09, Department of Computer Science, University of British Columbia (2008)

  26. Wright, S., Nowak, R., Figueiredo, M.: Sparse reconstruction by separable approximation. IEEE Trans. Signal Process. 57(7), 2479–2493 (2009)

  27. Yang, H., Xu, Z., King, I., Lyu, M.: Online learning for group Lasso. In: Proceedings of the 27th International Conference on Machine Learning (2010)

  28. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 68(1), 49–67 (2006)

  29. Zheng, H., Chen, S., Mo, Z., Huang, X.: Numerical Computation (in Chinese). Wuhan University Press, Wuhan (2004)

  30. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. B. 67(2), 301–320 (2005)

  31. Zou, H.: The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429 (2006)

Acknowledgments

This work was supported by the National Science Foundation of China, Grant No. 61179033, and by the National Science Foundation, Grant No. DMS-1015346. We would like to thank Professor Zhi-Quan Luo; part of this work was performed during a research visit by the first author to the University of Minnesota. We would also like to thank the reviewer for valuable suggestions and the editor for helpful assistance.

Author information

Corresponding author

Correspondence to Haibin Zhang.

Appendices

Appendix A: Proof of Theorem 2

By Lemma 2, since \(f_2\) is given by (11), it follows that

$$\begin{aligned} \bar{X}=\Big\{x\,\Big|\,\sum _{J\in \mathcal{J }} w_J\Vert B_J x_J \Vert = \min F -h(\bar{y}),\ Ax=\bar{y}\Big\}, \end{aligned}$$

so that \(\bar{X}\) is bounded (since \(w_{J}>0\) for all \(J\in \mathcal{J }\)), as well as closed and convex. Since \(F\) is convex and \(\bar{X}\) is bounded, it follows from [19, Theorem 8.7] that (36) holds. Next, we prove the EB condition (32).

We argue by contradiction. Suppose there exists a \(\zeta \ge \min F\) such that (32) fails to hold for all \(\kappa >0\) and \(\epsilon >0\). Then there exists a sequence \(x^{1},x^{2},\ldots \) in \(\mathfrak R ^{n}\backslash \bar{X}\) satisfying

$$\begin{aligned} F(x^{k})\le \zeta \,\,\forall k, \,\,~ {r^k}\rightarrow 0, \,\,\left\{ \frac{r^k}{\delta _{k}}\right\} \rightarrow 0, \end{aligned}$$
(37)

where for simplicity we let \(r^{k}:=r(x^k), \delta _{k}:=\Vert Bx^k-B\bar{x}^k\Vert \), and \(\bar{x}^k:=\arg \min _{s\in \bar{X}}\Vert x^{k}-s\Vert \). Let

$$\begin{aligned} g^{k}:=\nabla f_2(x^k)=A^{T}\nabla h(Ax^k), \,\,\bar{g}:=A^{T}\nabla h(\bar{y}). \end{aligned}$$
(38)

By (35) and (38), \(A\bar{x}^k=\bar{y}\) and \(\nabla f_2(\bar{x}^k)=\bar{g}\) for all \(k\).

By (36) and (37), \(\{x^k\}\) is bounded. By further passing to a subsequence if necessary, we can assume that \(\{x^k\}\rightarrow \bar{x}\) for some \(\bar{x}\). Since \(\{r(x^k)\}=\{r^k\}\rightarrow 0\) and \(r\) is continuous by Lemma 1, this implies \(r(\bar{x})=0\), so \(\bar{x}\in \bar{X}\). Hence \(\Vert x^k-\bar{x}^k\Vert \le \Vert x^k-\bar{x}\Vert \rightarrow 0\) and \(\delta _k\le \Vert B\Vert \Vert x^k-\bar{x}^k\Vert \rightarrow 0\) as \(k\rightarrow \infty \), so that \(\{\bar{x}^k\}\rightarrow \bar{x}\) (since \(\Vert \bar{x}^k-\bar{x}\Vert \le \Vert \bar{x}^k-x^k\Vert +\Vert x^k-\bar{x}\Vert \rightarrow 0\)). Also, by (35) and (38), \(\{g^k\}\rightarrow \nabla f_2(\bar{x})= \bar{g}\).

Since \(f_1(x^k)\ge 0\), (11) implies \(h(Ax^k)=F(x^k)-f_1(x^k)\le F(x^k)\le \zeta \) for all \(k\). Since \(h(Ax^k)\le \zeta \) for all \(k\) and \(h(y)\rightarrow \infty \) whenever \(\Vert y\Vert \rightarrow \infty \), the sequence \(\{Ax^k\}\) is bounded, so \(\{Ax^k\}\) and \(\bar{y}\) lie in some compact convex subset \(Y\) of \(\mathfrak R ^{m}\). By our assumption on \(h\), \(h\) is strongly convex and \(\nabla h\) is Lipschitz continuous on \(Y\); in particular,

$$\begin{aligned} \mu \Vert y\!-\!\bar{y}\Vert ^{2}\le \langle \nabla h(y)\!-\!\nabla h(\bar{y}), y\!-\!\bar{y} \rangle \,\, \text{ and} \,\,\Vert \nabla h(y)\!-\!\nabla h(\bar{y})\Vert \le L \Vert y-\bar{y}\Vert ,~\forall y\!\in \! Y,\qquad \end{aligned}$$
(39)

for some \(0<\mu \le L\).

Let \(\bar{B}\) denote the \(N\times N\) block diagonal matrix whose diagonal blocks are \(B_J\), \(J\in \mathcal{J }\), and let \(B\) be obtained from \(\bar{B}\) by permuting rows and columns so that \((Bx)_J=B_Jx_J\) for all \(J\in \mathcal{J }\). By the nonsingularity of \(B_J\) (\(J\in \mathcal{J }\)), both \(\bar{B}\) and \(B\) are nonsingular. We claim that there exists \(\kappa > 0\) such that

$$\begin{aligned} \Vert Bx^k-B\bar{x}^k\Vert \le \kappa \Vert Ax^k-\bar{y}\Vert ,~\forall k. \end{aligned}$$
(40)

We argue this by contradiction. Suppose this is false. Then, by passing to a subsequence if necessary, we can assume that

$$\begin{aligned} \left\{ \frac{\Vert Ax^k-\bar{y}\Vert }{\Vert Bx^k-B\bar{x}^k\Vert }\right\} \rightarrow 0. \end{aligned}$$

Since \(\bar{y}=A\bar{x}^k\), this is equivalent to \(\{Au^k\}\rightarrow 0\), where we let

$$\begin{aligned} u^{k}:=\frac{x^k-\bar{x}^k}{\delta _{k}},~\forall k. \end{aligned}$$
(41)

Then \(\Vert Bu^k\Vert = 1\) for all \(k\). By further passing to a subsequence if necessary, we can assume that \(\{Bu^k\}\rightarrow B\bar{u}\) for some \(\bar{u}\). Then \(A\bar{u}=0\) and \(\Vert B\bar{u}\Vert =1\). Moreover,

$$\begin{aligned} Ax^k=A(\bar{x}^k+\delta _k u^k)=\bar{y}+\delta _{k} Au^k =\bar{y}+o(\delta _{k}). \end{aligned}$$

Since \(Ax^k\) and \(\bar{y}\) are in \(Y\), the Lipschitz continuity of \(\nabla h\) on \(Y\) [see (39) and (38)] yields

$$\begin{aligned} g^k= \bar{g} +o (\delta _{k}). \end{aligned}$$
(42)

By further passing to a subsequence if necessary, we can assume that, for each \(J\in \mathcal{J }\), either (a) \(\Vert B_J^{-T}(x^k_J-g^k_J)\Vert \le w_{J}\) for all \(k\) or (b) \(\Vert B_J^{-T}(x^k_J-g^k_J)\Vert >w_{J}\) and \(\bar{x}^k_J\ne 0\) for all \(k\) or (c) \(\Vert B_J^{-T}(x^k_J-g^k_J)\Vert >w_{J}\) and \(\bar{x}^k_J=0\) for all \(k\).

Case (a). In this case, Lemma 1 implies that \(r^k_J= -x^k_J\) for all \(k\). Since \(\{r^k\}\rightarrow 0\) and \(\{x^k\}\rightarrow \bar{x}\), this implies \(\bar{x}_J= 0\). Also, by (41) and (37), we have

$$\begin{aligned} u^k_J= \frac{-r^k_J - \bar{x}^k_J}{\delta _k}= \frac{o(\delta _k) - \bar{x}^k_J}{\delta _k}. \end{aligned}$$
(43)

Thus \(\bar{u}_J=- \mathop {\lim }\limits _{k \rightarrow \infty }\frac{\bar{x}^k_J}{\delta _k}\). Suppose \(\bar{u}_J \ne 0\). Then \(\bar{x}^k_J\ne 0\) for all \(k\) sufficiently large, i.e. \(B_J\bar{x}^k_J \ne 0\), so \(\bar{x}^k\in \bar{X}\) and the optimality condition for (11), (12) and \(\nabla f_2(\bar{x}^k)= \bar{g}\) imply

$$\begin{aligned} \bar{g}_J + w_J\frac{B^T_JB_J\bar{x}^k_J}{\Vert B_J\bar{x}^k_J\Vert } = 0, \end{aligned}$$
(44)

Since \(\bar{u}_J=- \mathop {\lim }\limits _{k \rightarrow \infty }\frac{\bar{x}^k_J}{\delta _k}\ne 0\), we have \(\mathop {\lim }\limits _{k \rightarrow \infty }\frac{B^T_JB_J\bar{x}^k_J}{\Vert B_J\bar{x}^k_J\Vert }=\mathop {\lim }\limits _{k \rightarrow \infty } \frac{B^T_JB_J(\bar{x}^k_J/\delta _k)}{\Vert B_J(\bar{x}^k_J/\delta _k)\Vert }=-\frac{B^T_JB_J\bar{u}_J}{\Vert B_J\bar{u}_J\Vert }\). Letting \(k \rightarrow \infty \) in (44) thus gives

$$\begin{aligned} \bar{g}_J - w_J\frac{B^T_JB_J\bar{u}_J}{\Vert B_J\bar{u}_J\Vert } = 0. \end{aligned}$$
(45)

Case (b). Since \(\bar{x}^k\in \bar{X}\) and \(\bar{x}^k_J \ne 0\) for all \(k\), the optimality condition for (11), (12) and \(\nabla f_2(\bar{x}^k)= \bar{g}\) imply that (44) holds for all \(k\). Then Lemma 1 implies

$$\begin{aligned} -r^k_J&= [\theta I + w_{J} B^{T}_{J}B_{J}]^{-1}(\theta g^k_{J}+w_{J} B^{T}_{J}B_{J}x^k_{J}) \nonumber \\ -[\theta I + w_{J} B^{T}_{J}B_{J}]r^k_J&= \theta g^k_{J}+w_{J} B^{T}_{J}B_{J}x^k_{J}\nonumber \\&= \theta (\bar{g}_J+ o(\delta _k))+ w_{J} B^{T}_{J}B_{J}\bar{x}^k_{J}+ \delta _kw_{J} B^{T}_{J}B_{J}u^k_{J} \nonumber \\&= \theta \bar{g}_J+ w_{J} B^{T}_{J}B_{J}\bar{x}^k_{J}+ \delta _kw_{J}B^{T}_{J}B_{J}u^k_{J}+\theta o(\delta _k) \nonumber \\&= \theta \bar{g}_J-\Vert B_J\bar{x}^k_J\Vert \bar{g}_J+ \delta _kw_{J}B^{T}_{J}B_{J}u^k_{J}+\theta o(\delta _k)\nonumber \\&= \Vert B_J(x^k_J+r^k_J)\Vert \bar{g}_J\!-\!\Vert B_J\bar{x}^k_J\Vert \bar{g}_J\!+\! \delta _kw_{J}B^{T}_{J}B_{J}u^k_{J}\!+\! \theta o(\delta _k),\qquad \end{aligned}$$
(46)

where the third equality uses (41) and (42); the fifth equality uses (44); the sixth equality uses the definition of \(\theta \). Dividing both sides by \(\delta _{k}\) and using (37) yields in the limit

$$\begin{aligned} 0=\bar{\theta }\cdot \bar{g}_J+w_{J} B^{T}_{J}B_{J}\bar{u}_{J}, \end{aligned}$$
(47)

where \(\bar{\theta }:=\mathop {\lim }\limits _{k \rightarrow \infty }\frac{\Vert B_J(x^k_J+r^k_J)\Vert -\Vert B_J\bar{x}^k_J\Vert }{\delta _k}\). Thus \(B^T_JB_J\bar{u}_J\) is a nonzero multiple of \(\bar{g}_J\).

Case (c). In this case, it follows from \(\{\bar{x}^k\}\rightarrow \bar{x}\) that \(\bar{x}_J= 0\). Since \(\Vert B_J^{-T}(x^k_J - g^k_J)\Vert >w_{J}\) for all \(k\), this implies \(\Vert B_J^{-T}\bar{g}_J\Vert \ge w_{J}\). Since \(\bar{x}\in \bar{X}\), the optimality condition \(0\in \bar{g}_{J}+ w_{J}B^T_J\partial \Vert \cdot \Vert (0)\) implies \(\Vert B_J^{-T}\bar{g}_J\Vert \le w_{J}\). Hence \(\Vert B_J^{-T}\bar{g}_J\Vert =w_{J}\). Then, arguing similarly to (46) and using \(\bar{x}^k_J= 0\),

$$\begin{aligned} -r^k_J&= [\theta I + w_{J} B^{T}_{J}B_{J}]^{-1}(\theta g^k_{J}+w_{J} B^{T}_{J}B_{J}x^k_{J})\nonumber \\ -[\theta I + w_{J} B^{T}_{J}B_{J}]r^k_J&= \theta g^k_{J}+w_{J} B^{T}_{J}B_{J}x^k_{J}\nonumber \\&= \theta \bar{g}_J+\delta _kw_{J}B^{T}_{J}B_{J}u^k_{J}+\theta o(\delta _k)\nonumber \\&= \Vert B_J(x^k_J+r^k_J)\Vert \bar{g}_J+\delta _kw_{J}B^{T}_{J}B_{J} u^k_{J}+\theta o(\delta _k). \end{aligned}$$
(48)

Dividing both sides by \(\delta _{k}\), one has

$$\begin{aligned} -[\theta I + w_{J} B^{T}_{J}B_{J}]\frac{r^k_J}{\delta _{k}}= \Vert B_J\frac{(x^k_J+r^k_J)}{\delta _{k}}\Vert \bar{g}_J+w_{J}B^{T}_{J} B_{J}u^k_{J}+\theta \frac{o(\delta _k)}{\delta _{k}}, \end{aligned}$$

using (37), (41) and \(\bar{x}^k_J= 0\) yields in the limit

$$\begin{aligned} 0= \Vert B_J\bar{u}_{J}\Vert \bar{g}_J + w_{J}B^{T}_{J}B_{J}\bar{u}_{J}. \end{aligned}$$
(49)

If \(\bar{u}_J\ne 0\), then \(\Vert B_J\bar{u}_J\Vert >0\), and (49) implies that \(B^T_JB_J\bar{u}_J\) is a negative multiple of \(\bar{g}_J\).

Since \(u^k=\frac{x^k-\bar{x}^k}{\delta _{k}}\) and \(\{u^k\}\rightarrow \bar{u}\ne 0\), we have \(\langle x^k-\bar{x}^k,\bar{u}\rangle > 0\) for all \(k\) sufficiently large. Fix any such \(k\) and let

$$\begin{aligned} \hat{x}:= \bar{x}^k + \epsilon \bar{u} \end{aligned}$$

with \(\epsilon > 0\). Since \(A\bar{u}= 0\), we have \(\nabla f_2(\hat{x})=A^T\nabla h(A\hat{x})=A^T\nabla h(A\bar{x}^k)= \nabla f_2(\bar{x}^k)=\bar{g}\). We show below that, for \(\epsilon > 0\) sufficiently small, \(\hat{x}\) satisfies

$$\begin{aligned} 0\in \bar{g}_J+w_J\partial \Vert B_J\hat{x}_J\Vert \end{aligned}$$
(50)

for all \(J\in \mathcal{J }\), and hence \(\hat{x}\in \bar{X}\). Then \(\langle x^k-\bar{x}^k,\bar{u}\rangle > 0\) and \(\Vert B\bar{u}\Vert = 1\) yield

$$\begin{aligned} \Vert x^k-\hat{x}\Vert ^{2}=\Vert x^k-\bar{x}^k-\epsilon \bar{u}\Vert ^{2}= \Vert x^k-\bar{x}^k\Vert ^{2}-2\epsilon \langle x^k-\bar{x}^k,\bar{u}\rangle +\epsilon ^{2}\Vert \bar{u}\Vert ^{2}<\Vert x^k-\bar{x}^k\Vert ^{2} \end{aligned}$$

for all \(\epsilon > 0\) sufficiently small, which contradicts \(\bar{x}^k\) being the point in \(\bar{X}\) nearest to \(x^k\), and thus proves (40). For each \(J\in \mathcal{J }\), if \(\bar{u}_J= 0\), then \(\hat{x}_J= \bar{x}^k_J\) and (50) holds automatically (since \(\bar{x}^k\in \bar{X}\)). Suppose that \(\bar{u}_J\ne 0\). We prove (50) below by considering the three aforementioned cases (a), (b) and (c).

Case (a). Since \(\bar{u}_J\ne 0\), by (45), \(B^T_JB_J\bar{u}_J\) is a positive multiple of \(\bar{g}_{J}\). Also, by (44), \(B^T_JB_J\bar{x}^k_J\) is a negative multiple of \(\bar{g}_J\); hence \(B^T_JB_J\hat{x}_J\) is a negative multiple of \(\bar{g}_J\) for all \(\epsilon > 0\) sufficiently small. Suppose \(\bar{g}_J+ aB^T_JB_J\hat{x}_J =0\) with \(a>0\). Combining this with (44) (\(\bar{g}_J + w_J\frac{B^T_JB_J\bar{x}^k_J}{\Vert B_J\bar{x}^k_J\Vert } = 0\)) gives \(aB_J\hat{x}_J =w_J\frac{B_J\bar{x}^k_J}{\Vert B_J\bar{x}^k_J\Vert }\), so \(\frac{B_J\hat{x}_J}{\Vert B_J\hat{x}_J\Vert }= \frac{B_J\bar{x}^k_J}{\Vert B_J\bar{x}^k_J\Vert }\) and \(a=\frac{w_J}{\Vert B_J\hat{x}_J\Vert }\). Thus \(\hat{x}_J\) satisfies

$$\begin{aligned} \bar{g}_J +w_J\frac{B^T_JB_J\hat{x}_J}{\Vert B_J\hat{x}_J\Vert } = 0. \end{aligned}$$
(51)

Case (b). Since (44) holds, \(B^T_JB_J\bar{x}^k_J\) is a negative multiple of \(\bar{g}_J\). Also, \(B^T_JB_J\bar{u}_J\) is a nonzero multiple of \(\bar{g}_J\). An argument similar to that in case (a) shows that \(\hat{x}_J\) satisfies (50) for all \(\epsilon > 0\) sufficiently small.

Case (c). We have \(\bar{x}^k_J= 0\), and \(B^T_JB_J\bar{u}_J\) is a negative multiple of \(\bar{g}_J\). Hence \(B^T_JB_J\hat{x}_J\) is a negative multiple of \(\bar{g}_J\) for all \(\epsilon > 0\), so \(\hat{x}_J\) satisfies (50).

Obviously,

$$\begin{aligned} r^k\in \arg \min \limits _{d}\, f_1(x^k+d)+\frac{1}{2}\Vert d+g^k\Vert ^2 \end{aligned}$$

is equivalent to (by their optimality conditions)

$$\begin{aligned} r^k\in \arg \min \limits _{d}\langle g^k+r^k,d\rangle +f_1(x^k+d). \end{aligned}$$

Hence

$$\begin{aligned} \langle g^k+r^k,r^k\rangle +f_1(x^k+r^k)\le \langle g^k+r^k,\bar{x}^k-x^k\rangle +f_1(\bar{x}^k). \end{aligned}$$

Since \(\bar{x}^k\in \bar{X}\) and \(\nabla f_2(\bar{x}^k)=\bar{g}\), we have similarly that

$$\begin{aligned} \langle \bar{g}+0,0\rangle +f_1(\bar{x}^k+0)\le \langle \bar{g},x^k+r^k-\bar{x}^k\rangle +f_1(\bar{x}^k+x^k+r^k-\bar{x}^k), \end{aligned}$$

i.e.

$$\begin{aligned} f_1(\bar{x}^k)\le \langle \bar{g},x^k+r^k-\bar{x}^k\rangle +f_1(x^k+r^k). \end{aligned}$$

Adding the above two inequalities and simplifying yields

$$\begin{aligned} \langle g^k-\bar{g},x^k-\bar{x}^k\rangle + \Vert r^k\Vert ^{2}\le \langle \bar{g}-g^k,r^k\rangle +\langle r^k,\bar{x}^k-x^k\rangle . \end{aligned}$$

Since \(Ax^k\) and \(A\bar{x}^k=\bar{y}\) are in \(Y\), we also have from (38), (39) and (40) that

$$\begin{aligned} \langle g^k-\bar{g},x^k-\bar{x}^k\rangle&= \langle \nabla h(Ax^k)-\nabla h(\bar{y}),Ax^k-\bar{y}\rangle \nonumber \\&\ge \mu \Vert Ax^k-\bar{y}\Vert ^{2}\nonumber \\&\ge \frac{\mu }{\kappa ^{2}}\Vert B(x^k-\bar{x}^k)\Vert ^{2}. \end{aligned}$$
(52)

Combining these two inequalities and using (39) yields

$$\begin{aligned} \frac{\mu }{\kappa ^{2}}\Vert B(x^k-\bar{x}^k)\Vert ^{2}+\Vert r^k\Vert ^{2}&\le \langle \nabla h(\bar{y})-\nabla h(Ax^k),Ar^k\rangle +\langle r^k,\bar{x}^k-x^k\rangle \nonumber \\&\le L\Vert A\Vert ^{2}\Vert x^k-\bar{x}^k\Vert \Vert r^k\Vert +\Vert x^k-\bar{x}^k\Vert \Vert r^k\Vert . \end{aligned}$$
(53)

Thus

$$\begin{aligned} \frac{\mu }{\kappa ^{2}}\Vert B(x^k-\bar{x}^k)\Vert ^{2}&\le (L\Vert A\Vert ^{2}+1)\Vert x^k-\bar{x}^k\Vert \Vert r^k\Vert \nonumber \\&\le (L\Vert A\Vert ^{2}+1)\Vert B(x^k-\bar{x}^k)\Vert \Vert B^{-1}\Vert \Vert r^k\Vert \nonumber \\ \frac{\mu }{\kappa ^{2}}\Vert B(x^k-\bar{x}^k)\Vert&\le (L\Vert A\Vert ^{2}+1)\Vert B^{-1}\Vert \Vert r^k\Vert , \,\,\forall k. \end{aligned}$$
(54)

Dividing both sides by \(\Vert B(x^k-\bar{x}^k)\Vert =\delta _{k}\) and letting \(k\rightarrow \infty \) yields a contradiction to (37).
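
As a numerical illustration of the residual characterization and the inequality that follows it, the sketch below (assuming \(B_J=I\), so that the proximal map has the closed block soft-thresholding form, and \(h(y)=\frac{1}{2}\Vert y-b\Vert ^2\); all names and data are illustrative) computes \(r=\mathrm{prox}_{f_1}(x-\nabla f_2(x))-x\) and checks that \(\langle g+r,r\rangle +f_1(x+r)\le \langle g+r,d\rangle +f_1(x+d)\) for random \(d\).

    import numpy as np

    def prox_f1(v, groups, weights):
        # prox of f1(v) = sum_J w_J*||v_J||, assuming B_J = I (block soft-thresholding).
        out = np.zeros_like(v)
        for J, w in zip(groups, weights):
            nrm = np.linalg.norm(v[J])
            if nrm > w:
                out[J] = (1.0 - w / nrm) * v[J]
        return out

    def f1(v, groups, weights):
        return sum(w * np.linalg.norm(v[J]) for J, w in zip(groups, weights))

    rng = np.random.default_rng(1)
    A = rng.standard_normal((15, 6))
    b = rng.standard_normal(15)
    groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
    weights = [1.0, 1.0]

    x = rng.standard_normal(6)
    g = A.T @ (A @ x - b)                     # g = grad f2(x) for f2(x) = 0.5*||Ax - b||^2
    r = prox_f1(x - g, groups, weights) - x   # residual from the argmin characterization above

    # r minimizes d -> <g + r, d> + f1(x + d), so the inequality holds for every d.
    lhs = (g + r) @ r + f1(x + r, groups, weights)
    for _ in range(1000):
        d = rng.standard_normal(6)
        assert lhs <= (g + r) @ d + f1(x + d, groups, weights) + 1e-9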

Appendix B: Proof of Lemma 3

First, we state the following resolvent identity (see Lemma 3 in [3] or Lemma 2 in [4]).

For any \(\beta ,\gamma >0\) and \(x\in \mathfrak R ^n\), since the operator \(\partial \varphi \) is maximal monotone, the following resolvent identity holds:

$$\begin{aligned} \mathrm{prox}_{\beta \varphi }(x)=\mathrm{prox}_{\gamma \varphi }\left(\frac{\gamma }{\beta }x+\left(1-\frac{\gamma }{\beta }\right)\mathrm{prox}_{\beta \varphi }(x)\right). \end{aligned}$$
(55)

Using the resolvent identity (55) and the fact that the resolvent is nonexpansive, we get

$$\begin{aligned}&\Vert x-\mathrm{prox}_{\varphi }[x+d]\Vert \\&\quad \le \Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert +\Vert \mathrm{prox}_{\alpha \varphi }[x+\alpha d]-\mathrm{prox}_{\varphi }[x+d]\Vert \\&\quad =\Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert \\&\qquad +\left\Vert\mathrm{prox}_{\varphi }\left[\frac{1}{\alpha }(x+\alpha d)+\left(1-\frac{1}{\alpha }\right)\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\right]-\mathrm{prox}_{\varphi }[x+d]\right\Vert\\&\quad \le \Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert \\&\qquad +\left\Vert\left[\frac{1}{\alpha }(x+\alpha d)+\left(1-\frac{1}{\alpha }\right)\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\right]-[x+d]\right\Vert\\&\quad =\Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert +\left|1-\frac{1}{\alpha }\right| \Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert \\&\quad =\left(1+|1-\frac{1}{\alpha }|\right)\Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert . \end{aligned}$$

So, we have

$$\begin{aligned} \Vert x-\mathrm{prox}_{\varphi }[x+d]\Vert \le (1+|1-\frac{1}{\alpha }|) \Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert . \end{aligned}$$
(56)

Resolving the absolute value, we obtain:

when \(0<\alpha \le 1\), \(1+|1-\frac{1}{\alpha }|=\frac{1}{\alpha }\); and when \(\alpha >1\), \(1+|1-\frac{1}{\alpha }|=2-\frac{1}{\alpha }<2\).

Thus, we have

$$\begin{aligned} \Vert x-\mathrm{prox}_{\varphi }[x+d]\Vert \le \frac{1}{\tilde{\alpha }}\Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert , \end{aligned}$$

where \(\tilde{\alpha }:=\min \{\alpha , \frac{1}{2}\}\), for all \(\alpha >0\).

That is the conclusion of Lemma 3.
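
A minimal numerical check of the resolvent identity (55) and of the bound just proved, taking \(\varphi =\lambda \Vert \cdot \Vert _1\) so that \(\mathrm{prox}_{\beta \varphi }\) is coordinatewise soft-thresholding; the parameter values below are illustrative only.

    import numpy as np

    lam = 0.7                                 # phi(x) = lam * ||x||_1

    def prox(x, t):
        # prox_{t*phi}(x) for phi = lam*||.||_1: soft-thresholding at level lam*t.
        return np.sign(x) * np.maximum(np.abs(x) - lam * t, 0.0)

    rng = np.random.default_rng(2)
    x = rng.standard_normal(8)
    d = rng.standard_normal(8)

    # Resolvent identity (55).
    beta, gamma = 1.3, 0.4
    lhs = prox(x, beta)
    rhs = prox((gamma / beta) * x + (1.0 - gamma / beta) * lhs, gamma)
    assert np.allclose(lhs, rhs)

    # Bound of Lemma 3: ||x - prox_phi(x+d)|| <= (1/alpha_tilde)*||x - prox_{alpha*phi}(x+alpha*d)||.
    for alpha in (0.2, 0.8, 1.0, 3.0):
        alpha_tilde = min(alpha, 0.5)
        left = np.linalg.norm(x - prox(x + d, 1.0))
        right = np.linalg.norm(x - prox(x + alpha * d, alpha)) / alpha_tilde
        assert left <= right + 1e-12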

About this article

Cite this article

Zhang, H., Wei, J., Li, M. et al. On proximal gradient method for the convex problems regularized with the group reproducing kernel norm. J Glob Optim 58, 169–188 (2014). https://doi.org/10.1007/s10898-013-0034-5
