On proximal gradient method for the convex problems regularized with the group reproducing kernel norm

Zhang, Haibin; Wei, Juan; Li, Meixia; Zhou, Jie; Chao, Miantao

doi:10.1007/s10898-013-0034-5

Published: 22 January 2013

Volume 58, pages 169–188, (2014)

Journal of Global Optimization

Haibin Zhang¹,
Juan Wei¹,
Meixia Li¹,²,
Jie Zhou¹ &
…
Miantao Chao¹

Abstract

We consider a class of nonsmooth convex optimization problems where the objective function is the composition of a strongly convex differentiable function with a linear mapping, regularized by the group reproducing kernel norm. This class of problems arises naturally from applications in group Lasso, which is a popular technique for variable selection. An effective approach to solving such problems is the proximal gradient method. In this paper we derive efficient algorithms for this class of convex problems and study them theoretically, analyzing the convergence of the algorithm and of its subalgorithm.

References

1. Bach, F.: Consistency of the group Lasso and multiple kernel learning. J. Mach. Learn. Res. 9, 1179–1225 (2008)
2. Bakin, S.: Adaptive Regression and Model Selection in Data Mining Problems. PhD Thesis, Australian National University, Canberra (1999)
3. Boikanyo, O.A., Morosanu, G.: Four parameter proximal point algorithms. Nonlinear Anal. 74, 544–555 (2011)
4. Boikanyo, O.A., Morosanu, G.: Inexact Halpern-type proximal point algorithm. J. Glob. Optim. 51, 11–26 (2011). doi:10.1007/s10898-010-9616-7
5. Cartis, C., Gould, N.I.M., Toint, Ph.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. Ser. A 127, 245–295 (2011)
6. Combettes, P.L., Pesquet, J.C.: Proximal Splitting Methods in Signal Processing. arXiv:0912.3522v4 [math.OC] 18 May (2010)
7. Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul. 4, 1168–1200 (2005)
8. Eckstein, J., Bertsekas, D.P.: On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55, 293–318 (1992)
9. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1359 (2001)
10. Friedman, J., Hastie, T., Tibshirani, R.: A Note on the Group Lasso and a Sparse Group Lasso. arXiv:1001.0736v1 [math.ST] 5 Jan (2010)
11. Kelley, C.T.: Iterative Methods for Linear and Nonlinear Equations. Society for Industrial and Applied Mathematics, Philadelphia (1995)
12. Kim, D., Sra, S., Dhillon, I.: A scalable trust-region algorithm with application to mixed-norm regression. In: International Conference on Machine Learning (ICML), p. 1 (2010)
13. Liu, J., Ji, S., Ye, J.: SLEP: Sparse Learning with Efficient Projections. Arizona State University (2009)
14. Luo, Z.Q., Tseng, P.: On the linear convergence of descent methods for convex essentially smooth minimization. SIAM J. Control Optim. 30(2), 408–425 (1992)
15. Ma, S., Song, X., Huang, J.: Supervised group Lasso with applications to microarray data analysis. BMC Bioinform. 8(1), 60 (2007)
16. Meier, L., Van De Geer, S., Bühlmann, P.: The group Lasso for logistic regression. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 70(1), 53–71 (2008)
17. Nesterov, Y.: Introductory Lectures on Convex Optimization. Kluwer, Boston (2004)
18. Calamai, P.H., Moré, J.J.: Projected gradient methods for linearly constrained problems. Math. Program. 39, 93–116 (1987)
19. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
20. Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer, New York (1998)
21. Roth, V., Fischer, B.: The group-Lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In: Proceedings of the 25th International Conference on Machine Learning, pp. 848–855. ACM (2008)
22. Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58, 267–288 (1996)
23. Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. 117, 387–423 (2009)
24. Tseng, P.: Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. 125(2), 263–295 (2010)
25. Van den Berg, E., Schmidt, M., Friedlander, M., Murphy, K.: Group Sparsity Via Linear-Time Projection. Technical Report TR-2008-09, Department of Computer Science, University of British Columbia (2008)
26. Wright, S., Nowak, R., Figueiredo, M.: Sparse reconstruction by separable approximation. IEEE Trans. Signal Process. 57(7), 2479–2493 (2009)
27. Yang, H., Xu, Z., King, I., Lyu, M.: Online learning for group Lasso. In: 27th International Conference on Machine Learning (2010)
28. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 68(1), 49–67 (2006)
29. Zheng, H., Chen, S., Mo, Z., Huang, X.: Numerical Computation (in Chinese). Wuhan University Press, Wuhan (2004)
30. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67(2), 301–320 (2005)
31. Zou, H.: The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429 (2006)

Acknowledgments

This work was supported by the National Science Foundation of China, Grant No. 61179033, and by the National Science Foundation, Grant No. DMS-1015346. We thank Professor Zhi-Quan Luo; part of this work was performed during a research visit by the first author to the University of Minnesota. We also thank the reviewer for valuable suggestions and the editor for helpful assistance.

Author information

Authors and Affiliations

College of Applied Sciences, Beijing University of Technology, Beijing, 100124, China
Haibin Zhang, Juan Wei, Meixia Li, Jie Zhou & Miantao Chao
School of Mathematics and Information Science, Weifang University, Weifang, 261061, Shandong, China
Meixia Li

Corresponding author

Correspondence to Haibin Zhang.

Appendices

Appendix A: Proof of Theorem 2

By Lemma 2, since $f_2$ is given by (11), it follows that

$$\begin{aligned} \bar{X}=\left\{ x \,\Big|\, \sum _{J\in \mathcal{J }} w_J\Vert {B_J x_J }\Vert = \min F -h(\bar{y}),\ Ax=\bar{y}\right\} , \end{aligned}$$

so that $\bar{X}$ is bounded (since $w_{J}>0 $ for all $J\in \mathcal{J }$), as well as being closed convex. Since $F$ is convex and $\bar{X}$ is bounded, it follows from [19, Theorem 8.7] that (36) holds. Next, we prove below the EB condition (32).

We argue by contradiction. Suppose there exists a $\zeta \ge \min F$ such that (32) fails to hold for all $\kappa >0$ and $\epsilon >0$. Then there exists a sequence $x^{1},x^{2},\ldots $ in $\mathfrak R ^{n}\backslash \bar{X}$ satisfying

$$\begin{aligned} F(x^{k})\le \zeta \,\,\forall k, \,\,~ {r^k}\rightarrow 0, \,\,\left\{ \frac{r^k}{\delta _{k}}\right\} \rightarrow 0, \end{aligned}$$

(37)

where for simplicity we let $r^{k}:=r(x^k), \delta _{k}:=\Vert Bx^k-B\bar{x}^k\Vert $, and $\bar{x}^k:=\arg \min _{s\in \bar{X}}\Vert x^{k}-s\Vert $. Let

$$\begin{aligned} g^{k}:=\nabla f_2(x^k)=A^{T}\nabla h(Ax^k), \,\,\bar{g}:=A^{T}\nabla h(\bar{y}). \end{aligned}$$

(38)

By (35) and (38), $A\bar{x}^k=\bar{y}$ and $\nabla f_2(\bar{x}^k)=\bar{g}$ for all $k$.

By (36) and (37), $\{x^k\}$ is bounded. By further passing to a subsequence if necessary, we can assume that $\{x^k\}\rightarrow \bar{x}$ for some $\bar{x}$. Since $\{r(x^k)\}=\{r^k\}\rightarrow 0$ and $r$ is continuous by Lemma 1, this implies $r(\bar{x})=0$, so $\bar{x}\in \bar{X}$. Hence $\Vert x^k-\bar{x}^k\Vert \le \Vert x^k-\bar{x}\Vert \rightarrow 0$ and $\delta _k\le \Vert B\Vert \Vert x^k-\bar{x}^k\Vert \rightarrow 0$ as $k\rightarrow \infty $, so that $\{\bar{x}^k\}\rightarrow \bar{x}$ (by $\Vert \bar{x}^k-\bar{x}\Vert \le \Vert \bar{x}^k-x^k\Vert +\Vert x^k-\bar{x}\Vert \rightarrow 0$). Also, by (35) and (38), $\{g^k\}\rightarrow \nabla f_2(\bar{x})= \bar{g}$.

Since $f_1(x^k)\ge 0$, (11) implies $h(Ax^k)=F(x^k)-f_1(x^k)\le F(x^k)\le \zeta $ for all $k$. Since $\{Ax^k\}$ is bounded and $h(y)\rightarrow \infty $ whenever $\Vert y\Vert \rightarrow \infty $, this implies that $\{Ax^k\}$ and $\bar{y}$ lie in some compact convex subset $Y$ of $\mathfrak R ^{m}$. By our assumption on $h$, $h$ is strongly convex and $\nabla h$ is Lipschitz continuous on $Y$, so, in particular,

$$\begin{aligned} \mu \Vert y\!-\!\bar{y}\Vert ^{2}\le \langle \nabla h(y)\!-\!\nabla h(\bar{y}), y\!-\!\bar{y} \rangle \,\, \text{ and} \,\,\Vert \nabla h(y)\!-\!\nabla h(\bar{y})\Vert \le L \Vert y-\bar{y}\Vert ,~\forall y\!\in \! Y,\qquad \end{aligned}$$

(39)

for some $0<\mu \le L$.

Let $\bar{B}$ denote the $N\times N$ block diagonal matrix whose diagonal blocks are $B_J$, $J\in \mathcal{J }$. Let $B$ be obtained from $\bar{B}$ by permuting some rows and some columns so that $(Bx)_J=B_Jx_J$ for all $J\in \mathcal{J }$. Since each $B_J$ ($J\in \mathcal{J }$) is nonsingular, $\bar{B}$ and $B$ are both nonsingular. We claim that there exists $\kappa > 0$ such that

$$\begin{aligned} \Vert Bx^k-B\bar{x}^k\Vert \le \kappa \Vert Ax^k-\bar{y}\Vert ,~\forall k. \end{aligned}$$

(40)

We argue this by contradiction. Suppose this is false. Then, by passing to a subsequence if necessary, we can assume that

$$\begin{aligned} \left\{ \frac{\Vert Ax^k-\bar{y}\Vert }{\Vert Bx^k-B\bar{x}^k\Vert }\right\} \rightarrow 0. \end{aligned}$$

Since $\bar{y}=A\bar{x}^k$, this is equivalent to $\{Au^k\}\rightarrow 0$, where we let

$$\begin{aligned} u^{k}:=\frac{x^k-\bar{x}^k}{\delta _{k}},~\forall k. \end{aligned}$$

(41)

Then $\Vert Bu^k\Vert = 1$ for all $k$. By further passing to a subsequence if necessary, we can assume that $\{Bu^k\}\rightarrow B\bar{u}$ for some $\bar{u}$ (equivalently $\{u^k\}\rightarrow \bar{u}$, since $B$ is nonsingular). Then $A\bar{u}=0$ and $\Vert B\bar{u}\Vert =1$. Moreover,

$$\begin{aligned} Ax^k=A(\bar{x}^k+\delta _k u^k)=\bar{y}+\delta _{k} Au^k =\bar{y}+o(\delta _{k}). \end{aligned}$$

Since $Ax^k$ and $\bar{y}$ are in $Y$, the Lipschitz continuity of $\nabla h$ on $Y$ [see (39) and (38)] yields

$$\begin{aligned} g^k= \bar{g} +o (\delta _{k}). \end{aligned}$$

(42)

By further passing to a subsequence if necessary, we can assume that, for each $J\in \mathcal{J }$, either (a) $\Vert B_J^{-T}(x^k_J-g^k_J)\Vert \le w_{J}$ for all $k$ or (b) $\Vert B_J^{-T}(x^k_J-g^k_J)\Vert >w_{J}$ and $\bar{x}^k_J\ne 0$ for all $k$ or (c) $\Vert B_J^{-T}(x^k_J-g^k_J)\Vert >w_{J}$ and $\bar{x}^k_J=0$ for all $k$.
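The three cases can be made concrete with a small numerical sketch of the per-group residual $r_J$. It assumes that $r(x)$ is the unit-step proximal gradient residual, $r(x)=\mathrm{prox}_{f_1}(x-\nabla f_2(x))-x$, which is consistent with the formulas quoted from Lemma 1 in this appendix but is not restated here. The function name `group_residual` and the bisection on $\theta$ are illustrative choices, not the paper's subalgorithm.

```python
import numpy as np

def group_residual(x_J, g_J, B_J, w_J, tol=1e-12, max_iter=200):
    """Sketch of r_J = prox(x_J - g_J) - x_J for the penalty w_J * ||B_J u||_2.

    Assumes B_J is square and nonsingular, as in the appendix.
    """
    z = x_J - g_J
    # Case (a): ||B_J^{-T} z|| <= w_J, so the prox point is 0 and r_J = -x_J.
    if np.linalg.norm(np.linalg.solve(B_J.T, z)) <= w_J:
        return -x_J

    M = B_J.T @ B_J
    I = np.eye(len(z))

    def p_of(theta):
        # Candidate prox point: solves (theta*I + w_J*M) p = theta*z.
        return np.linalg.solve(theta * I + w_J * M, theta * z)

    # Cases (b)/(c): find theta > 0 with ||B_J p(theta)|| = theta by bisection.
    # phi(theta) = ||B_J p(theta)|| - theta is positive near 0 and
    # nonpositive at theta = ||B_J z||.
    lo, hi = 0.0, np.linalg.norm(B_J @ z)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(B_J @ p_of(mid)) > mid:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * max(1.0, hi):
            break
    return p_of(0.5 * (lo + hi)) - x_J
```

With $B_J=I$ the bisection is unnecessary, since the prox reduces to the usual block soft-thresholding $(1-w_J/\Vert z\Vert )z$; the general routine is kept here only to mirror the $[\theta I+w_JB_J^TB_J]^{-1}$ expression used in (46) and (48).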

Case (a). In this case, Lemma 1 implies that $r^k_J= -x^k_J$ for all $k$. Since $\{r^k\}\rightarrow 0$ and $\{x^k\}\rightarrow \bar{x}$, this implies $\bar{x}_J= 0$. Also, by (41) and (37), we have

$$\begin{aligned} u^k_J= \frac{-r^k_J - \bar{x}^k_J}{\delta _k}= \frac{o(\delta _k) - \bar{x}^k_J}{\delta _k}. \end{aligned}$$

(43)

Thus $\bar{u}_J=- \mathop {\lim }\limits _{k \rightarrow \infty }\frac{\bar{x}^k_J}{\delta _k}$. Suppose $\bar{u}_J \ne 0$. Then $\bar{x}^k_J\ne 0$ for all $k$ sufficiently large, i.e. $B_J\bar{x}^k_J \ne 0$, so $\bar{x}^k\in \bar{X}$ and the optimality condition for (11), (12) and $\nabla f_2(\bar{x}^k)= \bar{g}$ imply

$$\begin{aligned} \bar{g}_J + w_J\frac{B^T_JB_J\bar{x}^k_J}{\Vert B_J\bar{x}^k_J\Vert } = 0, \end{aligned}$$

(44)

Since $\bar{u}_J=- \mathop {\lim }\limits _{k \rightarrow \infty }\frac{\bar{x}^k_J}{\delta _k}$, we have

$$\begin{aligned} \mathop {\lim }\limits _{k \rightarrow \infty }\frac{B^T_JB_J\bar{x}^k_J}{\Vert B_J\bar{x}^k_J\Vert }=\mathop {\lim } \limits _{k \rightarrow \infty } \frac{B^T_JB_J\bar{x}^k_J}{\delta _k}\cdot \frac{\delta _k}{\Vert B_J\bar{x}^k_J\Vert }=-B^T_JB_J\bar{u}_J\cdot \mathop {\lim }\limits _{k \rightarrow \infty }\frac{1}{\left\Vert B_J\frac{\bar{x}^k_J}{\delta _k}\right\Vert }=-\frac{B^T_JB_J\bar{u}_J}{\Vert B_J\bar{u}_J\Vert }, \end{aligned}$$

so taking the limit in (44) gives

$$\begin{aligned} \bar{g}_J - w_J\frac{B^T_JB_J\bar{u}_J}{\Vert B_J\bar{u}_J\Vert } = 0. \end{aligned}$$

(45)

Case (b). Since $\bar{x}^k\in \bar{X}$ and $\bar{x}^k_J \ne 0$, the optimality condition for (11), (12) and $\nabla f_2(\bar{x}^k)= \bar{g}$ imply that (44) holds for all $k$. Then Lemma 1 implies

$$\begin{aligned} -r^k_J&= [\theta I + w_{J} B^{T}_{J}B_{J}]^{-1}(\theta g^k_{J}+w_{J} B^{T}_{J}B_{J}x^k_{J}) \nonumber \\ -[\theta I + w_{J} B^{T}_{J}B_{J}]r^k_J&= \theta g^k_{J}+w_{J} B^{T}_{J}B_{J}x^k_{J}\nonumber \\&= \theta (\bar{g}_J+ o(\delta _k))+ w_{J} B^{T}_{J}B_{J}\bar{x}^k_{J}+ \delta _kw_{J} B^{T}_{J}B_{J}u^k_{J} \nonumber \\&= \theta \bar{g}_J+ w_{J} B^{T}_{J}B_{J}\bar{x}^k_{J}+ \delta _kw_{J}B^{T}_{J}B_{J}u^k_{J}+\theta o(\delta _k) \nonumber \\&= \theta \bar{g}_J-\Vert B_J\bar{x}^k_J\Vert \bar{g}_J+ \delta _kw_{J}B^{T}_{J}B_{J}u^k_{J}+\theta o(\delta _k)\nonumber \\&= \Vert B_J(x^k_J+r^k_J)\Vert \bar{g}_J\!-\!\Vert B_J\bar{x}^k_J\Vert \bar{g}_J\!+\! \delta _kw_{J}B^{T}_{J}B_{J}u^k_{J}\!+\! \theta o(\delta _k),\qquad \end{aligned}$$

(46)

where the third equality uses (41) and (42); the fifth equality uses (44); the sixth equality uses the definition of $\theta $. Dividing both sides by $\delta _{k}$ and using (37) yields in the limit

$$\begin{aligned} 0=\bar{\theta }\cdot \bar{g}_J+w_{J} B^{T}_{J}B_{J}\bar{u}_{J}, \end{aligned}$$

(47)

where $\bar{\theta }:=\mathop {\lim }\limits _{k \rightarrow \infty }\frac{\Vert B_J(x^k_J+r^k_J)\Vert -\Vert B_J\bar{x}^k_J\Vert }{\delta _k}$. Thus $B^T_JB_J\bar{u}_J$ is a nonzero multiple of $\bar{g}_J$.

Case (c). In this case, it follows from $\{\bar{x}^k\}\rightarrow \bar{x}$ that $\bar{x}_J= 0$. Since $\Vert B_J^{-T}(x^k_J - g^k_J)\Vert >w_{J}$ for all $k$, this implies $\Vert B_J^{-T}\bar{g}_J\Vert \ge w_{J}$. Since $\bar{x}\in \bar{X}$, the optimality condition $0\in \bar{g}_{J}+ w_{J}B^T_J\partial \Vert 0\Vert $ implies $\Vert B_J^{-T}\bar{g}_J\Vert \le w_{J}$. Hence $\Vert B_J^{-T}\bar{g}_J\Vert =w_{J}$. Then, similarly to (46) and using $\bar{x}^k_J= 0$,

$$\begin{aligned} -r^k_J&= [\theta I + w_{J} B^{T}_{J}B_{J}]^{-1}(\theta g^k_{J}+w_{J} B^{T}_{J}B_{J}x^k_{J})\nonumber \\ -[\theta I + w_{J} B^{T}_{J}B_{J}]r^k_J&= \theta g^k_{J}+w_{J} B^{T}_{J}B_{J}x^k_{J}\nonumber \\&= \theta \bar{g}_J+\delta _kw_{J}B^{T}_{J}B_{J}u^k_{J}+\theta o(\delta _k)\nonumber \\&= \Vert B_J(x^k_J+r^k_J)\Vert \bar{g}_J+\delta _kw_{J}B^{T}_{J}B_{J} u^k_{J}+\theta o(\delta _k). \end{aligned}$$

(48)

Dividing both sides by $\delta _{k}$, one has

$$\begin{aligned} -[\theta I + w_{J} B^{T}_{J}B_{J}]\frac{r^k_J}{\delta _{k}}= \left\Vert B_J\frac{(x^k_J+r^k_J)}{\delta _{k}}\right\Vert \bar{g}_J+w_{J}B^{T}_{J} B_{J}u^k_{J}+\theta \frac{o(\delta _k)}{\delta _{k}}. \end{aligned}$$

Using (37), (41) and $\bar{x}^k_J= 0$, this yields in the limit

$$\begin{aligned} 0= \Vert B_J\bar{u}_{J}\Vert \bar{g}_J + w_{J}B^{T}_{J}B_{J}\bar{u}_{J}. \end{aligned}$$

(49)

Suppose $\bar{u}_J\ne 0$. Then $\Vert B_J\bar{u}_J\Vert >0$, so (49) implies that $B^T_JB_J\bar{u}_J$ is a negative multiple of $\bar{g}_J$.

Since $u^k=\frac{x^k-\bar{x}^k}{\delta _{k}}$ and $\{u^k\}\rightarrow \bar{u}\ne 0$, we have $\langle x^k-\bar{x}^k,\bar{u}\rangle > 0$ for all $k$ sufficiently large. Fix any such $k$ and let

$$\begin{aligned} \hat{x}:= \bar{x}^k + \epsilon \bar{u} \end{aligned}$$

with $\epsilon > 0$. Since $A\bar{u}= 0$, we have $\nabla f_2(\hat{x})=A^T\nabla h(A\hat{x})=A^T\nabla h(A\bar{x}^k)= \nabla f_2(\bar{x}^k)=\bar{g}$. We show below that, for $\epsilon > 0$ sufficiently small, $\hat{x}$ satisfies

$$\begin{aligned} 0\in \bar{g}_J+w_J\partial \Vert B_J\hat{x}_J\Vert \end{aligned}$$

(50)

for all $J\in \mathcal{J }$, and hence $\hat{x}\in \bar{X}$. Then $\langle x^k-\bar{x}^k,\bar{u}\rangle > 0$ and $\Vert B\bar{u}\Vert = 1$ yield

$$\begin{aligned} \Vert x^k-\hat{x}\Vert ^{2}=\Vert x^k-\bar{x}^k-\epsilon \bar{u}\Vert ^{2}= \Vert x^k-\bar{x}^k\Vert ^{2}-2\epsilon \langle x^k-\bar{x}^k,\bar{u}\rangle +\epsilon ^{2}\Vert \bar{u}\Vert ^{2}<\Vert x^k-\bar{x}^k\Vert ^{2} \end{aligned}$$

for all $\epsilon > 0$ sufficiently small, which contradicts $\bar{x}^k$ being the point in $\bar{X}$ nearest to $x^k$ and thus proves (40). It remains to verify (50). For each $J\in \mathcal{J }$, if $\bar{u}_J= 0$, then $\hat{x}_J= \bar{x}^k_J$ and (50) holds automatically (since $\bar{x}^k\in \bar{X}$). Suppose that $\bar{u}_J\ne 0$. We prove (50) below by considering the three aforementioned cases (a), (b) and (c).

Case (a). Since $\bar{u}_J\ne 0$, (45) shows that $B^T_JB_J\bar{u}_J$ is a positive multiple of $\bar{g}_{J}$. Also, by (44), $B^T_JB_J\bar{x}^k_J$ is a negative multiple of $\bar{g}_J$. Hence $B^T_JB_J\hat{x}_J=B^T_JB_J\bar{x}^k_J+\epsilon B^T_JB_J\bar{u}_J$ is a negative multiple of $\bar{g}_J$ for all $\epsilon > 0$ sufficiently small. Writing $\bar{g}_J+ aB^T_JB_J\hat{x}_J =0$ with $a>0$ and combining this with (44), that is, $\bar{g}_J + w_J\frac{B^T_JB_J\bar{x}^k_J}{\Vert B_J\bar{x}^k_J\Vert } = 0$, we obtain $aB_J\hat{x}_J =w_J\frac{B_J\bar{x}^k_J}{\Vert B_J\bar{x}^k_J\Vert }$ (since $B_J$ is nonsingular), and hence $\frac{B_J\hat{x}_J}{\Vert B_J\hat{x}_J\Vert }= \frac{B_J\bar{x}^k_J}{\Vert B_J\bar{x}^k_J\Vert }$. Thus $\hat{x}_J$ satisfies

$$\begin{aligned} \bar{g}_J +w_J\frac{B^T_JB_J\hat{x}_J}{\Vert B_J\hat{x}_J\Vert } = 0. \end{aligned}$$

(51)

Case (b). Since (44) holds, $B^T_JB_J\bar{x}^k_J$ is a negative multiple of $\bar{g}_J$. Also, $B^T_JB_J\bar{u}_J$ is a nonzero multiple of $\bar{g}_J$. An argument similar to that in case (a) shows that $\hat{x}_J$ satisfies (50) for all $\epsilon > 0$ sufficiently small.

Case (c). We have $\bar{x}^k_J= 0$, and $B^T_JB_J\bar{u}_J$ is a negative multiple of $\bar{g}_J$. Hence $B^T_JB_J\hat{x}_J=\epsilon B^T_JB_J\bar{u}_J$ is a negative multiple of $\bar{g}_J$ for all $\epsilon > 0$, so $\hat{x}_J$ satisfies (50).

Obviously,

$$\begin{aligned} r^k\in \arg \min \limits _{d}\, f_1(x^k+d)+\frac{1}{2}\Vert d+g^k\Vert ^2 \end{aligned}$$

is equivalent (by the respective optimality conditions) to

$$\begin{aligned} r^k\in \arg \min \limits _{d}\langle g^k+r^k,d\rangle +f_1(x^k+d). \end{aligned}$$
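As a brief check of this equivalence (a sketch, with $\partial f_1$ denoting the convex subdifferential): the first objective is strongly convex in $d$, so $d=r^k$ is its minimizer if and only if

$$\begin{aligned} 0\in r^k+g^k+\partial f_1(x^k+r^k). \end{aligned}$$

In the second problem the vector $g^k+r^k$ is held fixed, so the objective is convex in $d$, and $d=r^k$ is a minimizer if and only if

$$\begin{aligned} 0\in (g^k+r^k)+\partial f_1(x^k+r^k), \end{aligned}$$

which is the same inclusion. In particular, the second objective at $d=r^k$ is no larger than at any other $d$, for instance at $d=\bar{x}^k-x^k$.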

Hence

$$\begin{aligned} \langle g^k+r^k,r^k\rangle +f_1(x^k+r^k)\le \langle g^k+r^k,\bar{x}^k-x^k\rangle +f_1(\bar{x}^k). \end{aligned}$$

Since $\bar{x}^k\in \bar{X}$ and $\nabla f_2(\bar{x}^k)=\bar{g}$, we have similarly that

$$\begin{aligned} \langle \bar{g}+0,0\rangle +f_1(\bar{x}^k+0)\le \langle \bar{g},x^k+r^k-\bar{x}^k\rangle +f_1(\bar{x}^k+x^k+r^k-\bar{x}^k), \end{aligned}$$

i.e.

$$\begin{aligned} f_1(\bar{x}^k)\le \langle \bar{g},x^k+r^k-\bar{x}^k\rangle +f_1(x^k+r^k). \end{aligned}$$

Adding the above two inequalities and simplifying yields

$$\begin{aligned} \langle g^k-\bar{g},x^k-\bar{x}^k\rangle + \Vert r^k\Vert ^{2}\le \langle \bar{g}-g^k,r^k\rangle +\langle r^k,\bar{x}^k-x^k\rangle . \end{aligned}$$

Since $Ax^k$ and $A\bar{x}^k=\bar{y}$ are in $Y$, we also have from (38), (39) and (40) that

$$\begin{aligned} \langle g^k-\bar{g},x^k-\bar{x}^k\rangle&= \langle \nabla h(Ax^k)-\nabla h(\bar{y}),Ax^k-\bar{y}\rangle \nonumber \\&\ge \mu \Vert Ax^k-\bar{y}\Vert ^{2}\nonumber \\&\ge \frac{\mu }{\kappa ^{2}}\Vert B(x^k-\bar{x}^k)\Vert ^{2}. \end{aligned}$$

(52)

Combining these two inequalities and using (39) yields

$$\begin{aligned} \frac{\mu }{\kappa ^{2}}\Vert B(x^k-\bar{x}^k)\Vert ^{2}+\Vert r^k\Vert ^{2}&\le \langle \nabla h(\bar{y})-\nabla h(Ax^k),Ar^k\rangle +\langle r^k,\bar{x}^k-x^k\rangle \nonumber \\&\le L\Vert A\Vert ^{2}\Vert x^k-\bar{x}^k\Vert \Vert r^k\Vert +\Vert x^k-\bar{x}^k\Vert \Vert r^k\Vert . \end{aligned}$$

(53)

Thus

$$\begin{aligned} \frac{\mu }{\kappa ^{2}}\Vert B(x^k-\bar{x}^k)\Vert ^{2}&\le (L\Vert A\Vert ^{2}+1)\Vert x^k-\bar{x}^k\Vert \Vert r^k\Vert \nonumber \\&\le (L\Vert A\Vert ^{2}+1)\Vert B(x^k-\bar{x}^k)\Vert \Vert B^{-1}\Vert \Vert r^k\Vert \nonumber \\ \frac{\mu }{\kappa ^{2}}\Vert B(x^k-\bar{x}^k)\Vert&\le (L\Vert A\Vert ^{2}+1)\Vert B^{-1}\Vert \Vert r^k\Vert , \,\,\forall k. \end{aligned}$$

(54)

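Dividing both sides by $\delta _k=\Vert B(x^k-\bar{x}^k)\Vert >0$ gives $\frac{\Vert r^k\Vert }{\delta _{k}}\ge \frac{\mu }{\kappa ^{2}(L\Vert A\Vert ^{2}+1)\Vert B^{-1}\Vert }>0$ for all $k$, which contradicts $\left\{ \frac{r^k}{\delta _{k}}\right\} \rightarrow 0$ in (37).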
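The error bound just established can be probed numerically in a simple special case. The sketch below is an illustration under strong simplifying assumptions that are not in the paper: $h(y)=\frac{1}{2}\Vert y-b\Vert ^{2}$, every $B_J=I$, and $A$ with full column rank, so that the minimizer $x^{*}$ is unique and distances to $\bar{X}$ are distances to $x^{*}$. In that setting the chain (52) to (54) should give $\Vert x-x^{*}\Vert \le \frac{L+1}{\mu }\Vert r(x)\Vert $, with $\mu $ and $L$ the smallest and largest eigenvalues of $A^{T}A$. All function names are hypothetical.

```python
import numpy as np

def prox_group(v, groups, w, t=1.0):
    """prox of t * sum_J w_J * ||v_J||_2 (assumes B_J = I for every group)."""
    out = np.zeros_like(v)
    for J, w_J in zip(groups, w):
        n = np.linalg.norm(v[J])
        if n > t * w_J:
            out[J] = (1.0 - t * w_J / n) * v[J]
    return out

def residual(x, A, b, groups, w):
    """r(x) = prox_{f1}(x - grad f2(x)) - x, with f2(x) = 0.5*||Ax - b||^2."""
    g = A.T @ (A @ x - b)
    return prox_group(x - g, groups, w) - x

def solve(A, b, groups, w, iters=5000):
    """Plain proximal gradient iteration with constant step 1/L."""
    L = np.linalg.eigvalsh(A.T @ A).max()
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = prox_group(x - grad / L, groups, w, t=1.0 / L)
    return x

def error_bound_ratios(A, b, groups, w, n_trials=200, seed=0):
    """Return x* and the ratios ||x - x*|| / ||r(x)|| at random points x."""
    rng = np.random.default_rng(seed)
    x_star = solve(A, b, groups, w)
    ratios = []
    for _ in range(n_trials):
        x = x_star + rng.normal(size=x_star.shape)
        r = residual(x, A, b, groups, w)
        ratios.append(np.linalg.norm(x - x_star) / np.linalg.norm(r))
    return x_star, np.array(ratios)
```

For a random well-conditioned $A$, the largest observed ratio can be compared with $(L+1)/\mu $; this only samples points and does not replace the proof.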

Appendix B: Proof of Lemma 3

First, we recall the following resolvent identity (see Lemma 3 in [3] or Lemma 2 in [4]).

For any $\beta ,\gamma >0$ and $x\in \mathfrak R ^{n}$, since the operator $\partial \varphi $ is maximal monotone, the following resolvent identity holds:

$$\begin{aligned} \mathrm{prox}_{\beta \varphi }(x)=\mathrm{prox}_{\gamma \varphi }\left(\frac{\gamma }{\beta }x+\left(1-\frac{\gamma }{\beta }\right)\mathrm{prox}_{\beta \varphi }(x)\right). \end{aligned}$$

(55)
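The identity (55) can be sanity-checked numerically for a concrete $\varphi $. The sketch below assumes $\varphi =\Vert \cdot \Vert _2$, whose proximal map is block soft-thresholding; this choice and the helper names are illustrative only.

```python
import numpy as np

def prox_l2(v, t):
    """prox of t * ||.||_2, i.e. block soft-thresholding with threshold t."""
    n = np.linalg.norm(v)
    if n <= t:
        return np.zeros_like(v)
    return (1.0 - t / n) * v

def resolvent_gap(x, beta, gamma):
    """Norm of the difference between the two sides of (55)."""
    p = prox_l2(x, beta)                               # left-hand side
    arg = (gamma / beta) * x + (1.0 - gamma / beta) * p
    return np.linalg.norm(p - prox_l2(arg, gamma))     # compare with right-hand side
```

Evaluating `resolvent_gap` at random `x` for several pairs `(beta, gamma)` should return values at rounding-error level.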

Using the resolvent identity (55) with $\beta =\alpha $ and $\gamma =1$, together with the fact that the resolvent is nonexpansive, we get

$$\begin{aligned}&\Vert x-\mathrm{prox}_{\varphi }[x+d]\Vert \\&\quad \le \Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert +\Vert \mathrm{prox}_{\alpha \varphi }[x+\alpha d]-\mathrm{prox}_{\varphi }[x+d]\Vert \\&\quad =\Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert \\&\qquad +\left\Vert\mathrm{prox}_{\varphi }\left[\frac{1}{\alpha }(x+\alpha d)+\left(1-\frac{1}{\alpha }\right)\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\right]-\mathrm{prox}_{\varphi }[x+d]\right\Vert\\&\quad \le \Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert \\&\qquad +\left\Vert\left[\frac{1}{\alpha }(x+\alpha d)+\left(1-\frac{1}{\alpha }\right)\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\right]-[x+d]\right\Vert\\&\quad =\Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert +\left|1-\frac{1}{\alpha }\right| \Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert \\&\quad =\left(1+|1-\frac{1}{\alpha }|\right)\Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert . \end{aligned}$$

So, we have

$$\begin{aligned} \Vert x-\mathrm{prox}_{\varphi }[x+d]\Vert \le (1+|1-\frac{1}{\alpha }|) \Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert . \end{aligned}$$

(56)

Removing the absolute value, we obtain the following: when $0<\alpha \le 1$, $1+|1-\frac{1}{\alpha }|=\frac{1}{\alpha }$; and when $\alpha >1$, $1+|1-\frac{1}{\alpha }|=2-\frac{1}{\alpha }<2$.

Thus, we have

$$\begin{aligned} \Vert x-\mathrm{prox}_{\varphi }[x+d]\Vert \le \frac{1}{\tilde{\alpha }}\Vert x-\mathrm{prox}_{\alpha \varphi }[x+\alpha d]\Vert , \end{aligned}$$

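where $\tilde{\alpha }=\min \{\alpha , \frac{1}{2}\}$, for all $\alpha >0$. Indeed, $\frac{1}{\tilde{\alpha }}=\frac{1}{\alpha }$ when $\alpha \le \frac{1}{2}$, and $\frac{1}{\tilde{\alpha }}=2\ge \max \{\frac{1}{\alpha }, 2-\frac{1}{\alpha }\}$ when $\alpha >\frac{1}{2}$.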

That is the conclusion of Lemma 3.
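A quick numerical illustration of this inequality, again assuming $\varphi =\Vert \cdot \Vert _2$ purely for concreteness (the lemma itself does not depend on this choice, and the helper names are hypothetical):

```python
import numpy as np

def prox_l2(v, t):
    """prox of t * ||.||_2, i.e. block soft-thresholding with threshold t."""
    n = np.linalg.norm(v)
    if n <= t:
        return np.zeros_like(v)
    return (1.0 - t / n) * v

def lemma3_sides(x, d, alpha):
    """Return (lhs, rhs) of the Lemma 3 inequality for phi = ||.||_2."""
    lhs = np.linalg.norm(x - prox_l2(x + d, 1.0))
    alpha_tilde = min(alpha, 0.5)
    rhs = np.linalg.norm(x - prox_l2(x + alpha * d, alpha)) / alpha_tilde
    return lhs, rhs
```

Sampling random `x`, `d` and a range of `alpha` values on both sides of 1 gives a direct check that `lhs <= rhs` in each instance.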

About this article

Cite this article

Zhang, H., Wei, J., Li, M. et al. On proximal gradient method for the convex problems regularized with the group reproducing kernel norm. J Glob Optim 58, 169–188 (2014). https://doi.org/10.1007/s10898-013-0034-5

Received: 12 March 2012
Accepted: 11 January 2013
Published: 22 January 2013
Issue Date: January 2014
DOI: https://doi.org/10.1007/s10898-013-0034-5
