Proximal alternating penalty algorithms for nonsmooth constrained convex optimization

Abstract

We develop two new proximal alternating penalty algorithms to solve a wide class of constrained convex optimization problems. Our approach mainly relies on a novel combination of the classical quadratic penalty, alternating minimization, Nesterov’s acceleration, and an adaptive strategy for parameters. The first algorithm is designed to solve generic and possibly nonsmooth constrained convex problems without requiring any Lipschitz gradient continuity or strong convexity, while achieving the best-known \(\mathcal {O}\left( \frac{1}{k}\right) \)-convergence rate in a non-ergodic sense, where k is the iteration counter. The second algorithm is designed to solve problems that are not strongly convex but semi-strongly convex. This algorithm achieves the best-known \(\mathcal {O}\left( \frac{1}{k^2}\right) \)-convergence rate on the primal constrained problem. Such a rate is obtained in two cases: (1) averaging only the iterate sequence of the strongly convex term, or (2) using two proximal operators of this term without averaging. In both algorithms, we allow one to linearize the second subproblem so that only the proximal operator of the corresponding objective term is needed. We then customize our methods to solve different convex problems, which leads to new variants. As a byproduct, these variants preserve the same convergence guarantees as our main algorithms. We verify our theoretical development via different numerical examples and compare our methods with some existing state-of-the-art algorithms.

Notes

  1. A recent work [1] showed an \({o}\left( \frac{1}{k}\right) \) or \({o}\left( \frac{1}{k^2}\right) \) rate, depending on the problem structure.

References

  1. Attouch, H., Chbani, Z., Riahi, H.: Rate of convergence of the Nesterov accelerated gradient method in the subcritical case \(\alpha \le 3\). ESAIM: COCV (2017). https://doi.org/10.1051/cocv/2017083

  2. Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16(3), 697–725 (2006)

  3. Bauschke, H.H., Combettes, P.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer, Berlin (2017)

  4. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  5. Becker, S., Candès, E.J., Grant, M.: Templates for convex cone problems with applications to sparse signal recovery. Math. Program. Comput. 3(3), 165–218 (2011)

  6. Belloni, A., Chernozhukov, V., Wang, L.: Square-root LASSO: pivotal recovery of sparse signals via conic programming. Biometrika 98(4), 791–806 (2011)

  7. Ben-Tal, A., Nemirovski, A.: Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, Volume 3 of MPS/SIAM Series on Optimization. SIAM, Philadelphia (2001)

  8. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Nashua (1999)

  9. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)

  10. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  11. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)

  12. Chambolle, A., Pock, T.: On the ergodic convergence rates of a first-order primal–dual algorithm. Math. Program. 159(1–2), 253–287 (2016)

  13. Chen, G., Teboulle, M.: A proximal-based decomposition method for convex minimization problems. Math. Program. 64, 81–101 (1994)

  14. Condat, L.: A primal–dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J. Optim. Theory Appl. 158, 460–479 (2013)

  15. Davis, D.: Convergence rate analysis of primal–dual splitting schemes. SIAM J. Optim. 25(3), 1912–1943 (2015)

  16. Davis, D.: Convergence rate analysis of the forward-Douglas–Rachford splitting scheme. SIAM J. Optim. 25(3), 1760–1786 (2015)

  17. Davis, D., Yin, W.: Faster convergence rates of relaxed Peaceman–Rachford and ADMM under regularity assumptions. Math. Oper. Res. 42, 783–805 (2014)

  18. Du, Y., Lin, X., Ruszczyński, A.: A selective linearization method for multiblock convex optimization. SIAM J. Optim. 27(2), 1102–1117 (2017)

  19. Eckstein, J., Bertsekas, D.: On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55, 293–318 (1992)

  20. Esser, E., Zhang, X., Chan, T.: A general framework for a class of first order primal-dual algorithms for TV-minimization. SIAM J. Imaging Sci. 3(4), 1015–1046 (2010)

  21. Fletcher, R.: Practical Methods of Optimization, 2nd edn. Wiley, Chichester (1987)

  22. Goldfarb, D., Ma, S., Scheinberg, K.: Fast alternating linearization methods of minimization of the sum of two convex functions. Math. Program. Ser. A 141(1), 349–382 (2012)

  23. Goldstein, T., O’Donoghue, B., Setzer, S.: Fast alternating direction optimization methods. SIAM J. Imaging Sci. 7(3), 1588–1623 (2012)

  24. Grant, M., Boyd, S., Ye, Y.: Disciplined convex programming. In: Liberti, L., Maculan, N. (eds.) Global Optimization: From Theory to Implementation, Nonconvex Optimization and Its Applications, pp. 155–210. Springer, Berlin (2006)

  25. He, B.S., Yuan, X.M.: On the \({O}(1/n)\) convergence rate of the Douglas–Rachford alternating direction method. SIAM J. Numer. Anal. 50, 700–709 (2012)

  26. Jaggi, M.: Revisiting Frank–Wolfe: projection-free sparse convex optimization. JMLR W&CP 28(1), 427–435 (2013)

  27. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems (NIPS), pp. 315–323. NIPS Foundation Inc., Lake Tahoe (2013)

  28. Kiwiel, K.C., Rosa, C.H., Ruszczyński, A.: Proximal decomposition via alternating linearization. SIAM J. Optim. 9(3), 668–689 (1999)

  29. Lan, G., Monteiro, R.D.C.: Iteration complexity of first-order penalty methods for convex programming. Math. Program. 138(1), 115–139 (2013)

  30. Necoara, I., Nesterov, Yu., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. (2018). https://doi.org/10.1007/s10107-018-1232-1

  31. Necoara, I., Patrascu, A., Glineur, F.: Complexity of first-order inexact Lagrangian and penalty methods for conic convex programming. Optim. Method Softw. (2017). https://doi.org/10.1080/10556788.2017.1380642

  32. Necoara, I., Suykens, J.A.K.: Interior-point Lagrangian decomposition method for separable convex optimization. J. Optim. Theory Appl. 143(3), 567–588 (2009)

  33. Nemirovskii, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, New York (1983)

  34. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \({\cal{O}} (1/k^2)\). Doklady AN SSSR 269, 543–547 (1983). Translated as Soviet Math. Dokl

  35. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, Volume 87 of Applied Optimization. Kluwer Academic Publishers, Dordrecht (2004)

  36. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  37. Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Program. 140(1), 125–161 (2013)

  38. Nguyen, V.Q., Fercoq, O., Cevher, V.: Smoothing technique for nonsmooth composite minimization with linear operator. ArXiv preprint (arXiv:1706.05837) (2017)

  39. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2nd edn. Springer, Berlin (2006)

  40. O’Donoghue, B., Candes, E.: Adaptive Restart for Accelerated Gradient Schemes. Found. Comput. Math. 15, 715–732 (2015)

  41. Ouyang, Y., Chen, Y., Lan, G., Pasiliao, E.J.R.: An accelerated linearized alternating direction method of multipliers. SIAM J. Imaging Sci. 8(1), 644–681 (2015)

  42. Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010)

  43. Rockafellar, R.T.: Convex Analysis, Volume 28 of Princeton Mathematics Series. Princeton University Press, Princeton (1970)

  44. Shefi, R., Teboulle, M.: On the rate of convergence of the proximal alternating linearized minimization algorithm for convex problems. EURO J. Comput. Optim. 4(1), 27–46 (2016)

  45. Sra, S., Nowozin, S., Wright, S.J.: Optimization for Machine Learning. MIT Press, Cambridge (2012)

  46. Tran-Dinh, Q., Alacaoglu, A., Fercoq, O., Cevher, V.: Self-adaptive double-loop primal–dual algorithm for nonsmooth convex optimization. ArXiv preprint (arXiv:1808.04648), pp. 1–38 (2018)

  47. Tran-Dinh, Q., Cevher, V.: A primal–dual algorithmic framework for constrained convex minimization. ArXiv preprint (arXiv:1406.5403), Technical Report, pp. 1–54 (2014)

  48. Tran-Dinh, Q., Cevher, V.: Constrained convex minimization via model-based excessive gap. In: Proceedings of the Neural Information Processing Systems (NIPS), vol. 27, pp. 721–729, Montreal, Canada, December (2014)

  49. Tran-Dinh, Q., Fercoq, O., Cevher, V.: A smooth primal–dual optimization framework for nonsmooth composite convex minimization. SIAM J. Optim. 28, 1–35 (2018)

  50. Tran-Dinh, Q.: Construction and iteration-complexity of primal sequences in alternating minimization algorithms. ArXiv preprint (arXiv:1511.03305) (2015)

  51. Tran-Dinh, Q.: Adaptive smoothing algorithms for nonsmooth composite convex minimization. Comput. Optim. Appl. 66(3), 425–451 (2016)

  52. Tran-Dinh, Q.: Non-Ergodic alternating proximal augmented Lagrangian algorithms with optimal rates. In: The 32nd Conference on Neural Information Processing Systems (NIPS), pp. 1–9, NIPS Foundation Inc., Montreal, Canada, December (2018)

  53. Tseng, P.: Applications of splitting algorithm to decomposition in convex programming and variational inequalities. SIAM J. Control Optim. 29, 119–138 (1991)

  54. Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM J. Optim. (2008)

  55. Vu, C.B.: A splitting algorithm for dual monotone inclusions involving co-coercive operators. Adv. Comput. Math. 38(3), 667–681 (2013)

  56. Woodworth, B.E., Srebro, N.: Tight complexity bounds for optimizing composite objectives. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.), Advances in Neural Information Processing Systems (NIPS), pp. 3639–3647. NIPS Foundation Inc., Barcelona, Spain (2016)

  57. Xu, Y.: Accelerated first-order primal–dual proximal methods for linearly constrained composite convex programming. SIAM J. Optim. 27(3), 1459–1484 (2017)

  58. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67(2), 301–320 (2005)

Acknowledgements

This work is partly supported by NSF grant DMS-1619884 (USA) and by Nafosted grant 101.01-2017.315 (Vietnam).

Author information

Corresponding author

Correspondence to Quoc Tran-Dinh.

A Appendix: The proof of technical results in the main text

This appendix provides the full proofs of the technical results in the main text.

A.1 Properties of the distance function \(\mathrm {dist}_{\mathcal {K}}(\cdot )\)

We investigate some necessary properties of \(\psi \) defined by (7) to analyze the convergence of Algorithms 1 and 2. We first consider the following distance function:

$$\begin{aligned} \varphi (u) := \tfrac{1}{2}\mathrm {dist}_{\mathcal {K}}\big (u\big )^2 = \displaystyle \min _{r\in \mathcal {K}}\tfrac{1}{2}\Vert r- u\Vert ^2 = \tfrac{1}{2}\Vert r^{*}(u) - u\Vert ^2 = \tfrac{1}{2}\Vert \mathrm {proj}_{\mathcal {K}}\left( u\right) - u\Vert ^2, \end{aligned}$$
(45)

where \(r^{*}(u) := \mathrm {proj}_{\mathcal {K}}\left( u\right) \) is the projection of \(u\) onto \(\mathcal {K}\). Clearly, (45) becomes

$$\begin{aligned} \varphi (u) = \displaystyle \max _{\lambda \in \mathbb {R}^n}\displaystyle \min _{r\in \mathcal {K}}\left\{ \langle u- r, \lambda \rangle - \tfrac{1}{2}\left\| \lambda \right\| ^2\right\} = \displaystyle \max _{\lambda \in \mathbb {R}^n}\left\{ \langle u, \lambda \rangle - s_{\mathcal {K}}(\lambda ) - \tfrac{1}{2}\Vert \lambda \Vert ^2\right\} , \end{aligned}$$
(46)

where \(s_{\mathcal {K}}(\lambda ) := \sup _{r\in \mathcal {K}}\langle \lambda , r\rangle \) is the support function of \(\mathcal {K}\).

The function \(\varphi \) is convex and differentiable. Its gradient is given by

$$\begin{aligned} \nabla {\varphi }(u) = u- \mathrm {proj}_{\mathcal {K}}\left( u\right) = \nu ^{-1}\mathrm {proj}_{\mathcal {K}^{\circ }}\left( \nu u\right) , \end{aligned}$$
(47)

where \(\mathcal {K}^{\circ } := \left\{ v\in \mathbb {R}^n \mid \langle u, v\rangle \le 1, ~u\in \mathcal {K}\right\} \) is the polar set of \(\mathcal {K}\), and \(\nu > 0\) solves \(\nu = \langle \mathrm {proj}_{\mathcal {K}^{\circ }}\left( \nu u\right) , \nu u - \mathrm {proj}_{\mathcal {K}^{\circ }}\left( \nu u\right) \rangle \). If \(\mathcal {K}\) is a cone, then \(\nabla {\varphi }(u) = \mathrm {proj}_{\mathcal {K}^{\circ }}\left( u\right) = \mathrm {proj}_{-\mathcal {K}^{*}}\left( u\right) \), where \(\mathcal {K}^{*} := \left\{ v\in \mathbb {R}^n \mid \langle u, v\rangle \ge 0, ~u\in \mathcal {K}\right\} \) is the dual cone of \(\mathcal {K}\) [3].

Using the firm nonexpansiveness of \(\mathrm {proj}_{\mathcal {K}}(\cdot )\), it is easy to prove that \(\nabla {\varphi }(\cdot )\) is Lipschitz continuous with the Lipschitz constant \(L_{\varphi } = 1\). Hence, for any \(u, v\in \mathbb {R}^n\), we have (see [35]):

$$\begin{aligned}&\varphi (u) + \langle \nabla {\varphi }(u), v- u\rangle + \tfrac{1}{2}\Vert \nabla {\varphi }(v) - \nabla {\varphi }(u)\Vert ^2 \le \varphi (v), \nonumber \\&\varphi (v) \le \varphi (u) + \langle \nabla {\varphi }(u), v- u\rangle + \tfrac{1}{2}\Vert v- u\Vert ^2. \end{aligned}$$
(48)
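
As a concrete illustration (this sketch is not part of the original derivation), the following Python snippet evaluates \(\varphi \), its gradient (47), and the descent bound in (48) numerically. The choice \(\mathcal {K} = \mathbb {R}^n_{+}\), for which \(\mathrm {proj}_{\mathcal {K}}\) is a componentwise truncation at zero, is an illustrative assumption.

```python
import numpy as np

# Illustrative assumption: K is the nonnegative orthant, so proj_K(u) = max(u, 0) componentwise.
def proj_K(u):
    return np.maximum(u, 0.0)

def phi(u):
    # phi(u) = 0.5 * dist_K(u)^2, as in (45)
    return 0.5 * np.linalg.norm(proj_K(u) - u) ** 2

def grad_phi(u):
    # grad phi(u) = u - proj_K(u), as in (47)
    return u - proj_K(u)

# Numerical check of the second inequality in (48) (descent bound with L_phi = 1).
rng = np.random.default_rng(0)
u, v = rng.standard_normal(5), rng.standard_normal(5)
assert phi(v) <= phi(u) + grad_phi(u) @ (v - u) + 0.5 * np.linalg.norm(v - u) ** 2 + 1e-12
```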

Let us recall \(\psi \) defined by (7) as

$$\begin{aligned} \psi (x, y) := \varphi (Ax+ By- c) = \tfrac{1}{2}\mathrm {dist}_{\mathcal {K}}\big (Ax+ By- c\big )^2. \end{aligned}$$
(49)

Then, \(\psi \) is also convex and differentiable, and its gradient is given by

$$\begin{aligned} \nabla _x\psi (x, y)= & {} A^{\top }\left( Ax+ By- c- \mathrm {proj}_{\mathcal {K}}\left( Ax+ By- c\right) \right) , \nonumber \\ \nabla _y\psi (x, y)= & {} B^{\top }\left( Ax+ By- c- \mathrm {proj}_{\mathcal {K}}\left( Ax+ By- c\right) \right) . \end{aligned}$$
(50)
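
For completeness, here is a short sketch of (50) in code, again under the illustrative assumption \(\mathcal {K} = \mathbb {R}^n_{+}\) and with randomly generated data \(A\), \(B\), \(c\):

```python
import numpy as np

def grad_psi(x, y, A, B, c, proj_K):
    # Gradients of psi(x, y) = 0.5 * dist_K(A x + B y - c)^2, as in (50).
    u = A @ x + B @ y - c
    r = u - proj_K(u)          # this residual equals grad phi(u) from (47)
    return A.T @ r, B.T @ r

# Illustrative data; proj_K projects onto the nonnegative orthant.
rng = np.random.default_rng(1)
A, B, c = rng.standard_normal((4, 3)), rng.standard_normal((4, 2)), rng.standard_normal(4)
x, y = rng.standard_normal(3), rng.standard_normal(2)
grad_x, grad_y = grad_psi(x, y, A, B, c, lambda u: np.maximum(u, 0.0))
```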

For given \(x^{k+1}\in \mathbb {R}^{p_1}\) and \(\hat{y}^k\in \mathbb {R}^{p_2}\), let us define the following two functions:

$$\begin{aligned} \begin{aligned} \mathcal {Q}_k(y)&:= \psi (x^{k+1}, \hat{y}^k) + \langle \nabla _y\psi (x^{k+1}, \hat{y}^k), y- \hat{y}^k\rangle + \tfrac{\Vert B\Vert ^2}{2}\Vert y- \hat{y}^k\Vert ^2. \\ \ell _k(z)&:= \psi (x^{k+1}, \hat{y}^k) + \langle \nabla _x\psi (x^{k+1}, \hat{y}^k), x- x^{k+1}\rangle + \langle \nabla _y\psi (x^{k+1}, \hat{y}^k), y- \hat{y}^k\rangle . \end{aligned} \end{aligned}$$
(51)

Then, the following lemma provides some properties of \(\ell _k\) and \(\mathcal {Q}_k\).

Lemma 4

Let \(z^{\star } = (x^{\star }, y^{\star }) \in \mathbb {R}^{p}\) be such that \(Ax^{\star } + By^{\star } - c\in \mathcal {K}\). Then, for \(\ell _k\) defined by (51) and \(\psi \) defined by (49), we have

$$\begin{aligned} \ell _k(z^{\star }) \le -\tfrac{1}{2}\Vert \hat{s}^{k+1}\Vert ^2 ~~~\text {and}~~~~\ell _k(z^k) \le \psi (x^k, y^k) - \tfrac{1}{2}\Vert s^k - \hat{s}^{k+1}\Vert ^2, \end{aligned}$$
(52)

where \(\hat{s}^{k+1} := Ax^{k+1} + B\hat{y}^k - c- \mathrm {proj}_{\mathcal {K}}\big (Ax^{k+1} + B\hat{y}^k - c\big )\) and \(s^k := Ax^k + By^k - c- \mathrm {proj}_{\mathcal {K}}\big (Ax^k + By^k - c\big )\). Moreover, we also have

$$\begin{aligned} \psi (x^{k+1}, y) \le \mathcal {Q}_k(y)~~\text {for all}~y\in \mathbb {R}^{p_2}. \end{aligned}$$
(53)

Proof

Since \(Ax^{\star } + By^{\star } - c\in \mathcal {K}\), if we define \(r^{\star } := Ax^{\star } + By^{\star } - c\), then \(r^{\star } \in \mathcal {K}\). Let \(\hat{u}^{k} := Ax^{k+1} + B\hat{y}^k - c\in \mathbb {R}^n\). We can derive

$$\begin{aligned}&\ell _k(z^{\star }) := \psi (x^{k+1}, \hat{y}^k) + \langle \nabla _x\psi (x^{k+1}, \hat{y}^k), x^{\star } - x^{k+1}\rangle + \langle \nabla _y\psi (x^{k+1}, \hat{y}^k), y^{\star } - \hat{y}^k\rangle \\&\quad \overset{(49)}{=} \langle \hat{u}^{k} - \mathrm {proj}_{\mathcal {K}}(\hat{u}^{k}), A(x^{\star } - x^{k+ 1}) + B(y^{\star } - \hat{y}^k)\rangle + \tfrac{1}{2}\Vert \hat{u}^{k} - \mathrm {proj}_{\mathcal {K}}(\hat{u}^{k})\Vert ^2 \\&\quad = \langle \hat{u}^{k} - \mathrm {proj}_{\mathcal {K}}(\hat{u}^{k}), r^{\star } - \mathrm {proj}_{\mathcal {K}}(\hat{u}^{k})\rangle - \tfrac{1}{2}\Vert \hat{u}^{k} - \mathrm {proj}_{\mathcal {K}}(\hat{u}^{k})\Vert ^2 \\&\quad \le -\tfrac{1}{2}\Vert \hat{u}^{k} - \mathrm {proj}_{\mathcal {K}}(\hat{u}^{k})\Vert ^2, \end{aligned}$$

which is the first inequality of (52). Here, we use the well-known property of the projection \(\mathrm {proj}_{\mathcal {K}}\), namely \(\langle \hat{u}^{k} - \mathrm {proj}_{\mathcal {K}}(\hat{u}^{k}), r^{\star } - \mathrm {proj}_{\mathcal {K}}(\hat{u}^{k})\rangle \le 0\) for any \(r^{\star }\in \mathcal {K}\). The second inequality of (52) follows directly from (48) and the definition of \(\psi \) in (49). The bound (53) follows from the Lipschitz continuity of \(\nabla _y\psi (x^{k+1},\cdot )\); see [35]. \(\square \)
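
The key ingredient of this proof is the variational characterization of the projection, \(\langle u - \mathrm {proj}_{\mathcal {K}}(u), r - \mathrm {proj}_{\mathcal {K}}(u)\rangle \le 0\) for all \(r\in \mathcal {K}\). The following minimal sanity check (not part of the original text) verifies this property numerically for the illustrative choice \(\mathcal {K} = \mathbb {R}^n_{+}\):

```python
import numpy as np

proj_K = lambda u: np.maximum(u, 0.0)   # illustrative K: the nonnegative orthant

rng = np.random.default_rng(2)
u_hat = rng.standard_normal(6)
for _ in range(1000):
    r = np.abs(rng.standard_normal(6))  # an arbitrary point of K
    # Projection property used in the first inequality of (52):
    assert (u_hat - proj_K(u_hat)) @ (r - proj_K(u_hat)) <= 1e-12
```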

A.2 Descent property of the alternating scheme in Algorithms 1 and 2

Lemma 5

Let \(\ell _k\) and \(\mathcal {Q}_k\) be defined by (51), and \(\varPhi _{\rho }\) be defined by (7).

  1. (a)

    Let \(z^{k+1} := (x^{k+1}, y^{k+1})\) be generated by Step 3 of Algorithm 1. Then, for any \(z:= (x, y) \in \mathrm {dom}(F)\), we have

    $$\begin{aligned} \varPhi _{\rho _k}(z^{k+1})\le & {} F(z) + \rho _k\ell _k(z) + \gamma _k\langle x^{k+1} - \hat{x}^k, x- \hat{x}^k\rangle - \gamma _k\Vert x^{k+1} - \hat{x}^k\Vert ^2 \nonumber \\&+ \rho _k\Vert B\Vert ^2\langle y^{k + 1} - \hat{y}^k, y-\hat{y}^k\rangle - \frac{\rho _k\left\| B\right\| ^2}{2}\Vert y^{k+ 1} - \hat{y}^k\Vert ^2. \end{aligned}$$
    (54)
  2. (b)

    Alternatively, let \(z^{k+1} := (x^{k+1}, y^{k+1})\) be generated by Step 4 of Algorithm 2, and \(\breve{y}^{k+1} := (1-\tau _k)y^k + \tau _k\tilde{y}^{k+1}\). Then, for any \(z:= (x, y) \in \mathrm {dom}(F)\), we have

    $$\begin{aligned} \breve{\varPhi }_{k+1}:= & {} f(x^{k+1}) + g(\breve{y}^{k+1}) + \rho _k\mathcal {Q}_k(\breve{y}^{k+1}) \nonumber \\\le & {} (1-\tau _k) \big [ F(z^{k}) + \rho _k\ell _k(z^k) \big ] + \tau _k\big [ F(z) + \rho _k\ell _k(z) \big ] \nonumber \\&+ \tfrac{\gamma _0\tau _k^2}{2}\Vert \tilde{x}^k - x\Vert ^2 - \tfrac{\gamma _0\tau _k^2}{2}\Vert \tilde{x}^{k+1} - x\Vert ^2 \nonumber \\&+ \tfrac{\rho _k\tau _k^2\Vert B\Vert ^2}{2}\Vert \tilde{y}^k - y\Vert ^2 - \tfrac{\left( \rho _k\tau _k^2\Vert B\Vert ^2 + \mu _g\tau _k\right) }{2}\Vert \tilde{y}^{k+1} - y\Vert ^2. \end{aligned}$$
    (55)

Proof

(a) Combining the optimality conditions of the two subproblems at Step 3 of Algorithm 1 with the convexity of f and g, we can derive

$$\begin{aligned} \left\{ \begin{array}{lll} f(x^{k+1}) \le f(x) + \langle \rho _k\nabla _x{\psi }(x^{k+1},\hat{y}^k) + \gamma _k(x^{k+1} - \hat{x}^k), x- x^{k+1}\rangle , \\ g(y^{k+1}) \le g(y) + \langle \rho _k\nabla _y{\psi }(x^{k+1},\hat{y}^k) + \rho _k\Vert B\Vert ^2(y^{k+1} - \hat{y}^k), y- y^{k+1}\rangle . \end{array}\right. \end{aligned}$$
(56)

Using (53) with \(y= y^{k+1}\), we have

$$\begin{aligned} \psi (x^{k+1}, y^{k+1}) \le \psi (x^{k+1}, \hat{y}^k) + \langle \nabla _y{\psi }(x^{k+1}, \hat{y}^k), y^{k+1} - \hat{y}^k\rangle + \tfrac{\Vert B\Vert ^2}{2}\Vert y^{k+1} - \hat{y}^k\Vert ^2. \end{aligned}$$

Combining the last estimate and (56), and then using (7), we can derive

$$\begin{aligned}&\varPhi _{\rho }(z^{k+1}) \overset{ (7)}{=} f(x^{k+1}) + g(y^{k+1}) + \rho _k\psi (x^{k+1}, y^{k+1}) \\&\quad \overset{(56)}{\le } f(x) + g(y) + \rho _k\ell _k(z) + \gamma _k\langle \hat{x}^k - x^{k+1}, x^{k+1} - x\rangle \\&\qquad + \rho _k\Vert B\Vert ^2\langle \hat{y}^k - y^{k+1}, y^{k+1} - y\rangle + \tfrac{\rho _k\Vert B\Vert ^2}{2}\Vert y^{k+1}-\hat{y}^k\Vert ^2 \\&\quad = f(x) + g(y) + \rho _k\ell _k(z) + \gamma _k\langle \hat{x}^k - x^{k+1}, \hat{x}^k - x\rangle \\&\qquad - \gamma _k\Vert x^{k+1} - \hat{x}^k\Vert ^2 + \rho _k\Vert B\Vert ^2\langle \hat{y}^k - y^{k+1}, \hat{y}^k - y\rangle - \frac{\rho _k\Vert B\Vert ^2}{2}\Vert y^{k+1} - \hat{y}^k\Vert ^2, \end{aligned}$$

which is exactly (54).

(b) First, from the definition of \(\ell _k\) and \(\mathcal {Q}_k\) in (51), using \(\breve{y}^{k+1} - \hat{y}^k = \tau _k(\tilde{y}^{k+1} - \tilde{y}^k)\) and \(x^{k+1} - (1-\tau _k)x^k - \tau _k\tilde{x}^{k+1} = 0\), we can show that

$$\begin{aligned} \mathcal {Q}_k(\breve{y}^{k+1}) \overset{(51)}{=} (1-\tau _k)\ell _k(z^k) + \tau _k\ell _k(\tilde{z}^{k+1}) + \tfrac{\Vert B\Vert ^2\tau _k^2}{2}\Vert \tilde{y}^{k+1} - \tilde{y}^k\Vert ^2. \end{aligned}$$
(57)

By the convexity of f, \(\tau _k\tilde{x}^{k+1} = x^{k+1} - (1-\tau _k)x^k\) from (18), and the optimality condition of the x-subproblem at Step 4 of Algorithm 2, we can derive

$$\begin{aligned} f(x^{k+1})\le & {} (1 - \tau _k)f(x^k) + \tau _k f(x) + \tau _k \langle \nabla {f}(x^{k+1}), \tilde{x}^{k+1} - x\rangle \nonumber \\= & {} (1 - \tau _k)f(x^k) + \tau _k f(x) + \rho _k\tau _k \langle \nabla _x{\psi }(x^{k+1},\hat{y}^k),x- \tilde{x}^{k+1}\rangle \nonumber \\&+ \gamma _{0}\tau _k \langle x^{k+1} - \hat{x}^k ,x- \tilde{x}^{k+1}\rangle , \end{aligned}$$
(58)

for any \(x\in \mathbb {R}^{p_1}\), where \(\nabla {f}(x^{k + 1})\in \partial {f}(x^{k + 1})\).

By the \(\mu _g\)-strong convexity of g, \(\breve{y}^{k+1} := (1-\tau _k)y^k + \tau _k\tilde{y}^{k+1}\), and the optimality condition of the y-subproblem at Step 4 of Algorithm 2, one can also derive

$$\begin{aligned} g(\breve{y}^{k+1})\le & {} (1-\tau _k)g(y^k) + \tau _kg(y) + \tau _k\langle \nabla {g}(\tilde{y}^{k+1}), \tilde{y}^{k+1} - y\rangle - \frac{\tau _k\mu _g}{2}\Vert \tilde{y}^{k+1} - y\Vert ^2 \nonumber \\= & {} (1-\tau _k)g(y^k) + \tau _kg(y) + \rho _k\tau _k\langle \nabla _y{\psi }(x^{k+1},\hat{y}^k), y- \tilde{y}^{k+1}\rangle \nonumber \\&+ \rho _k\tau _k^2\Vert B\Vert ^2\langle \tilde{y}^{k+1} - \tilde{y}^k, y- \tilde{y}^{k+1}\rangle - \frac{\tau _k\mu _g}{2}\Vert \tilde{y}^{k+1} - y\Vert ^2, \end{aligned}$$
(59)

for any \(y\in \mathbb {R}^{p_2}\), where \(\nabla {g}(\tilde{y}^{k+1}) \in \partial {g}(\tilde{y}^{k+1})\).

Combining (57), (58), and (59), and then using the definition of \(\breve{\varPhi }_{k+1}\), we have

$$\begin{aligned}&\breve{\varPhi }_{k+1} = f(x^{k+1}) + g(\breve{y}^{k+1}) + \rho _k\mathcal {Q}_k(\breve{y}^{k+1}) \nonumber \\&\quad \overset{(57), (58), (59)}{\le } (1-\tau _k)\big [ F(z^{k}) + \rho _k\ell _k(z^k) \big ] + \tau _k\big [F(z) + \rho _k\ell _k(z) \big ] \nonumber \\&\qquad \qquad + \gamma _0\tau _k\langle x^{k+1} - \hat{x}^k, x- \tilde{x}^{k+1}\rangle + \rho _k\tau _k^2\Vert B\Vert ^2\langle \tilde{y}^{k+1} - \tilde{y}^k, y- \tilde{y}^{k+1}\rangle \nonumber \\&\qquad \qquad + \frac{1}{2}\rho _k\tau _k^2\Vert B\Vert ^2\Vert \tilde{y}^{k+1} - \tilde{y}^k\Vert ^2 - \frac{\tau _k\mu _g}{2}\Vert \tilde{y}^{k+1} - y\Vert ^2. \end{aligned}$$
(60)

Next, using (18), for any \(z= (x, y)\in \mathrm {dom}(F)\), we also have

$$\begin{aligned} 2\tau _k\langle \hat{x}^k - x^{k+1}, \tilde{x}^{k} - x\rangle= & {} \tau _k^2\Vert \tilde{x}^k - x\Vert ^2 - \tau _k^2\Vert \tilde{x}^{k+1} - x\Vert ^2 + \Vert x^{k+1} - \hat{x}^k\Vert ^2, \nonumber \\ 2\langle \tilde{y}^k - \tilde{y}^{k+1}, \tilde{y}^{k+1} - y\rangle= & {} \Vert \tilde{y}^k - y\Vert ^2 - \Vert \tilde{y}^{k+1} - y\Vert ^2 - \Vert \tilde{y}^{k+1} - \tilde{y}^k\Vert ^2. \end{aligned}$$
(61)

Substituting (61) into (60) we obtain

$$\begin{aligned} \breve{\varPhi }_{k+1}\le & {} (1-\tau _k) \big [ F(z^{k}) + \rho _k\ell _k(z^k) \big ] + \tau _k\big [ F(z) + \rho _k\ell _k(z) \big ] \\&+ \tfrac{\gamma _0\tau _k^2}{2}\Vert \tilde{x}^k - x\Vert ^2 - \tfrac{\gamma _0\tau _k^2}{2}\Vert \tilde{x}^{k+1} - x\Vert ^2 - \tfrac{\gamma _0}{2} \Vert x^{k+1} - \hat{x}^k\Vert ^2 \\&+ \tfrac{\rho _k\tau _k^2\Vert B\Vert ^2}{2}\Vert \tilde{y}^k - y\Vert ^2 - \tfrac{\left( \rho _k\tau _k^2\Vert B\Vert ^2 + \mu _g\tau _k \right) }{2}\Vert \tilde{y}^{k+1} - y\Vert ^2, \end{aligned}$$

which yields (55) after neglecting the term \(-\frac{\gamma _0}{2}\Vert x^{k+1} - \hat{x}^k\Vert ^2\). \(\square \)
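
As a side remark, the second relation in (56) is exactly the optimality condition of a linearized \(y\)-subproblem that reduces to a single proximal operator of \(g\), which is the linearization mentioned in the main text. The sketch below spells out this prox interpretation; the choice \(g = \lambda \Vert \cdot \Vert _1\) (whose prox is soft-thresholding) is an illustrative assumption, and the \(x\)-step is omitted since, consistent with the first relation in (56), it amounts to minimizing \(f(x) + \rho _k\psi (x, \hat{y}^k) + \tfrac{\gamma _k}{2}\Vert x - \hat{x}^k\Vert ^2\), which generally requires its own routine.

```python
import numpy as np

def soft_threshold(v, t):
    # prox of t * ||.||_1; the choice g = lambda * ||.||_1 is illustrative only
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def linearized_y_step(y_hat, x_next, A, B, c, rho, lam, proj_K):
    # y_{k+1} = prox_{g / (rho * ||B||^2)}( y_hat - grad_y psi(x_{k+1}, y_hat) / ||B||^2 );
    # its optimality condition is the second relation in (56).
    u = A @ x_next + B @ y_hat - c
    grad_y = B.T @ (u - proj_K(u))              # grad_y psi from (50)
    B_norm_sq = np.linalg.norm(B, 2) ** 2       # ||B||^2 (squared spectral norm)
    return soft_threshold(y_hat - grad_y / B_norm_sq, lam / (rho * B_norm_sq))
```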

A.3 The proof of Lemma 2: the key estimate of Algorithm 1

Using the fact that \(\tau _k = \frac{1}{k+1}\), we have \(\frac{\tau _{k+1}(1-\tau _k)}{\tau _k} = \frac{k}{k+2}\). Hence, the third line of Step 3 of Algorithm 1 can be written as

$$\begin{aligned} (\hat{x}^{k+1}, \hat{y}^{k+1} ) = (x^{k+1}, y^{k+1}) + \tfrac{\tau _{k+1}(1-\tau _k)}{\tau _k}(x^{k+1} - x^k, y^{k+1} - y^k). \end{aligned}$$

This step can be split into two steps with \((\hat{x}^k, \hat{y}^k)\) and \((\tilde{x}^k, \tilde{y}^k)\) as in (14), which is standard in accelerated gradient methods [4, 35]. We omit the detailed derivation.
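
As a quick numerical confirmation (not part of the original text), the snippet below checks that the one-line momentum update above coincides with the two-sequence form \(\tilde{x}^{k+1} = \tilde{x}^k + \tau _k^{-1}(x^{k+1} - \hat{x}^k)\), \(\hat{x}^{k+1} = (1-\tau _{k+1})x^{k+1} + \tau _{k+1}\tilde{x}^{k+1}\), which is our reading of the splitting in (14):

```python
import numpy as np

rng = np.random.default_rng(3)
x_prev, x_next, x_hat = rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal(4)
k = 7
tau_k, tau_next = 1.0 / (k + 1), 1.0 / (k + 2)

# One-line momentum form (third line of Step 3 of Algorithm 1):
x_hat_momentum = x_next + (tau_next * (1 - tau_k) / tau_k) * (x_next - x_prev)

# Two-sequence form: recover x_tilde_k from x_hat_k, advance it, then recombine.
x_tilde = (x_hat - (1 - tau_k) * x_prev) / tau_k   # since x_hat_k = (1 - tau_k) x_k + tau_k x_tilde_k
x_tilde = x_tilde + (x_next - x_hat) / tau_k       # x_tilde_{k+1}
x_hat_two_seq = (1 - tau_next) * x_next + tau_next * x_tilde

assert np.allclose(x_hat_momentum, x_hat_two_seq)
```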

Next, we prove (15). Using (52), we have

$$\begin{aligned} \ell _k(z^k) \le \psi (x^k, y^k) - \tfrac{1}{2}\Vert s^k - \hat{s}^{k+1}\Vert ^2, ~~\text {and}~~\ell _k(z^{\star }) \le -\tfrac{1}{2}\Vert \hat{s}^{k+1}\Vert ^2. \end{aligned}$$
(62)

Using (54) with \((x, y) = (x^k, y^k)\) and \((x, y) = (x^{\star }, y^{\star })\) respectively, we obtain

$$\begin{aligned}&\varPhi _{\rho _k}(z^{k+1}) \overset{(62)}{\le } \varPhi _{\rho _k}(z^k) + \gamma _k\langle \hat{x}^k - x^{k+1}, \hat{x}^{k} - x^k\rangle - \gamma _k\Vert \hat{x}^k - x^{k+1}\Vert ^2 \\&\qquad + \rho _k\Vert B\Vert ^2\langle \hat{y}^k - y^{k+1}, \hat{y}^{k} - y^k\rangle - \frac{\rho _k\Vert B\Vert ^2}{2}\Vert \hat{y}^k - y^{k+1}\Vert ^2 - \frac{\rho _k}{2}\Vert s^k - \hat{s}^{k+1}\Vert ^2. \\&\varPhi _{\rho _k}(z^{k+1}) \le F(z^{\star }) - \frac{\rho _k}{2}\Vert \hat{s}^{k+1}\Vert ^2 + \gamma _k\langle \hat{x}^k - x^{k+1}, \hat{x}^{k} - x^{\star }\rangle - \gamma _k\Vert \hat{x}^k - x^{k+1}\Vert ^2 \\&\quad \quad \quad + \rho _k\Vert B\Vert ^2\langle \hat{y}^k - y^{k+1}, \hat{y}^{k} - y^{\star }\rangle - \frac{\rho _k\Vert B\Vert ^2}{2}\Vert \hat{y}^k - y^{k+1}\Vert ^2. \end{aligned}$$

Multiplying the first inequality by \(1 - \tau _k \in [0, 1]\) and the second one by \(\tau _k \in [0, 1]\), and summing up the results, then using \(\hat{x}^k - (1-\tau _k)x^k = \tau _k\tilde{x}^k\) and \(\hat{y}^k - (1-\tau _k)y^k = \tau _k\tilde{y}^k\) from (14), we obtain

$$\begin{aligned} \varPhi _{\rho _k}(z^{k+1})\le & {} (1-\tau _k)\varPhi _{\rho _k}(z^k) + \tau _kF(z^{\star }) + \gamma _k\tau _k\langle \hat{x}^k - x^{k+1}, \tilde{x}^{k} - x^{\star }\rangle \nonumber \\&- \gamma _k\Vert x^{k+1} - \hat{x}^k\Vert ^2 + \rho _k\tau _k\Vert B\Vert ^2\langle \hat{y}^k - y^{k+1}, \tilde{y}^{k} - y^{\star }\rangle \nonumber \\&- \frac{\rho _k\Vert B\Vert ^2}{2}\Vert y^{k+1} - \hat{y}^k\Vert ^2-\frac{(1-\tau _{k})\rho _k}{2}\Vert s^{k}-\hat{s}^{k+1}\Vert ^{2}\nonumber \\&-\frac{\tau _{k}\rho _k}{2}\Vert \hat{s}^{k+1}\Vert ^{2} \end{aligned}$$
(63)

By the update rule in (14) we can show that

$$\begin{aligned} 2\tau _k\langle \hat{x}^k - x^{k+1}, \tilde{x}^{k} - x^{\star }\rangle= & {} \tau _k^2\Vert \tilde{x}^k - x^{\star }\Vert ^2 - \tau _k^2\Vert \tilde{x}^{k+1} - x^{\star }\Vert ^2 + \Vert x^{k+1} - \hat{x}^k\Vert ^2, \\ 2\tau _k\langle \hat{y}^k - y^{k+1}, \tilde{y}^k - y^{\star }\rangle= & {} \tau _k^2\Vert \tilde{y}^k - y^{\star }\Vert ^2 - \tau _k^2\Vert \tilde{y}^{k+1} - y^{\star }\Vert ^2 + \Vert y^{k+1} - \hat{y}^k\Vert ^2. \end{aligned}$$

Substituting this relation and \(\varPhi _{\rho _{k}}(z^k) = \varPhi _{\rho _{k-1}}(z^k) + \frac{(\rho _k - \rho _{k-1})}{2}\Vert s^k\Vert ^2\) into (63), we get

$$\begin{aligned} \varPhi _{\rho _k}(z^{k+1})\le & {} (1-\tau _k)\varPhi _{\rho _{k-1}}(z^k) + \tau _kF(z^{\star }) + \tfrac{\gamma _k\tau _k^2}{2}\left[ \Vert \tilde{x}^k - x^{\star }\Vert ^2 - \Vert \tilde{x}^{k+1} - x^{\star }\Vert ^2\right] \nonumber \\&- \tfrac{\gamma _k}{2}\Vert \hat{x}^k - x^{k+1}\Vert ^2 + \frac{\Vert B\Vert ^2\rho _k\tau _k^2}{2} \left[ \Vert \tilde{y}^k - y^{\star }\Vert ^2 - \Vert \tilde{y}^{k+1} - y^{\star }\Vert ^2\right] - R_k, \end{aligned}$$
(64)

where \(R_k\) is defined as

$$\begin{aligned} R_k:= & {} \frac{(1-\tau _k)\rho _k}{2}\Vert s^k - \hat{s}^{k+1}\Vert ^2 + \frac{\rho _k\tau _k}{2}\Vert \hat{s}^{k+1}\Vert ^2 - \frac{(1-\tau _k)(\rho _k - \rho _{k-1})}{2}\Vert s^k\Vert ^2 \nonumber \\\ge & {} \tfrac{(1-\tau _k)}{2}\left[ \rho _{k-1} - \rho _k(1-\tau _k)\right] \Vert s^k\Vert ^2. \end{aligned}$$
(65)

Substituting (65) into (64) and ignoring \(-\frac{\gamma _k}{2}\Vert x^{k+1} - \hat{x}^k\Vert ^2\), we obtain (15). \(\square \)

A.4 The proof of Lemma 3: the key estimate of Algorithm 2

The proof of (18) is similar to the proof of (14), and we skip its details here.

Setting \(z= z^{\star }\) in (55) and using (52), we obtain

$$\begin{aligned}&\breve{\varPhi }_{k+1} := f(x^{k+1}) + g(\breve{y}^{k+1}) + \rho _k\mathcal {Q}_k(\breve{y}^{k+1}) \nonumber \\&\quad \overset{(52)}{\le } (1-\tau _k) \big [ F(z^{k}) + \rho _k\ell _k(z^k) \big ] + \tau _kF(z^{\star }) - \frac{\rho _k\tau _k}{2}\Vert \hat{s}^{k+1}\Vert ^2 \nonumber \\&\qquad + \tfrac{\gamma _0\tau _k^2}{2}\Vert \tilde{x}^k - x^{\star }\Vert ^2 - \tfrac{\gamma _0\tau _k^2}{2}\Vert \tilde{x}^{k+1} - x^{\star }\Vert ^2 \nonumber \\&\qquad + \tfrac{\rho _k\tau _k^2\Vert B\Vert ^2}{2}\Vert \tilde{y}^k - y^{\star }\Vert ^2 - \tfrac{\left( \rho _k\tau _k^2\Vert B\Vert ^2 + \mu _g\tau _k\right) }{2}\Vert \tilde{y}^{k+1} - y^{\star }\Vert ^2. \end{aligned}$$
(66)

From the definition of \(\psi \) in (49) and (52), we have

$$\begin{aligned} \varPhi _{\rho _k}(z^k)= & {} \varPhi _{\rho _{k-1}}(z^k) + \frac{(\rho _k - \rho _{k-1})}{2}\Vert s^k\Vert ^2~~~\text {and}\\ \ell _k(z^k)\le & {} \psi (x^k, y^k) - \tfrac{1}{2}\Vert s^k - \hat{s}^{k+1}\Vert ^2. \end{aligned}$$

Substituting these expressions into (66), we obtain

$$\begin{aligned} \breve{\varPhi }_{k+1}\le & {} (1-\tau _k)\varPhi _{\rho _{k-1}}(z^k) + \tau _kF(z^{\star }) + \tfrac{\gamma _0\tau _k^2}{2}\Vert \tilde{x}^k - x^{\star }\Vert ^2 - \tfrac{\gamma _0\tau _k^2}{2}\Vert \tilde{x}^{k+1} - x^{\star }\Vert ^2 \nonumber \\&+ \tfrac{\Vert B\Vert ^2\rho _k\tau _k^2}{2}\Vert \tilde{y}^k - y^{\star }\Vert ^2 - \tfrac{(\Vert B\Vert ^2\rho _k\tau _k^2 + \mu _g\tau _k)}{2}\Vert \tilde{y}^{k+1} - y^{\star }\Vert ^2 - R_k, \end{aligned}$$
(67)

where \(R_k\) is defined by (65).

Let us consider two cases:

  • For Option 1 at Step 5 of Algorithm 2, we have \(y^{k+1} = \breve{y}^{k+1}\). Hence, using (53), we get

    $$\begin{aligned} \varPhi _{\rho _k}(z^{k+1}) = f(x^{k+1}) + g(y^{k+1}) + \rho _k\psi (x^{k+1},y^{k+1}) \le \breve{\varPhi }_{k+1}. \end{aligned}$$
    (68)
  • For Option 2 at Step 5 of Algorithm 2, we have

    $$\begin{aligned} \varPhi _{\rho _k}(z^{k+1})\le & {} f(x^{k+1}) + g(y^{k+1}) + \rho _k\mathcal {Q}_k(y^{k+1}) \nonumber \\= & {} f(x^{k+1}) + \displaystyle \min _{y\in \mathbb {R}^{p_2}}\Big \{ g(y) + \rho _k\mathcal {Q}_k(y) \Big \} \nonumber \\\le & {} f(x^{k+1}) + g(\breve{y}^{k+1}) + \rho _k\mathcal {Q}_k(\breve{y}^{k+1}) = \breve{\varPhi }_{k+1}. \end{aligned}$$
    (69)

Substituting either (68) or (69) into (67), we obtain

$$\begin{aligned} \varPhi _{\rho _k}(z^{k+1})\le & {} (1-\tau _k)\varPhi _{\rho _{k-1}}(z^k) + \tau _kF(z^{\star }) + \tfrac{\gamma _0\tau _k^2}{2}\Vert \tilde{x}^k - x^{\star }\Vert ^2 - \tfrac{\gamma _0\tau _k^2}{2}\Vert \tilde{x}^{k+1} - x^{\star }\Vert ^2 \nonumber \\&+ \tfrac{\Vert B\Vert ^2\rho _k\tau _k^2}{2}\Vert \tilde{y}^k - y^{\star }\Vert ^2 - \tfrac{(\Vert B\Vert ^2\rho _k\tau _k^2 + \mu _g\tau _k)}{2}\Vert \tilde{y}^{k+1} - y^{\star }\Vert ^2 - R_k, \end{aligned}$$

Applying the lower bound (65) on \(R_k\) to this inequality, we obtain (19). \(\square \)

A.5 The proof of Corollary 1: application to composite convex minimization

By the \(L_f\)-Lipschitz continuity of f and Lemma 1, we have

$$\begin{aligned}&0 \le P(y^k) - P^{\star } = f(y^k) + g(y^k) - P^{\star } \le f(x^k) + g(y^k) + \vert f(y^k) - f(x^k)\vert - P^{\star } \nonumber \\&\quad \le f(x^k) + g(y^k) - P^{\star } + L_f\Vert x^k - y^k\Vert \nonumber \\&\quad \overset{(8)}{\le } S_{\rho _{k-1}}(z^k) - \frac{\rho _{k-1}}{2}\Vert x^k - y^k\Vert ^2 + L_f\Vert x^k - y^k\Vert , \end{aligned}$$
(70)

where \(S_{\rho }(z) := \varPhi _{\rho }(z) - P^{\star }\). This inequality also leads to

$$\begin{aligned} \Vert x^k - y^k\Vert \le \frac{1}{\rho _{k-1}}\left( L_f + \sqrt{L_f^2 + 2\rho _{k-1}S_{\rho _{k-1}}(z^k)}\right) \le \frac{1}{\rho _{k-1}}\left( 2L_f + \sqrt{2\rho _{k-1}S_{\rho _{k-1}}(z^k)}\right) . \end{aligned}$$
(71)

Since using (23) is equivalent to applying Algorithm 1 to its constrained reformulation, by (16), we have

$$\begin{aligned} S_{\rho _{k-1}}(z^k) \le \frac{\rho _0\Vert y^0 - y^{\star }\Vert ^2}{2k}~~~\text {and}~~~\rho _k = \rho _0(k+1). \end{aligned}$$

Substituting these expressions into (71), we get

$$\begin{aligned} \Vert x^k - y^k\Vert \le \frac{1}{\rho _{0}k}\left( 2L_f + \sqrt{\rho _0^2\Vert y^0 - y^{\star }\Vert ^2}\right) = \frac{2L_f+\rho _0\Vert y^0-y^{\star }\Vert }{\rho _0k}. \end{aligned}$$

Substituting this into (70) and using the bound of \(S_{\rho _{k-1}}\), we obtain (25).

Now, if we use (24), then it is equivalent to applying Algorithm 2 with Option 1 to solve its constrained reformulation. In this case, from the proof of Theorem 2, we can derive

$$\begin{aligned} S_{\rho _{k-1}}(z^k) \le \frac{2(\rho _0 + \mu _g)\Vert y^0 - y^{\star }\Vert ^2}{(k+1)^2}~~~~\text {and}~~~\frac{\rho _0(k+1)^2}{4} \le \rho _{k-1} \le k^{2}\rho _0. \end{aligned}$$

Combining these estimates and (71), we have \(\Vert x^k - y^k\Vert \le \tfrac{8(L_f + (\rho _0 + \mu _g)\Vert y^0 - y^{\star }\Vert )}{\rho _0(k+1)^2}\). Substituting this into (70) and using the bound of \(S_{\rho _{k-1}}\) we obtain (26). \(\square \)
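
To make the feasibility constant explicit, the following sketch traces the last estimate numerically: it plugs the two bounds displayed above into (71) and confirms the claimed bound \(\Vert x^k - y^k\Vert \le \tfrac{8(L_f + (\rho _0 + \mu _g)\Vert y^0 - y^{\star }\Vert )}{\rho _0(k+1)^2}\). The numerical values of \(L_f\), \(\rho _0\), \(\mu _g\), and \(\Vert y^0 - y^{\star }\Vert \) are arbitrary illustrative assumptions.

```python
import numpy as np

# Illustrative values (assumptions): L_f, rho_0, mu_g, and R = ||y^0 - y*||.
L_f, rho0, mu_g, R = 2.0, 0.5, 0.3, 1.7

for k in range(1, 200):
    S_bound = 2.0 * (rho0 + mu_g) * R**2 / (k + 1) ** 2              # bound on S_{rho_{k-1}}(z^k)
    claimed = 8.0 * (L_f + (rho0 + mu_g) * R) / (rho0 * (k + 1) ** 2)
    # rho_{k-1} may lie anywhere in [rho_0 (k+1)^2 / 4, rho_0 k^2]:
    for rho in np.linspace(rho0 * (k + 1) ** 2 / 4.0, rho0 * k ** 2, 20):
        from_71 = (2.0 * L_f + np.sqrt(2.0 * rho * S_bound)) / rho   # right-hand side of (71)
        assert from_71 <= claimed + 1e-12
```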

A.6 The proof of Theorem 3: extension to the sum of three objective functions

Using the Lipschitz gradient continuity of h and [35, Theorem 2.1.5], we have

$$\begin{aligned} h(y^{k+1})\le & {} h(\hat{y}^k) + \langle \nabla {h}(\hat{y}^k), y^{k+1} - \hat{y}^k\rangle + \tfrac{L_h}{2}\Vert y^{k+1} - \hat{y}^k\Vert ^2 \\\le & {} h(\hat{y}^k) + \langle \nabla {h}(\hat{y}^k), y- \hat{y}^k\rangle + \langle \nabla {h}(\hat{y}^k), y^{k+1} - y\rangle + \tfrac{L_h}{2}\Vert y^{k+1} - \hat{y}^k\Vert ^2. \end{aligned}$$

In addition, the optimality condition of (34) is

$$\begin{aligned} 0 = \nabla {g}(y^{k+1}) + \nabla {h}(\hat{y}^k) + \rho _k\nabla _y{\psi }(x^{k+1},\hat{y}^k) + \hat{\beta }_k(y^{k+1} - \hat{y}^k), ~\nabla {g}(y^{k+1})\in \partial {g}(y^{k+1}). \end{aligned}$$

Using these expressions and the same argument as in the proof of Lemma 5, we derive

$$\begin{aligned} \varPhi _{\rho _k}(z^{k+1})\le & {} f(x) + g(y) + h(\hat{y}^k) + \langle \nabla {h}(\hat{y}^k), y- \hat{y}^k\rangle + \rho _k\ell _k(z) \nonumber \\&+ \gamma _k\langle \hat{x}^k - x^{k+1}, x^{k+1} - x\rangle + \hat{\beta }_k\langle \hat{y}^k - y^{k+1}, y^{k+1} - y\rangle \nonumber \\&+ \tfrac{\rho _k\Vert B\Vert ^2 + L_h}{2}\Vert y^{k+1} -\hat{y}^k\Vert ^2. \end{aligned}$$
(72)

Finally, using the same argument as in the proof of (15) and setting \(\hat{\beta }_k = \Vert B\Vert ^2\rho _k + L_h\), we can show that

$$\begin{aligned} \varPhi _{\rho _k}(z^{k+1})&\le (1-\tau _k)\varPhi _{\rho _{k-1}}(z^k) + \tau _kF(z^{\star }) + \tfrac{\gamma _k\tau _k^2}{2}\Vert \tilde{x}^k - x^{\star }\Vert ^2 \nonumber \\&\quad - \tfrac{\gamma _k\tau _k^2}{2}\Vert \tilde{x}^{k+1} - x^{\star }\Vert ^2 + \tfrac{\hat{\beta }_k\tau _k^2}{2}\Vert \tilde{y}^k - y^{\star }\Vert ^2 - \tfrac{\hat{\beta }_k\tau _k^2}{2}\Vert \tilde{y}^{k+1} - y^{\star }\Vert ^2 \nonumber \\&\quad - \tfrac{(1-\tau _k)}{2}\left[ \rho _{k-1} - \rho _k(1-\tau _k)\right] \Vert s^k\Vert ^2, \end{aligned}$$
(73)

where \(s^k := Ax^k + By^k - c- \mathrm {proj}_{\mathcal {K}}\big (Ax^k + By^k - c\big )\). In order to telescope, we impose the following conditions on \(\rho _k\) and \(\tau _k\):

$$\begin{aligned} \frac{(\rho _k\Vert B\Vert ^2 + L_h)\tau _k^2}{1 - \tau _k} \le (\rho _{k-1}\Vert B\Vert ^2 + L_h)\tau _{k-1}^2 ~~\text {and}~~\rho _k = \frac{\rho _{k-1}}{1-\tau _k}. \end{aligned}$$

If we choose \(\tau _{k} = \frac{1}{k+1}\), then \(\rho _k = \rho _0(k+1)\). The first condition above becomes

$$\begin{aligned}&\frac{\rho _0\Vert B\Vert ^2(k+1) + L_h}{k(k+1)} \le \frac{\rho _0\Vert B\Vert ^2k + L_h}{k^2} \\&\quad \Leftrightarrow \rho _0\Vert B\Vert ^2 k(k+1) + L_hk \le \rho _0\Vert B\Vert ^2k(k+1) + L_h(k+1). \end{aligned}$$

The latter inequality clearly holds since \(L_hk \le L_h(k+1)\).

The remaining proof of the first part of Corollary 3 is similar to the proof of Theorem 1, but with \(R_p^2 := \gamma _0 \Vert x^0 - x^{\star }\Vert ^2 + (L_h + \rho _0\Vert B\Vert ^2)\Vert y^0 - y^{\star }\Vert ^2\) due to (73).

We now prove the second part of Corollary 3. For case (i) with \(\mu _g > 0\), the proof is very similar to the proof of Theorem 2, but \(\rho _k\Vert B\Vert ^2\) is replaced by \(\hat{\beta }_k\), which is updated as \(\hat{\beta }_k = \rho _k\left\| B\right\| ^2 + L_h\). We omit the details of this analysis here and only prove the second case (ii), where \(L_h < 2\mu _h\).

Using the \(\mu _h\)-strong convexity and the Lipschitz gradient continuity of h, we can derive

$$\begin{aligned} h(\breve{y}^{k+1})\le & {} (1-\tau _k)h(y^k) + \tau _kh(\tilde{y}^{k+1}) - \frac{\mu _h\tau _k(1-\tau _k)}{2}\Vert \tilde{y}^{k+1} - y^k\Vert ^2 \\\le & {} (1-\tau _k)h(y^k) + \tau _kh(\tilde{y}^{k}) + \tau _k\langle \nabla {h}(\tilde{y}^k), \tilde{y}^{k+1} - \tilde{y}^k\rangle + \frac{\tau _kL_h}{2}\Vert \tilde{y}^{k+1} - \tilde{y}^k\Vert ^2 \\\le & {} (1-\tau _k)h(y^k) + \tau _kh(y^{\star }) + \tau _k\langle \nabla {h}(\tilde{y}^k), \tilde{y}^{k+1} -y^{\star }\rangle \\&+ \frac{\tau _kL_h}{2}\Vert \tilde{y}^{k+1} - \tilde{y}^k\Vert ^2 - \frac{\tau _k\mu _h}{2}\Vert \tilde{y}^k - y^{\star }\Vert ^2. \end{aligned}$$

Using this estimate and a similar argument as in the proof of (60), we can derive

$$\begin{aligned}&\breve{\varPhi }_{k+1} := f(x^{k+1}) + g(\breve{y}^{k+1}) + h(\breve{y}^{k+1}) + \rho _k\mathcal {Q}_k(\breve{y}^{k+1}) \\&\quad \overset{(57),(58), (59)}{\le } (1-\tau _k)\big [ F(z^{k}) + \rho _k\ell _k(z^k) \big ] + \tau _k\big [F(z^{\star }) +\rho _k\ell _{k }(\tilde{z}^{k+1})\big ]\\&\qquad \qquad \quad +\tau _k\langle \nabla {f}(x^{k+1}), \tilde{x}^{k+1} - x^{\star }\rangle +~ \tau _k\langle \nabla {g}(\tilde{y}^{k+1}) + \nabla {h}(\tilde{y}^k), \tilde{y}^{k+1}- y^{\star }\rangle \\&\qquad \qquad \quad + \frac{\left( \rho _k\tau _k^2\Vert B\Vert ^2 + \tau _k L_h \right) }{2}\Vert \tilde{y}^{k+1} - \tilde{y}^k\Vert ^2 -~ \frac{\tau _k\mu _g}{2}\Vert \tilde{y}^{k+1} - y^{\star }\Vert ^2 - \frac{\tau _k\mu _h}{2}\Vert \tilde{y}^k- y^{\star }\Vert ^2 \\&\qquad \quad \le (1-\tau _k)\big [ F(z^{k}) + \rho _k\ell _k(z^k) \big ] + \tau _k F(z^{\star }) -\tfrac{\rho _k\tau _k}{2}\Vert \hat{s}^{k+1}\Vert ^2 \\&\qquad \qquad \quad +~ \gamma _0\tau _k\langle x^{k+1} - \hat{x}^k, x^{\star } - \tilde{x}^{k+1}\rangle +~ \tau _k^2\hat{\beta }_k\langle \tilde{y}^{k+1} - \tilde{y}^{k}, y^{\star }-\tilde{y}^{k+1} \rangle \\&\qquad \qquad \quad + \frac{\left( \rho _k\tau _k^2\Vert B\Vert ^2 + \tau _kL_h\right) }{2}\Vert \tilde{y}^{k+1} - \tilde{y}^k\Vert ^2 -~ \frac{\tau _k\mu _g}{2}\Vert \tilde{y}^{k+ 1} - y^{\star }\Vert ^2 - \frac{\tau _k\mu _h}{2}\Vert \tilde{y}^k - y^{\star }\Vert ^2. \end{aligned}$$

Here, we substitute the optimality conditions of (10) and (35) into the last inequality; \(\nabla f\), \(\nabla g\), and \(\nabla h\) denote subgradients of f, g, and h, respectively.

Using the same argument as in the proof of (19), if we denote \(S_k := \varPhi _{\rho _{k-1}}(z^k) - F^{\star }\), then the last inequality above together with (36) leads to

$$\begin{aligned} S_{k+1}&+ \tfrac{\gamma _0\tau _k^2}{2}\Vert \tilde{x}^{k+1} - x^{\star }\Vert ^2 + \tfrac{\hat{\beta }_k\tau _k^2 + \mu _g\tau _k}{2}\Vert \tilde{y}^{k+1} - y^{\star }\Vert ^2 \le (1 - \tau _k)S_k + \tfrac{\gamma _0\tau _k^2}{2}\Vert \tilde{x}^k - x^{\star }\Vert ^2 \nonumber \\&+ \tfrac{\hat{\beta }_k\tau _k^2 - \tau _k\mu _h}{2}\Vert \tilde{y}^k - y^{\star }\Vert ^2 - \tfrac{(\hat{\beta }_k\tau _k^2 - \rho _k\tau _k^2\Vert B\Vert ^2 - \tau _kL_h)}{2}\Vert \tilde{y}^{k+1} - \tilde{y}^{k}\Vert ^2 \nonumber \\&- \frac{(1-\tau _k)}{2}\left[ \rho _{k-1} - \rho _k(1-\tau _k)\right] \Vert s^k\Vert ^2, \end{aligned}$$
(74)

where \(s^k := Ax^k + By^k - c- \mathrm {proj}_{\mathcal {K}}\big (Ax^k + By^k - c\big )\). We still choose the update rule for \(\tau _k\), \(\rho _k\) and \(\gamma _k\) as in Algorithm 2. Then, in order to telescope this inequality, we impose the following conditions:

$$\begin{aligned} \hat{\beta }_k = \rho _k\Vert B\Vert ^2 + \tfrac{L_h}{\tau _k}, ~~~\text {and}~~~\hat{\beta }_k\tau _k^2 - \mu _h\tau _k \le (1-\tau _k)(\hat{\beta }_{k-1}\tau _{k-1}^2 + \mu _g\tau _{k-1}). \end{aligned}$$

Substituting the first condition into the second one and noting that \(1 - \tau _k = \frac{\tau _k^2}{\tau _{k-1}^2}\) and \(\rho _k = \frac{\rho _0}{\tau _k^2}\), we obtain \( \rho _0\left\| B\right\| ^2 + L_h - \mu _h \le \frac{\tau _k}{\tau _{k-1}}(L_h + \mu _g)\). This condition holds if \(\rho _0 \le \frac{\mu _g + 2\mu _h - L_h}{2\Vert B\Vert ^2}\), which is positive since \(L_h < 2\mu _h\). Then, (74) gives the same conclusion as in Theorem 2. \(\square \)
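
For reference, the parameter recursions used above, \(1 - \tau _k = \frac{\tau _k^2}{\tau _{k-1}^2}\) and \(\rho _k = \frac{\rho _0}{\tau _k^2}\), can be simulated directly; the starting value \(\tau _0 = 1\) is an assumption we make for consistency with \(\rho _0 = \rho _0/\tau _0^2\). The sketch below also confirms numerically the growth bounds on \(\rho _{k-1}\) quoted in Sect. A.5.

```python
import math

rho0 = 0.5                      # illustrative value
tau, rho = 1.0, rho0            # tau_0 = 1 (assumed), so rho_0 = rho0 / tau_0^2

for k in range(1, 200):
    # Solve tau_k^2 = (1 - tau_k) * tau_{k-1}^2 for the positive root tau_k.
    tau = 0.5 * tau * (math.sqrt(tau * tau + 4.0) - tau)
    rho = rho0 / tau ** 2       # rho_k = rho_0 / tau_k^2
    # Bounds from Sect. A.5, rho_0 (k+1)^2 / 4 <= rho_{k-1} <= k^2 rho_0, written here for rho_k:
    assert rho0 * (k + 2) ** 2 / 4.0 <= rho <= rho0 * (k + 1) ** 2
```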

About this article

Cite this article

Tran-Dinh, Q. Proximal alternating penalty algorithms for nonsmooth constrained convex optimization. Comput Optim Appl 72, 1–43 (2019). https://doi.org/10.1007/s10589-018-0033-z
