
An Accelerated Coordinate Gradient Descent Algorithm for Non-separable Composite Optimization


Abstract

Coordinate descent algorithms are popular in machine learning and large-scale data analysis due to their low per-iteration computational cost and their strong practical performance. In this work, we define a monotone accelerated coordinate gradient descent-type method for problems consisting of minimizing \(f+g\), where f is quadratic and g is nonsmooth, non-separable, and has a low-complexity proximal mapping. The algorithm is enabled by employing the forward–backward envelope, a composite envelope that provides an exact smooth reformulation of \(f+g\). We prove that the algorithm achieves a convergence rate of \(O(1/k^{1.5})\) in terms of the original objective function, improving on current coordinate descent-type algorithms. In addition, we describe an adaptive variant of the algorithm that backtracks the spectral information and coordinate Lipschitz constants of the problem. We numerically examine our algorithms in various settings, including two-dimensional total-variation-based image inpainting problems, and observe a clear performance advantage over current coordinate descent-type methods.
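For orientation, a standard way to define the forward–backward envelope of \(f+g\) with parameter \(\gamma >0\) (see [36] and [18]; the paper's exact parameterization may differ) is

$$ F_\gamma (\mathbf{x}) = \min _{\mathbf{u}} \left\{ f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{u}-\mathbf{x}\rangle + g(\mathbf{u}) + \frac{1}{2\gamma }\Vert \mathbf{u}-\mathbf{x}\Vert ^2 \right\} , $$

which, for suitably small \(\gamma \), is a smooth function with the same minimizers and optimal value as \(f+g\).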


Notes

  1. In the extreme case where \(\mathbf{M} = \mathbf{0}\), the condition is \(\mu \in (0,\infty )\).

  2. Recall that for a nonempty closed and convex set C, \(P_C\) denotes the orthogonal projection operator.

  3. The existence of such a Lipschitz constant is warranted by Lemma 4.1.

  4. We describe below how to reproduce these synthetic datasets, and they are available from the authors on reasonable request.

  5. This standard dataset is available at https://github.com/aaberdam/AdaLISTA.

References

  1. Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16(3), 697–725 (2006)

  2. Barbero, A., Sra, S.: Modular proximal optimization for multidimensional total-variation regularization. arXiv preprint arXiv:1411.0589 (2014)

  3. Beck, A.: First-Order Methods in Optimization, vol. 25. SIAM (2017)

  4. Beck, A., Pauwels, E., Sabach, S.: The cyclic block conditional gradient method for convex optimization problems. SIAM J. Optim. 25(4), 2024–2049 (2015)

  5. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  6. Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM J. Optim. 22(2), 557–580 (2012)

  7. Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM J. Optim. 23(4), 2037–2060 (2013)

  8. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific Optimization and Computation Series, 2nd edn. Athena Scientific, Belmont (1999)

  9. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods, vol. 23. Prentice Hall, Englewood Cliffs (1989)

  10. Burges, C.J.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2(2), 121–167 (1998)

  11. Combettes, P.L., Pesquet, J.C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015)

  12. Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 57(11), 1413–1457 (2004)

  13. Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the \(\ell _1\)-ball for learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning, pp. 272–279 (2008)

  14. Fercoq, O., Bianchi, P.: A coordinate-descent primal-dual algorithm with large step size and possibly nonseparable functions. SIAM J. Optim. 29(1), 100–134 (2019)

  15. Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)

  16. Friedman, J., Hastie, T., Höfling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann. Appl. Stat. 1(2), 302–332 (2007)

  17. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)

  18. Giselsson, P., Fält, M.: Envelope functions: unifications and further properties. J. Optim. Theory Appl. 178(3), 673–698 (2018). https://doi.org/10.1007/s10957-018-1328-z

  19. Hanzely, F., Kovalev, D., Richtárik, P.: Variance reduced coordinate descent with acceleration: new method with a surprising application to finite-sum problems. arXiv preprint arXiv:2002.04670 (2020)

  20. Hanzely, F., Mishchenko, K., Richtárik, P.: SEGA: variance reduction via gradient sketching. In: Advances in Neural Information Processing Systems, vol. 31, pp. 2082–2093 (2018)

  21. Hanzely, F., Richtárik, P.: One method to rule them all: variance reduction for data, parameters and many new methods. arXiv preprint arXiv:1905.11266 (2019)

  22. Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press (2015)

  23. Hong, M., Wang, X., Razaviyayn, M., Luo, Z.Q.: Iteration complexity analysis of block coordinate descent methods. Math. Program. 163(1–2), 85–114 (2017)

  24. Johnson, N.A.: A dynamic programming algorithm for the fused Lasso and \(L_0\)-segmentation. J. Comput. Graph. Stat. 22(2), 246–260 (2013)

  25. Kolmogorov, V., Pock, T., Rolinek, M.: Total variation on a tree. SIAM J. Imaging Sci. 9(2), 605–636 (2016)

  26. Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P.: Block-coordinate Frank-Wolfe optimization for structural SVMs. In: Dasgupta, S., McAllester, D. (eds.) Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 53–61. PMLR (2013)

  27. Latafat, P., Themelis, A., Patrinos, P.: Block-coordinate and incremental aggregated proximal gradient methods for nonsmooth nonconvex problems. Math. Program. 1–30 (2021)

  28. Lu, H., Freund, R., Mirrokni, V.: Accelerating greedy coordinate descent methods. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 3257–3266. PMLR (2018)

  29. Maculan, N., Santiago, C.P., Macambira, E., Jardim, M.: An O(n) algorithm for projecting a vector on the intersection of a hyperplane and a box in \(\mathbb {R}^n\). J. Optim. Theory Appl. 117(3), 553–574 (2003)

  30. Markowitz, H.: Portfolio selection. J. Finance 7(1), 77–91 (1952)

  31. Moreau, J.J.: Proximité et dualité dans un espace hilbertien. Bull. de la Société Mathématique de France 93, 273–299 (1965)

  32. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  33. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)

  34. Ortega, J.M., Rheinboldt, W.C.: Iterative Solution of Nonlinear Equations in Several Variables, vol. 30. SIAM (1970)

  35. Rockafellar, R.T.: Convex Analysis. Princeton Mathematical Series, No. 28. Princeton University Press, Princeton (1970)

  36. Stella, L., Themelis, A., Patrinos, P.: Forward–backward quasi-Newton methods for nonsmooth optimization problems. Comput. Optim. Appl. 67(3), 443–487 (2017)

  37. Tseng, P.: On accelerated proximal gradient methods for convex–concave optimization. Unpublished manuscript (2008)

  38. Wright, S.J.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)


Acknowledgements

The research of A. Beck is supported by the ISF Grant 926-21. A. Aberdam thanks the Azrieli foundation for providing additional research support. The authors would like to thank two anonymous reviewers for their valuable suggestions that improved the final manuscript.

Author information

Corresponding author

Correspondence to Amir Beck.

Additional information

Communicated by Massimo Pappalardo.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Proof of Theorem 4.1

Throughout the proof, we use the notation \(\mathbf{L}= \mathrm{diag}(\{L_i\}_{i=1}^n)\) and denote the \(\mathbf{L}\)-inner product and the \(\mathbf{L}\)-norm by

$$ \langle \mathbf{x},\mathbf{y}\rangle _{\mathbf{L}} \equiv \textstyle \sum _{i=1}^n L_i x_i y_i, \qquad \Vert \mathbf{x}\Vert _{\mathbf{L}} \equiv \sqrt{\langle \mathbf{x},\mathbf{x}\rangle _{\mathbf{L}}} =\sqrt{\textstyle \sum _{i=1}^n L_i x_i^2}. $$
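As a small, self-contained illustration of these weighted quantities, here is a minimal NumPy sketch; the array names are assumptions made only for this example.

```python
import numpy as np

def inner_L(x, y, L):
    """<x, y>_L = sum_i L_i * x_i * y_i."""
    return float(np.sum(L * x * y))

def norm_L(x, L):
    """||x||_L = sqrt(<x, x>_L)."""
    return float(np.sqrt(inner_L(x, x, L)))

# illustrative data; L_i > 0 play the role of the coordinate Lipschitz constants
L = np.array([1.0, 2.0, 4.0])
x = np.array([1.0, -1.0, 0.5])
y = np.array([0.0, 3.0, 2.0])
print(inner_L(x, y, L), norm_L(x, L))   # -2.0  2.0
```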

By the definition of step 4 and the block descent lemma [3, Lemma 11.8], it follows that

$$ H({\tilde{\mathbf{x}}}^{k+1}) \le H(\mathbf{y}^k)+\nabla _{i_k} H(\mathbf{y}^k)({\tilde{x}}^{k+1}_{i_k}-y^k_{i_k})+\frac{L_{i_k}}{2}({\tilde{x}}^{k+1}_{i_k}-y^k_{i_k})^2. $$

Taking the expectation with respect to \(i_{k}\), and recalling that \({\tilde{\mathbf{x}}}^{k+1} = \mathbf{y}^k-\frac{1}{L_{i_k}} \nabla _{i_k} H(\mathbf{y}^k)\mathbf{e}_{i_k}\), we obtain

$$\begin{aligned} {\mathbb {E}}_{i_k} H({\tilde{\mathbf{x}}}^{k+1}) \le H(\mathbf{y}^k)+ \nabla H(\mathbf{y}^k)^T(\mathbf{s}^{k+1}-\mathbf{y}^k)+\frac{n}{2} \Vert \mathbf{s}^{k+1}-\mathbf{y}^k\Vert _{\mathbf{L}}^2,\nonumber \\\end{aligned}$$
(7.1)

where \(\mathbf{s}^{k+1}= \mathbf{y}^k- \frac{1}{n} \mathbf{L}^{-1}\nabla H(\mathbf{y}^k).\) Define

$$\begin{aligned} \mathbf{t}^{k+1}\equiv & {} \mathbf{z}^k- \frac{1}{n\theta ^k } \mathbf{L}^{-1}\nabla H(\mathbf{y}^k)\nonumber \\= & {} \displaystyle \mathop {\mathrm{argmin}}_{\mathbf{y}} \left\{ \nabla H(\mathbf{y}^k)^T(\mathbf{y}-\mathbf{z}^k)+\frac{n\theta ^k }{2} \Vert \mathbf{y}-\mathbf{z}^k\Vert _{\mathbf{L}}^2 \right\} . \end{aligned}$$
(7.2)
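The second expression in (7.2) is the first-order optimality condition of the strongly convex quadratic minimized there: setting its gradient with respect to \(\mathbf{y}\) to zero gives

$$ \nabla H(\mathbf{y}^k) + n\theta ^k \mathbf{L}(\mathbf{y}-\mathbf{z}^k) = \mathbf{0} \quad \Longleftrightarrow \quad \mathbf{y}= \mathbf{z}^k- \frac{1}{n\theta ^k } \mathbf{L}^{-1}\nabla H(\mathbf{y}^k). $$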

Note that \(\theta ^k (\mathbf{t}^{k+1}-\mathbf{z}^k) = -\frac{1}{n}\mathbf{L}^{-1}\nabla H(\mathbf{y}^k) = \mathbf{s}^{k+1}-\mathbf{y}^k\). Thus, by (7.1) and the fact that \(H(\mathbf{x}^{k+1})\le H({\tilde{\mathbf{x}}}^{k+1})\) (step 5), it follows that

$$\begin{aligned} {\mathbb {E}}_{i_k} H(\mathbf{x}^{k+1})\le & {} {\mathbb {E}}_{i_k} H({\tilde{\mathbf{x}}}^{k+1}) \le H(\mathbf{y}^k)+ \nabla H(\mathbf{y}^k)^T(\mathbf{s}^{k+1}-\mathbf{y}^k)+\frac{n}{2} \Vert \mathbf{s}^{k+1}-\mathbf{y}^k\Vert _{\mathbf{L}}^2\nonumber \\= & {} H(\mathbf{y}^k)+\theta ^k \left[ \nabla H(\mathbf{y}^k)^T(\mathbf{t}^{k+1}-\mathbf{z}^k)+\frac{n\theta ^k }{2} \Vert \mathbf{t}^{k+1}-\mathbf{z}^k\Vert _{\mathbf{L}}^2 \right] . \end{aligned}$$
(7.3)

By Tseng’s three-points property [37, Property 1] and the relation (7.2), we have

$$\begin{aligned} \nabla H(\mathbf{y}^k)^T(\mathbf{x}^*-\mathbf{z}^k) +\frac{n\theta ^k}{2} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2- \nabla H(\mathbf{y}^k)^T(\mathbf{t}^{k+1}-\mathbf{z}^k)\nonumber \\ -\frac{n\theta ^k}{2} \Vert \mathbf{t}^{k+1}-\mathbf{z}^k\Vert _{\mathbf{L}}^2 \ge \frac{n\theta ^k}{2}\Vert \mathbf{x}^*-\mathbf{t}^{k+1}\Vert _{\mathbf{L}}^2. \end{aligned}$$
(7.4)

Combining the above with (7.3) yields

$$\begin{aligned} {\mathbb {E}}_{i_k} H(\mathbf{x}^{k+1})\le & {} H(\mathbf{y}^k)+\theta ^k \left[ \nabla H(\mathbf{y}^k)^T(\mathbf{x}^*-\mathbf{z}^k)+\frac{n\theta ^k}{2} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2 -\frac{n\theta ^k}{2} \Vert \mathbf{x}^*-\mathbf{t}^{k+1}\Vert _{\mathbf{L}}^2 \right] \nonumber \\= & {} H(\mathbf{y}^k)+\theta ^k \left[ \nabla H(\mathbf{y}^k)^T(\mathbf{x}^*-\mathbf{z}^k)+\frac{n^2\theta ^k}{2} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2 -\frac{n^2\theta ^k}{2} {\mathbb {E}}_{i_k}\Vert \mathbf{x}^*-\mathbf{z}^{k+1}\Vert _{\mathbf{L}}^2 \right] ,\nonumber \\ \end{aligned}$$
(7.5)

where the equality follows by the following argument:

$$\begin{aligned} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2-\Vert \mathbf{x}^*-\mathbf{t}^{k+1}\Vert _{\mathbf{L}}^2&= 2\langle \mathbf{t}^{k+1}-\mathbf{z}^k, \mathbf{x}^*-\mathbf{z}^k\rangle _{\mathbf{L}} -\Vert \mathbf{t}^{k+1}-\mathbf{z}^k\Vert _{\mathbf{L}}^2\\ &= 2n {\mathbb {E}}_{i_k}\langle \mathbf{z}^{k+1}-\mathbf{z}^k, \mathbf{x}^*-\mathbf{z}^k\rangle _{\mathbf{L}} -n{\mathbb {E}}_{i_k}\Vert \mathbf{z}^{k+1}-\mathbf{z}^k\Vert _{\mathbf{L}}^2 \\ &= n{\mathbb {E}}_{i_k} \left( \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2-\Vert \mathbf{x}^*-\mathbf{z}^{k+1}\Vert _{\mathbf{L}}^2\right) . \end{aligned}$$
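The second equality above holds because only the \(i_k\)-th coordinate of \(\mathbf{z}\) is updated, with \(i_k\) drawn uniformly from \(\{1,\dots ,n\}\). Writing the z-update (not reproduced on this page; the form below is reconstructed so as to be consistent with (7.2)) as

$$ \mathbf{z}^{k+1} = \mathbf{z}^k- \frac{1}{n\theta ^k L_{i_k}} \nabla _{i_k} H(\mathbf{y}^k)\mathbf{e}_{i_k}, $$

we have \(z^{k+1}_{i}-z^k_{i} = t^{k+1}_{i}-z^k_{i}\) if \(i=i_k\) and \(0\) otherwise, and therefore

$$ {\mathbb {E}}_{i_k}\langle \mathbf{z}^{k+1}-\mathbf{z}^k, \mathbf{x}^*-\mathbf{z}^k\rangle _{\mathbf{L}} = \tfrac{1}{n}\langle \mathbf{t}^{k+1}-\mathbf{z}^k, \mathbf{x}^*-\mathbf{z}^k\rangle _{\mathbf{L}}, \qquad {\mathbb {E}}_{i_k}\Vert \mathbf{z}^{k+1}-\mathbf{z}^k\Vert _{\mathbf{L}}^2 = \tfrac{1}{n}\Vert \mathbf{t}^{k+1}-\mathbf{z}^k\Vert _{\mathbf{L}}^2. $$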

Now, using the update formula in step 2, we have

$$\begin{aligned} \nabla H(\mathbf{y}^k)^T(\theta ^k\mathbf{x}^*- \theta ^k \mathbf{z}^k)= & {} \nabla H(\mathbf{y}^k)^T(\theta ^k\mathbf{x}^*-\mathbf{y}^k+(1-\theta ^k)\mathbf{x}^k) \nonumber \\= & {} \theta ^k \nabla H(\mathbf{y}^k)^T(\mathbf{x}^*-\mathbf{y}^k)+(1-\theta ^k) \nabla H(\mathbf{y}^k)^T(\mathbf{x}^k-\mathbf{y}^k).\nonumber \\ \end{aligned}$$
(7.6)
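For clarity, the gradient inequality is applied at the two points \(\mathbf{x}^*\) and \(\mathbf{x}^k\):

$$ H(\mathbf{x}^*) \ge H(\mathbf{y}^k)+\nabla H(\mathbf{y}^k)^T(\mathbf{x}^*-\mathbf{y}^k), \qquad H(\mathbf{x}^k) \ge H(\mathbf{y}^k)+\nabla H(\mathbf{y}^k)^T(\mathbf{x}^k-\mathbf{y}^k), $$

so that \(H(\mathbf{y}^k)+\theta ^k \nabla H(\mathbf{y}^k)^T(\mathbf{x}^*-\mathbf{y}^k)+(1-\theta ^k) \nabla H(\mathbf{y}^k)^T(\mathbf{x}^k-\mathbf{y}^k) \le \theta ^k H(\mathbf{x}^*)+(1-\theta ^k)H(\mathbf{x}^k)\).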

Thus, combining (7.5) and (7.6) along with the gradient inequality, the following is implied:

$$\begin{aligned} {\mathbb {E}}_{i_k} H(\mathbf{x}^{k+1}) \le (1-\theta ^k)H(\mathbf{x}^k)+\theta ^k H(\mathbf{x}^*)+\frac{n^2(\theta ^k)^2}{2} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2 -\frac{n^2 (\theta ^k)^2}{2} {\mathbb {E}}_{i_k}\Vert \mathbf{x}^*-\mathbf{z}^{k+1}\Vert _{\mathbf{L}}^2, \end{aligned}$$
(7.7)

which is the same as

$$\begin{aligned} {\mathbb {E}}_{i_k} H(\mathbf{x}^{k+1})-H(\mathbf{x}^*) \le (1-\theta ^k)(H(\mathbf{x}^k)-H(\mathbf{x}^*))+\frac{n^2(\theta ^k)^2}{2} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2 -\frac{n^2 (\theta ^k)^2}{2} {\mathbb {E}}_{i_k}\Vert \mathbf{x}^*-\mathbf{z}^{k+1}\Vert _{\mathbf{L}}^2. \end{aligned}$$
(7.8)

Taking expectation over \(\xi _{k-1}\) leads to

$$\begin{aligned} {\mathbb {E}}_{\xi _k} H(\mathbf{x}^{k+1})-H(\mathbf{x}^*) \le (1-\theta ^k)({\mathbb {E}}_{\xi _{k-1}}H(\mathbf{x}^k)-H(\mathbf{x}^*)) +\frac{n^2(\theta ^k)^2}{2} {\mathbb {E}}_{\xi _{k-1}} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2 -\frac{n^2 (\theta ^k)^2}{2} {\mathbb {E}}_{\xi _k}\Vert \mathbf{x}^*-\mathbf{z}^{k+1}\Vert _{\mathbf{L}}^2. \end{aligned}$$
(7.9)

Denoting \(e_k \equiv {\mathbb {E}}_{\xi _{k-1}} H(\mathbf{x}^{k})-H(\mathbf{x}^*)\) and \(\Delta _k \equiv \frac{n^2}{2}{\mathbb {E}}_{\xi _{k-1}} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2\), we can rewrite (7.9) as

$$ e_{k+1}\le (1-\theta ^k)e_k+(\theta ^k)^2\Delta _k-(\theta ^k)^2\Delta _{k+1}.$$

Dividing the inequality by \((\theta ^k)^2\) yields

$$ \frac{1}{(\theta ^k)^2} e_{k+1}\le \frac{1-\theta ^k}{(\theta ^k)^2}e_k+\Delta _k-\Delta _{k+1}.$$

By the definition of the sequence \(\theta ^k\) (Step 6), the above is the same as

$$ \frac{1}{(\theta ^k)^2} e_{k+1}\le \frac{1}{(\theta ^{k-1})^2}e_k+\Delta _k-\Delta _{k+1},$$

and hence,

$$ \frac{1}{(\theta ^k)^2} e_{k+1}+\Delta _{k+1}\le \frac{1}{(\theta ^{k-1})^2}e_k+\Delta _k.$$
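Spelling out the telescoping argument used in the final step: applying the last inequality repeatedly, and using its \(k=0\) instance together with \(\theta ^0=1\) (so that the \((1-\theta ^0)e_0\) term vanishes), gives

$$ \frac{1}{(\theta ^{k-1})^2} e_{k}+\Delta _{k} \le \frac{1}{(\theta ^{k-2})^2}e_{k-1}+\Delta _{k-1} \le \cdots \le \frac{1}{(\theta ^{0})^2}e_{1}+\Delta _{1} \le \Delta _0. $$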

Since \(\theta ^0=1\), it follows that \(\frac{1}{(\theta ^{k-1})^2} e_{k}\le \Delta _0\), which, combined with the facts that \(\Delta _0 = \frac{n^2}{2}\Vert \mathbf{x}^*-\mathbf{x}^0\Vert _{\mathbf{L}}^2\) and \(\theta ^k \le \frac{2}{k+2}\) (see [37]), leads to the desired result (4.3).
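For readers who wish to experiment, the following Python sketch reconstructs a generic accelerated randomized coordinate scheme from the update formulas referenced in this proof (steps 2, 4, 5 and 6, together with a z-update consistent with (7.2)), applied to a plain smooth convex quadratic \(H\). It is an illustrative sketch only, not the paper's ACGD method (which operates on the forward–backward envelope of \(f+g\)); the test problem, the choice \(\mathbf{z}^0=\mathbf{x}^0\), the way the monotone step is realized, and all names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative smooth convex test problem: H(x) = 0.5 x^T A x - b^T x  (assumption)
n = 50
B = rng.standard_normal((n, n))
A = B.T @ B + 0.1 * np.eye(n)            # positive definite Hessian
b = rng.standard_normal(n)
H = lambda v: 0.5 * v @ A @ v - b @ v
L = np.diag(A).copy()                    # coordinate Lipschitz constants of grad H
x_star = np.linalg.solve(A, b)           # minimizer, used only for monitoring

x = np.zeros(n)                          # x^0
z = x.copy()                             # assumption: z^0 = x^0
theta = 1.0                              # theta^0 = 1

for k in range(5000):
    y = (1.0 - theta) * x + theta * z    # step 2: y^k = (1 - theta^k) x^k + theta^k z^k
    i = rng.integers(n)                  # i_k uniform on {1, ..., n}
    g_i = A[i] @ y - b[i]                # nabla_{i_k} H(y^k)
    x_tilde = y.copy()
    x_tilde[i] -= g_i / L[i]             # step 4: coordinate gradient step
    z[i] -= g_i / (n * theta * L[i])     # z-update consistent with (7.2) (reconstructed)
    if H(x_tilde) <= H(x):               # monotone step 5: keep the better point, so that
        x = x_tilde                      # H(x^{k+1}) <= H(x~^{k+1}) holds in both branches
    # step 6: theta^{k+1} solves (1 - t)/t^2 = 1/(theta^k)^2
    theta = 0.5 * theta * (np.sqrt(theta**2 + 4.0) - theta)
    assert theta <= 2.0 / (k + 3) + 1e-12   # theta^{k+1} <= 2/((k+1)+2), the bound cited from [37]

print("H(x^K) - H(x^*) =", H(x) - H(x_star))
```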

About this article


Cite this article

Aberdam, A., Beck, A. An Accelerated Coordinate Gradient Descent Algorithm for Non-separable Composite Optimization. J Optim Theory Appl 193, 219–246 (2022). https://doi.org/10.1007/s10957-021-01957-1

