
An Accelerated Coordinate Gradient Descent Algorithm for Non-separable Composite Optimization


Abstract

Coordinate descent algorithms are popular in machine learning and large-scale data analysis due to their low per-iteration computational cost and their strong practical performance. In this work, we define a monotone accelerated coordinate gradient descent-type method for problems consisting of minimizing \(f+g\), where f is quadratic and g is nonsmooth, non-separable, and has a low-complexity proximal mapping. The algorithm is enabled by employing the forward–backward envelope, a composite envelope that provides an exact smooth reformulation of \(f+g\). We prove that the algorithm achieves a convergence rate of \(O(1/k^{1.5})\) in terms of the original objective function, improving on current coordinate descent-type algorithms. In addition, we describe an adaptive variant of the algorithm that backtracks the spectral information and coordinate Lipschitz constants of the problem. We numerically examine our algorithms in various settings, including two-dimensional total-variation-based image inpainting problems, and observe a clear performance advantage over current coordinate descent-type methods.
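For orientation, a standard way to define the forward–backward envelope of \(f+g\) with parameter \(\gamma >0\) (see [36] and [18]; the paper's exact parameterization may differ) is

$$ F_\gamma (\mathbf{x}) = \min _{\mathbf{u}} \left\{ f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{u}-\mathbf{x}\rangle + g(\mathbf{u}) + \frac{1}{2\gamma }\Vert \mathbf{u}-\mathbf{x}\Vert ^2 \right\} , $$

which, for suitably small \(\gamma \), is a smooth function with the same minimizers and optimal value as \(f+g\).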


Notes

  1. In the extreme case where \(\mathbf{M} = \mathbf{0}\), the condition is \(\mu \in (0,\infty )\).

  2. Recall that for a nonempty closed and convex set C, \(P_C\) denotes the orthogonal projection operator.

  3. The existence of such a Lipschitz constant is warranted by Lemma 4.1.

  4. We describe below how to reproduce these synthetic datasets, and they are available from the authors on reasonable request.

  5. This standard dataset is available at https://github.com/aaberdam/AdaLISTA.

References

  1. Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16(3), 697–725 (2006)

  2. Barbero, A., Sra, S.: Modular proximal optimization for multidimensional total-variation regularization. arXiv preprint arXiv:1411.0589 (2014)

  3. Beck, A.: First-Order Methods in Optimization, vol. 25. SIAM (2017)

  4. Beck, A., Pauwels, E., Sabach, S.: The cyclic block conditional gradient method for convex optimization problems. SIAM J. Optim. 25(4), 2024–2049 (2015)

  5. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  6. Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM J. Optim. 22(2), 557–580 (2012)

  7. Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM J. Optim. 23(4), 2037–2060 (2013)

  8. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific Optimization and Computation Series, 2nd edn. Athena Scientific, Belmont (1999)

  9. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods, vol. 23. Prentice Hall, Englewood Cliffs (1989)

  10. Burges, C.J.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2(2), 121–167 (1998)

  11. Combettes, P.L., Pesquet, J.C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015)

  12. Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 57(11), 1413–1457 (2004)

  13. Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the \(\ell _1\)-ball for learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning, pp. 272–279 (2008)

  14. Fercoq, O., Bianchi, P.: A coordinate-descent primal-dual algorithm with large step size and possibly nonseparable functions. SIAM J. Optim. 29(1), 100–134 (2019)

  15. Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)

  16. Friedman, J., Hastie, T., Höfling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann. Appl. Stat. 1(2), 302–332 (2007)

  17. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)

  18. Giselsson, P., Fält, M.: Envelope functions: unifications and further properties. J. Optim. Theory Appl. 178(3), 673–698 (2018). https://doi.org/10.1007/s10957-018-1328-z

  19. Hanzely, F., Kovalev, D., Richtárik, P.: Variance reduced coordinate descent with acceleration: new method with a surprising application to finite-sum problems. arXiv preprint arXiv:2002.04670 (2020)

  20. Hanzely, F., Mishchenko, K., Richtárik, P.: SEGA: variance reduction via gradient sketching. In: Advances in Neural Information Processing Systems, vol. 31, pp. 2082–2093 (2018)

  21. Hanzely, F., Richtárik, P.: One method to rule them all: variance reduction for data, parameters and many new methods. arXiv preprint arXiv:1905.11266 (2019)

  22. Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press (2015)

  23. Hong, M., Wang, X., Razaviyayn, M., Luo, Z.Q.: Iteration complexity analysis of block coordinate descent methods. Math. Program. 163(1–2), 85–114 (2017)

  24. Johnson, N.A.: A dynamic programming algorithm for the fused Lasso and \(L_0\)-segmentation. J. Comput. Graph. Stat. 22(2), 246–260 (2013)

  25. Kolmogorov, V., Pock, T., Rolinek, M.: Total variation on a tree. SIAM J. Imaging Sci. 9(2), 605–636 (2016)

  26. Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P.: Block-coordinate Frank-Wolfe optimization for structural SVMs. In: Dasgupta, S., McAllester, D. (eds.) Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 53–61. PMLR (2013)

  27. Latafat, P., Themelis, A., Patrinos, P.: Block-coordinate and incremental aggregated proximal gradient methods for nonsmooth nonconvex problems. Math. Program. 1–30 (2021)

  28. Lu, H., Freund, R., Mirrokni, V.: Accelerating greedy coordinate descent methods. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 3257–3266. PMLR (2018)

  29. Maculan, N., Santiago, C.P., Macambira, E., Jardim, M.: An O(n) algorithm for projecting a vector on the intersection of a hyperplane and a box in \(\mathbb {R}^n\). J. Optim. Theory Appl. 117(3), 553–574 (2003)

  30. Markowitz, H.: Portfolio selection. J. Finance 7(1), 77–91 (1952)

  31. Moreau, J.J.: Proximité et dualité dans un espace hilbertien. Bull. de la Société Mathématique de France 93, 273–299 (1965)

  32. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  33. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)

  34. Ortega, J.M., Rheinboldt, W.C.: Iterative Solution of Nonlinear Equations in Several Variables, vol. 30. SIAM (1970)

  35. Rockafellar, R.T.: Convex Analysis. Princeton Mathematical Series, No. 28. Princeton University Press, Princeton (1970)

  36. Stella, L., Themelis, A., Patrinos, P.: Forward–backward quasi-Newton methods for nonsmooth optimization problems. Comput. Optim. Appl. 67(3), 443–487 (2017)

  37. Tseng, P.: On accelerated proximal gradient methods for convex–concave optimization. Unpublished manuscript (2008)

  38. Wright, S.J.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)


Acknowledgements

The research of A. Beck is supported by the ISF Grant 926-21. A. Aberdam thanks the Azrieli foundation for providing additional research support. The authors would like to thank two anonymous reviewers for their valuable suggestions that improved the final manuscript.

Author information

Corresponding author

Correspondence to Amir Beck.

Additional information

Communicated by Massimo Pappalardo.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Proof of Theorem 4.1

Throughout the proof, we use the notation \(\mathbf{L}= \mathrm{diag}(\{L_i\}_{i=1}^n)\) and denote the \(\mathbf{L}\)-inner product and the \(\mathbf{L}\)-norm by

$$ \langle \mathbf{x},\mathbf{y}\rangle _{\mathbf{L}} \equiv \textstyle \sum _{i=1}^n L_i x_i y_i, \qquad \Vert \mathbf{x}\Vert _{\mathbf{L}} \equiv \sqrt{\langle \mathbf{x},\mathbf{x}\rangle _{\mathbf{L}}} =\sqrt{\textstyle \sum _{i=1}^n L_i x_i^2}. $$
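As a small, self-contained illustration of these weighted quantities, here is a minimal NumPy sketch; the array names are assumptions made only for this example.

```python
import numpy as np

def inner_L(x, y, L):
    """<x, y>_L = sum_i L_i * x_i * y_i."""
    return float(np.sum(L * x * y))

def norm_L(x, L):
    """||x||_L = sqrt(<x, x>_L)."""
    return float(np.sqrt(inner_L(x, x, L)))

# illustrative data; L_i > 0 play the role of the coordinate Lipschitz constants
L = np.array([1.0, 2.0, 4.0])
x = np.array([1.0, -1.0, 0.5])
y = np.array([0.0, 3.0, 2.0])
print(inner_L(x, y, L), norm_L(x, L))   # -2.0  2.0
```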

By the definition of step 4 and the block descent lemma [3, Lemma 11.8], it follows that

$$ H({\tilde{\mathbf{x}}}^{k+1}) \le H(\mathbf{y}^k)+\nabla _{i_k} H(\mathbf{y}^k)({\tilde{x}}^{k+1}_{i_k}-y^k_{i_k})+\frac{L_{i_k}}{2}({\tilde{x}}^{k+1}_{i_k}-y^k_{i_k})^2. $$

Taking the expectation with respect to \(i_{k}\), and recalling that \({\tilde{\mathbf{x}}}^{k+1} = \mathbf{y}^k-\frac{1}{L_{i_k}} \nabla _{i_k} H(\mathbf{y}^k)\mathbf{e}_{i_k}\), we obtain

$$\begin{aligned} {\mathbb {E}}_{i_k} H({\tilde{\mathbf{x}}}^{k+1}) \le H(\mathbf{y}^k)+ \nabla H(\mathbf{y}^k)^T(\mathbf{s}^{k+1}-\mathbf{y}^k)+\frac{n}{2} \Vert \mathbf{s}^{k+1}-\mathbf{y}^k\Vert _{\mathbf{L}}^2,\nonumber \\\end{aligned}$$
(7.1)

where \(\mathbf{s}^{k+1}= \mathbf{y}^k- \frac{1}{n} \mathbf{L}^{-1}\nabla H(\mathbf{y}^k).\) Define

$$\begin{aligned} \mathbf{t}^{k+1}\equiv & {} \mathbf{z}^k- \frac{1}{n\theta ^k } \mathbf{L}^{-1}\nabla H(\mathbf{y}^k)\nonumber \\= & {} \displaystyle \mathop {\mathrm{argmin}}_{\mathbf{y}} \left\{ \nabla H(\mathbf{y}^k)^T(\mathbf{y}-\mathbf{z}^k)+\frac{n\theta ^k }{2} \Vert \mathbf{y}-\mathbf{z}^k\Vert _{\mathbf{L}}^2 \right\} . \end{aligned}$$
(7.2)
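The second expression in (7.2) is the first-order optimality condition of the strongly convex quadratic minimized there: setting its gradient with respect to \(\mathbf{y}\) to zero gives

$$ \nabla H(\mathbf{y}^k) + n\theta ^k \mathbf{L}(\mathbf{y}-\mathbf{z}^k) = \mathbf{0} \quad \Longleftrightarrow \quad \mathbf{y}= \mathbf{z}^k- \frac{1}{n\theta ^k } \mathbf{L}^{-1}\nabla H(\mathbf{y}^k). $$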

Note that \(\theta ^k (\mathbf{t}^{k+1}-\mathbf{z}^k) = -\frac{1}{n}\mathbf{L}^{-1}\nabla H(\mathbf{y}^k) = \mathbf{s}^{k+1}-\mathbf{y}^k\). Thus, by (7.1) and the fact that \(H(\mathbf{x}^{k+1})\le H({\tilde{\mathbf{x}}}^{k+1})\) (step 5), it follows that

$$\begin{aligned} {\mathbb {E}}_{i_k} H(\mathbf{x}^{k+1})\le & {} {\mathbb {E}}_{i_k} H({\tilde{\mathbf{x}}}^{k+1}) \le H(\mathbf{y}^k)+ \nabla H(\mathbf{y}^k)^T(\mathbf{s}^{k+1}-\mathbf{y}^k)+\frac{n}{2} \Vert \mathbf{s}^{k+1}-\mathbf{y}^k\Vert _{\mathbf{L}}^2\nonumber \\= & {} H(\mathbf{y}^k)+\theta ^k \left[ \nabla H(\mathbf{y}^k)^T(\mathbf{t}^{k+1}-\mathbf{z}^k)+\frac{n\theta ^k }{2} \Vert \mathbf{t}^{k+1}-\mathbf{z}^k\Vert _{\mathbf{L}}^2 \right] . \end{aligned}$$
(7.3)

By Tseng’s three-points property [37, Property 1] and the relation (7.2), we have

$$\begin{aligned} \nabla H(\mathbf{y}^k)^T(\mathbf{x}^*-\mathbf{z}^k) +\frac{n\theta ^k}{2} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2- \nabla H(\mathbf{y}^k)^T(\mathbf{t}^{k+1}-\mathbf{z}^k)\nonumber \\ -\frac{n\theta ^k}{2} \Vert \mathbf{t}^{k+1}-\mathbf{z}^k\Vert _{\mathbf{L}}^2 \ge \frac{n\theta ^k}{2}\Vert \mathbf{x}^*-\mathbf{t}^{k+1}\Vert _{\mathbf{L}}^2. \end{aligned}$$
(7.4)

Combining the above with (7.3) yields

$$\begin{aligned} {\mathbb {E}}_{i_k} H(\mathbf{x}^{k+1})\le & {} H(\mathbf{y}^k)+\theta ^k \left[ \nabla H(\mathbf{y}^k)^T(\mathbf{x}^*-\mathbf{z}^k)+\frac{n\theta ^k}{2} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2 -\frac{n\theta ^k}{2} \Vert \mathbf{x}^*-\mathbf{t}^{k+1}\Vert _{\mathbf{L}}^2 \right] \nonumber \\= & {} H(\mathbf{y}^k)+\theta ^k \left[ \nabla H(\mathbf{y}^k)^T(\mathbf{x}^*-\mathbf{z}^k)+\frac{n^2\theta ^k}{2} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2 -\frac{n^2\theta ^k}{2} {\mathbb {E}}_{i_k}\Vert \mathbf{x}^*-\mathbf{z}^{k+1}\Vert _{\mathbf{L}}^2 \right] ,\nonumber \\ \end{aligned}$$
(7.5)

where the equality follows by the following argument:

$$\begin{aligned} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2-\Vert \mathbf{x}^*-\mathbf{t}^{k+1}\Vert _{\mathbf{L}}^2&= 2\langle \mathbf{t}^{k+1}-\mathbf{z}^k, \mathbf{x}^*-\mathbf{z}^k\rangle _{\mathbf{L}} -\Vert \mathbf{t}^{k+1}-\mathbf{z}^k\Vert _{\mathbf{L}}^2\\ &= 2n {\mathbb {E}}_{i_k}\langle \mathbf{z}^{k+1}-\mathbf{z}^k, \mathbf{x}^*-\mathbf{z}^k\rangle _{\mathbf{L}} -n{\mathbb {E}}_{i_k}\Vert \mathbf{z}^{k+1}-\mathbf{z}^k\Vert _{\mathbf{L}}^2 \\ &= n{\mathbb {E}}_{i_k} \left( \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2-\Vert \mathbf{x}^*-\mathbf{z}^{k+1}\Vert _{\mathbf{L}}^2\right) . \end{aligned}$$
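The second equality above holds because only the \(i_k\)-th coordinate of \(\mathbf{z}\) is updated, with \(i_k\) drawn uniformly from \(\{1,\dots ,n\}\). Writing the z-update (not reproduced on this page; the form below is reconstructed so as to be consistent with (7.2)) as

$$ \mathbf{z}^{k+1} = \mathbf{z}^k- \frac{1}{n\theta ^k L_{i_k}} \nabla _{i_k} H(\mathbf{y}^k)\mathbf{e}_{i_k}, $$

we have \(z^{k+1}_{i}-z^k_{i} = t^{k+1}_{i}-z^k_{i}\) if \(i=i_k\) and \(0\) otherwise, and therefore

$$ {\mathbb {E}}_{i_k}\langle \mathbf{z}^{k+1}-\mathbf{z}^k, \mathbf{x}^*-\mathbf{z}^k\rangle _{\mathbf{L}} = \tfrac{1}{n}\langle \mathbf{t}^{k+1}-\mathbf{z}^k, \mathbf{x}^*-\mathbf{z}^k\rangle _{\mathbf{L}}, \qquad {\mathbb {E}}_{i_k}\Vert \mathbf{z}^{k+1}-\mathbf{z}^k\Vert _{\mathbf{L}}^2 = \tfrac{1}{n}\Vert \mathbf{t}^{k+1}-\mathbf{z}^k\Vert _{\mathbf{L}}^2. $$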

Now, using the update formula in step 2, we have

$$\begin{aligned} \nabla H(\mathbf{y}^k)^T(\theta ^k\mathbf{x}^*- \theta ^k \mathbf{z}^k)= & {} \nabla H(\mathbf{y}^k)^T(\theta ^k\mathbf{x}^*-\mathbf{y}^k+(1-\theta ^k)\mathbf{x}^k) \nonumber \\= & {} \theta ^k \nabla H(\mathbf{y}^k)^T(\mathbf{x}^*-\mathbf{y}^k)+(1-\theta ^k) \nabla H(\mathbf{y}^k)^T(\mathbf{x}^k-\mathbf{y}^k).\nonumber \\ \end{aligned}$$
(7.6)
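For clarity, the gradient inequality is applied at the two points \(\mathbf{x}^*\) and \(\mathbf{x}^k\):

$$ H(\mathbf{x}^*) \ge H(\mathbf{y}^k)+\nabla H(\mathbf{y}^k)^T(\mathbf{x}^*-\mathbf{y}^k), \qquad H(\mathbf{x}^k) \ge H(\mathbf{y}^k)+\nabla H(\mathbf{y}^k)^T(\mathbf{x}^k-\mathbf{y}^k), $$

so that \(H(\mathbf{y}^k)+\theta ^k \nabla H(\mathbf{y}^k)^T(\mathbf{x}^*-\mathbf{y}^k)+(1-\theta ^k) \nabla H(\mathbf{y}^k)^T(\mathbf{x}^k-\mathbf{y}^k) \le \theta ^k H(\mathbf{x}^*)+(1-\theta ^k)H(\mathbf{x}^k)\).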

Thus, combining (7.5) and (7.6) along with the gradient inequality, the following is implied:

$$\begin{aligned} {\mathbb {E}}_{i_k} H(\mathbf{x}^{k+1}) \le (1-\theta ^k)H(\mathbf{x}^k)+\theta ^k H(\mathbf{x}^*)+\frac{n^2(\theta ^k)^2}{2} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2 -\frac{n^2 (\theta ^k)^2}{2} {\mathbb {E}}_{i_k}\Vert \mathbf{x}^*-\mathbf{z}^{k+1}\Vert _{\mathbf{L}}^2, \end{aligned}$$
(7.7)

which is the same as

$$\begin{aligned} {\mathbb {E}}_{i_k} H(\mathbf{x}^{k+1})-H(\mathbf{x}^*) \le (1-\theta ^k)(H(\mathbf{x}^k)-H(\mathbf{x}^*))+\frac{n^2(\theta ^k)^2}{2} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2 -\frac{n^2 (\theta ^k)^2}{2} {\mathbb {E}}_{i_k}\Vert \mathbf{x}^*-\mathbf{z}^{k+1}\Vert _{\mathbf{L}}^2. \end{aligned}$$
(7.8)

Taking expectation over \(\xi _{k-1}\) leads to

$$\begin{aligned} {\mathbb {E}}_{\xi _k} H(\mathbf{x}^{k+1})-H(\mathbf{x}^*) \le (1-\theta ^k)({\mathbb {E}}_{\xi _{k-1}}H(\mathbf{x}^k)-H(\mathbf{x}^*)) +\frac{n^2(\theta ^k)^2}{2} {\mathbb {E}}_{\xi _{k-1}} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2 -\frac{n^2 (\theta ^k)^2}{2} {\mathbb {E}}_{\xi _k}\Vert \mathbf{x}^*-\mathbf{z}^{k+1}\Vert _{\mathbf{L}}^2. \end{aligned}$$
(7.9)

Denoting \(e_k \equiv {\mathbb {E}}_{\xi _{k-1}} H(\mathbf{x}^{k})-H(\mathbf{x}^*)\) and \(\Delta _k \equiv \frac{n^2}{2}{\mathbb {E}}_{\xi _{k-1}} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2\), we can rewrite (7.9) as

$$ e_{k+1}\le (1-\theta ^k)e_k+(\theta ^k)^2\Delta _k-(\theta ^k)^2\Delta _{k+1}.$$

Dividing the inequality by \((\theta ^k)^2\) yields

$$ \frac{1}{(\theta ^k)^2} e_{k+1}\le \frac{1-\theta ^k}{(\theta ^k)^2}e_k+\Delta _k-\Delta _{k+1}.$$

By the definition of the sequence \(\theta ^k\) (Step 6), the above is the same as

$$ \frac{1}{(\theta ^k)^2} e_{k+1}\le \frac{1}{(\theta ^{k-1})^2}e_k+\Delta _k-\Delta _{k+1},$$

and hence,

$$ \frac{1}{(\theta ^k)^2} e_{k+1}+\Delta _{k+1}\le \frac{1}{(\theta ^{k-1})^2}e_k+\Delta _k.$$
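Spelling out the telescoping argument used in the final step: applying the last inequality repeatedly, and using its \(k=0\) instance together with \(\theta ^0=1\) (so that the \((1-\theta ^0)e_0\) term vanishes), gives

$$ \frac{1}{(\theta ^{k-1})^2} e_{k}+\Delta _{k} \le \frac{1}{(\theta ^{k-2})^2}e_{k-1}+\Delta _{k-1} \le \cdots \le \frac{1}{(\theta ^{0})^2}e_{1}+\Delta _{1} \le \Delta _0. $$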

Since \(\theta ^0=1\), it follows that \(\frac{1}{(\theta ^{k-1})^2} e_{k}\le \Delta _0\), which, combined with the facts that \(\Delta _0 = \frac{n^2}{2}\Vert \mathbf{x}^*-\mathbf{x}^0\Vert _{\mathbf{L}}^2\) and \(\theta ^k \le \frac{2}{k+2}\) (see [37]), leads to the desired result (4.3).
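For readers who wish to experiment, the following Python sketch reconstructs a generic accelerated randomized coordinate scheme from the update formulas referenced in this proof (steps 2, 4, 5 and 6, together with a z-update consistent with (7.2)), applied to a plain smooth convex quadratic \(H\). It is an illustrative sketch only, not the paper's ACGD method (which operates on the forward–backward envelope of \(f+g\)); the test problem, the choice \(\mathbf{z}^0=\mathbf{x}^0\), the way the monotone step is realized, and all names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative smooth convex test problem: H(x) = 0.5 x^T A x - b^T x  (assumption)
n = 50
B = rng.standard_normal((n, n))
A = B.T @ B + 0.1 * np.eye(n)            # positive definite Hessian
b = rng.standard_normal(n)
H = lambda v: 0.5 * v @ A @ v - b @ v
L = np.diag(A).copy()                    # coordinate Lipschitz constants of grad H
x_star = np.linalg.solve(A, b)           # minimizer, used only for monitoring

x = np.zeros(n)                          # x^0
z = x.copy()                             # assumption: z^0 = x^0
theta = 1.0                              # theta^0 = 1

for k in range(5000):
    y = (1.0 - theta) * x + theta * z    # step 2: y^k = (1 - theta^k) x^k + theta^k z^k
    i = rng.integers(n)                  # i_k uniform on {1, ..., n}
    g_i = A[i] @ y - b[i]                # nabla_{i_k} H(y^k)
    x_tilde = y.copy()
    x_tilde[i] -= g_i / L[i]             # step 4: coordinate gradient step
    z[i] -= g_i / (n * theta * L[i])     # z-update consistent with (7.2) (reconstructed)
    if H(x_tilde) <= H(x):               # monotone step 5: keep the better point, so that
        x = x_tilde                      # H(x^{k+1}) <= H(x~^{k+1}) holds in both branches
    # step 6: theta^{k+1} solves (1 - t)/t^2 = 1/(theta^k)^2
    theta = 0.5 * theta * (np.sqrt(theta**2 + 4.0) - theta)
    assert theta <= 2.0 / (k + 3) + 1e-12   # theta^{k+1} <= 2/((k+1)+2), the bound cited from [37]

print("H(x^K) - H(x^*) =", H(x) - H(x_star))
```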

About this article


Cite this article

Aberdam, A., Beck, A. An Accelerated Coordinate Gradient Descent Algorithm for Non-separable Composite Optimization. J Optim Theory Appl 193, 219–246 (2022). https://doi.org/10.1007/s10957-021-01957-1

