Abstract
Coordinate descent algorithms are popular in machine learning and large-scale data analysis due to their low-cost iterative schemes and their good practical performance. In this work, we define a monotone accelerated coordinate gradient descent-type method for problems consisting of minimizing \(f+g\), where f is quadratic and g is nonsmooth, non-separable, and has a low-complexity proximal mapping. The algorithm is enabled by the forward–backward envelope, a composite envelope that provides an exact smooth reformulation of \(f+g\). We prove that the algorithm achieves a convergence rate of \(O(1/k^{1.5})\) in terms of the original objective function, improving on current coordinate descent-type algorithms. In addition, we describe an adaptive variant of the algorithm that backtracks the spectral information and coordinate Lipschitz constants of the problem. We numerically examine our algorithms in various settings, including two-dimensional total-variation-based image inpainting problems, and demonstrate a clear performance advantage over current coordinate descent-type methods.
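For context, the forward–backward envelope admits the following standard form (cf. the Stella–Themelis–Patrinos reference below); the smoothing parameter \(\gamma \) and this exact normalization are assumptions of this sketch and need not match the paper's definition. For \(F=f+g\) and \(\gamma >0\),
\[
F_{\gamma }(\mathbf{x}) \;=\; \min _{\mathbf{u}} \Big \{ f(\mathbf{x}) + \langle \nabla f(\mathbf{x}),\, \mathbf{u}-\mathbf{x}\rangle + g(\mathbf{u}) + \tfrac{1}{2\gamma }\Vert \mathbf{u}-\mathbf{x}\Vert ^2 \Big \},
\]
whose minimizing \(\mathbf{u}\) is the proximal-gradient point \(\mathrm{prox}_{\gamma g}(\mathbf{x}-\gamma \nabla f(\mathbf{x}))\). When f is quadratic and \(\gamma \) is small enough, \(F_{\gamma }\) is continuously differentiable and has the same minimizers and optimal value as \(f+g\), which is what makes a smooth coordinate gradient scheme applicable.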
Notes
In the extreme case where \(\mathbf{M} = \mathbf{0}\), the condition is \(\mu \in (0,\infty )\).
Recall that for a nonempty closed and convex set C, \(P_C\) denotes the orthogonal projection operator.
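Explicitly, \(P_C\) maps a point to its unique nearest point in C:
\[
P_C(\mathbf{x}) = \mathop {\mathrm{argmin}}_{\mathbf{y}\in C} \Vert \mathbf{y}-\mathbf{x}\Vert _2,
\]
which is well defined since C is nonempty, closed, and convex. For instance, for a box \(C=\{\mathbf{y}: \mathbf{l}\le \mathbf{y}\le \mathbf{u}\}\), the projection is componentwise clipping, \([P_C(\mathbf{x})]_i = \min \{\max \{x_i, l_i\}, u_i\}\).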
The existence of such a Lipschitz constant is warranted by Lemma 4.1.
We describe below how to reproduce these synthetic datasets, and they are available from the authors on reasonable request.
This standard dataset is available at https://github.com/aaberdam/AdaLISTA.
References
Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16(3), 697–725 (2006)
Barbero, A., Sra, S.: Modular proximal optimization for multidimensional total-variation regularization. arXiv preprint arXiv:1411.0589 (2014)
Beck, A.: First-Order Methods in Optimization, vol. 25. SIAM (2017)
Beck, A., Pauwels, E., Sabach, S.: The cyclic block conditional gradient method for convex optimization problems. SIAM J. Optim. 25(4), 2024–2049 (2015)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM J. Optim. 22(2), 557–580 (2012)
Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM J. Optim. 23(4), 2037–2060 (2013)
Bertsekas, D.P.: Nonlinear Programming. Athena Scientific Optimization and Computation Series, 2nd edn. Athena Scientific, Belmont (1999)
Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods, vol. 23. Prentice Hall, Englewood Cliffs (1989)
Burges, C.J.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2(2), 121–167 (1998)
Combettes, P.L., Pesquet, J.C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015)
Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 57(11), 1413–1457 (2004)
Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the \(\ell _1\)-ball for learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning, pp. 272–279 (2008)
Fercoq, O., Bianchi, P.: A coordinate-descent primal-dual algorithm with large step size and possibly nonseparable functions. SIAM J. Optim. 29(1), 100–134 (2019)
Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)
Friedman, J., Hastie, T., Höfling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann. Appl. Stat. 1(2), 302–332 (2007)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)
Giselsson, P., Fält, M.: Envelope functions: unifications and further properties. J. Optim. Theory Appl. 178(3), 673–698 (2018). https://doi.org/10.1007/s10957-018-1328-z
Hanzely, F., Kovalev, D., Richtárik, P.: Variance reduced coordinate descent with acceleration: new method with a surprising application to finite-sum problems. arXiv preprint arXiv:2002.04670 (2020)
Hanzely, F., Mishchenko, K., Richtárik, P.: SEGA: Variance reduction via gradient sketching. In: Advances in Neural Information Processing Systems, vol. 31, pp. 2082–2093 (2018)
Hanzely, F., Richtárik, P.: One method to rule them all: variance reduction for data, parameters and many new methods. arXiv preprint arXiv:1905.11266 (2019)
Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press (2015)
Hong, M., Wang, X., Razaviyayn, M., Luo, Z.Q.: Iteration complexity analysis of block coordinate descent methods. Math. Program. 163(1–2), 85–114 (2017)
Johnson, N.A.: A dynamic programming algorithm for the fused Lasso and \(L_0\)-segmentation. J. Comput. Graph. Stat. 22(2), 246–260 (2013)
Kolmogorov, V., Pock, T., Rolinek, M.: Total variation on a tree. SIAM J. Imaging Sci. 9(2), 605–636 (2016)
Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P.: Block-coordinate Frank-Wolfe optimization for structural SVMs. In: Dasgupta, S., McAllester, D. (eds.) Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 53–61. PMLR (2013)
Latafat, P., Themelis, A., Patrinos, P.: Block-coordinate and incremental aggregated proximal gradient methods for nonsmooth nonconvex problems. Math. Program. 1–30 (2021)
Lu, H., Freund, R., Mirrokni, V.: Accelerating greedy coordinate descent methods. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 3257–3266. PMLR (2018)
Maculan, N., Santiago, C.P., Macambira, E., Jardim, M.: An \(O(n)\) algorithm for projecting a vector on the intersection of a hyperplane and a box in \({\mathbb {R}}^n\). J. Optim. Theory Appl. 117(3), 553–574 (2003)
Markowitz, H.: Portfolio selection. J. Finance 7(1), 77–91 (1952)
Moreau, J.J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France 93, 273–299 (1965)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Ortega, J.M., Rheinboldt, W.C.: Iterative Solution of Nonlinear Equations in Several Variables, vol. 30. SIAM (1970)
Rockafellar, R.T.: Convex Analysis. Princeton Mathematical Series, No. 28, Princeton University Press, Princeton (1970)
Stella, L., Themelis, A., Patrinos, P.: Forward–backward quasi-Newton methods for nonsmooth optimization problems. Comput. Optim. Appl. 67(3), 443–487 (2017)
Tseng, P.: On accelerated proximal gradient methods for convex–concave optimization. Unpublished manuscript (2008)
Wright, S.J.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)
Acknowledgements
The research of A. Beck is supported by the ISF Grant 926-21. A. Aberdam thanks the Azrieli foundation for providing additional research support. The authors would like to thank two anonymous reviewers for their valuable suggestions that improved the final manuscript.
Communicated by Massimo Pappalardo.
Appendix A: Proof of Theorem 4.1
Throughout the proof, we use the notation \(\mathbf{L}= \hbox {diag}(\{L_i\}_{i=1}^n)\) and denote the \(\mathbf{L}\)-norm and the \(\mathbf{L}\)-inner product by \(\Vert \mathbf{x}\Vert _{\mathbf{L}} := \sqrt{\langle \mathbf{L}\mathbf{x},\mathbf{x}\rangle }\) and \(\langle \mathbf{x},\mathbf{y}\rangle _{\mathbf{L}} := \langle \mathbf{L}\mathbf{x},\mathbf{y}\rangle \), respectively.
By the update of step 4 and the block descent lemma [3, Lemma 11.8], it follows that
Taking the expectation with respect to \(i_{k}\), and recalling that \({\tilde{\mathbf{x}}}^{k+1} = \mathbf{y}^k-\frac{1}{L_{i_k}} \nabla _{i_k} H(\mathbf{y}^k)\mathbf{e}_{i_k}\), we obtain
where \(\mathbf{s}^{k+1}= \mathbf{y}^k- \frac{1}{n} \mathbf{L}^{-1}\nabla H(\mathbf{y}^k)\). Define \(\mathbf{t}^{k+1} := \mathbf{z}^k- \frac{1}{n\theta ^k} \mathbf{L}^{-1}\nabla H(\mathbf{y}^k)\).
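The identification of \(\mathbf{s}^{k+1}\) as the conditional mean of \({\tilde{\mathbf{x}}}^{k+1}\) is a one-line computation; assuming, as the \(\frac{1}{n}\) factor suggests, that \(i_k\) is drawn uniformly from \(\{1,\dots ,n\}\),
\[
{\mathbb {E}}_{i_k}\big [{\tilde{\mathbf{x}}}^{k+1}\big ] = \frac{1}{n}\sum _{i=1}^n \Big (\mathbf{y}^k - \frac{1}{L_i}\nabla _i H(\mathbf{y}^k)\mathbf{e}_i\Big ) = \mathbf{y}^k - \frac{1}{n}\mathbf{L}^{-1}\nabla H(\mathbf{y}^k) = \mathbf{s}^{k+1}.
\]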
Obviously, \(\mathbf{s}^{k+1}-\mathbf{y}^k = \theta ^k (\mathbf{t}^{k+1}-\mathbf{z}^k)\), as both sides equal \(-\frac{1}{n} \mathbf{L}^{-1}\nabla H(\mathbf{y}^k)\). Thus, by (7.1) and the fact that \(H(\mathbf{x}^{k+1})\le H({\tilde{\mathbf{x}}}^{k+1})\) (step 5), it follows that
By Tseng’s three-points property [37, Property 1] and the relation (7.2), we have
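For the reader's convenience, one common form of the three-points property is the following (stated here with generic \(\psi \), \(\beta \), \(\mathbf{z}\), and the \(\mathbf{L}\)-norm, not necessarily in the paper's exact notation): if \(\mathbf{t}= \mathrm{argmin}_{\mathbf{x}} \{\psi (\mathbf{x}) + \frac{\beta }{2}\Vert \mathbf{x}-\mathbf{z}\Vert _{\mathbf{L}}^2\}\) for a convex function \(\psi \) and \(\beta >0\), then for every \(\mathbf{x}\),
\[
\psi (\mathbf{x}) + \frac{\beta }{2}\Vert \mathbf{x}-\mathbf{z}\Vert _{\mathbf{L}}^2 \;\ge \; \psi (\mathbf{t}) + \frac{\beta }{2}\Vert \mathbf{t}-\mathbf{z}\Vert _{\mathbf{L}}^2 + \frac{\beta }{2}\Vert \mathbf{x}-\mathbf{t}\Vert _{\mathbf{L}}^2.
\]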
Combining the above with (7.3) yields
where the equality follows from the following argument:
Now, using the update formula in step 2, we have
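The formula of step 2 is not reproduced in this excerpt; in accelerated schemes of this type it is typically the interpolation \(\mathbf{y}^k = (1-\theta ^k)\mathbf{x}^k + \theta ^k \mathbf{z}^k\) (a standard choice, assumed here for illustration rather than quoted from the paper), equivalently
\[
\theta ^k \mathbf{z}^k = \mathbf{y}^k - (1-\theta ^k)\mathbf{x}^k,
\]
which lets terms involving \(\mathbf{z}^k\) be rewritten in terms of \(\mathbf{x}^k\) and \(\mathbf{y}^k\).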
Thus, combining (7.5) and (7.6) along with the gradient inequality, we obtain
which is the same as
Taking expectation over \(\xi _{k-1}\) leads to
Denoting \(e_k \equiv {\mathbb {E}}_{\xi _{k-1}} H(\mathbf{x}^{k})-H(\mathbf{x}^*)\) and \(\Delta _k \equiv \frac{n^2}{2}{\mathbb {E}}_{\xi _{k-1}} \Vert \mathbf{x}^*-\mathbf{z}^k\Vert _{\mathbf{L}}^2\), we can rewrite (7.9) as
Dividing the inequality by \((\theta ^k)^2\) yields
By the definition of the sequence \(\theta ^k\) (Step 6), the above is the same as
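A sequence satisfying \(\theta ^0=1\) and the recursion needed for the telescoping below is the classical one (assumed here as an illustration of step 6; cf. [37]):
\[
\frac{1-\theta ^{k}}{(\theta ^{k})^{2}} = \frac{1}{(\theta ^{k-1})^{2}}, \qquad \text{i.e.,}\qquad \theta ^{k} = \frac{\sqrt{(\theta ^{k-1})^{4}+4(\theta ^{k-1})^{2}}-(\theta ^{k-1})^{2}}{2},
\]
for which the bound \(\theta ^k \le \frac{2}{k+2}\) holds by induction.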
and hence,
Since \(\theta ^0=1\), the above inequality implies that \(\frac{1}{(\theta ^{k-1})^2} e_{k}\le \Delta _0\), which, combined with the facts that \(\Delta _0 = \frac{n^2}{2}\Vert \mathbf{x}^*-\mathbf{x}^0\Vert _{\mathbf{L}}^2\) and \(\theta ^k \le \frac{2}{k+2}\) (see [37]), leads to the desired result (4.3).
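Concretely, these estimates combine to an explicit \(O(1/k^2)\) bound on the envelope values; the precise statement of (4.3) is in the main text, and the following is what the above chain yields:
\[
e_k \;=\; {\mathbb {E}}_{\xi _{k-1}}\big [H(\mathbf{x}^{k})\big ]-H(\mathbf{x}^*) \;\le \; (\theta ^{k-1})^{2}\, \Delta _0 \;\le \; \frac{4}{(k+1)^{2}}\cdot \frac{n^{2}}{2}\Vert \mathbf{x}^*-\mathbf{x}^{0}\Vert _{\mathbf{L}}^{2} \;=\; \frac{2n^{2}\Vert \mathbf{x}^*-\mathbf{x}^{0}\Vert _{\mathbf{L}}^{2}}{(k+1)^{2}}.
\]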
Cite this article
Aberdam, A., Beck, A. An Accelerated Coordinate Gradient Descent Algorithm for Non-separable Composite Optimization. J Optim Theory Appl 193, 219–246 (2022). https://doi.org/10.1007/s10957-021-01957-1