Abstract
In the Euclidean setting, the proximal gradient method and its accelerated variants are a class of efficient algorithms for optimization problems with a decomposable objective. In this paper, we develop a Riemannian proximal gradient method (RPG) and its accelerated variant (ARPG) for similar problems that are additionally constrained to a manifold. The global convergence of RPG is established under mild assumptions, and an O(1/k) convergence rate is also derived for RPG based on the notion of retraction convexity. If, in addition, the objective function satisfies the Riemannian Kurdyka–Łojasiewicz (KL) property, it is further shown that the sequence generated by RPG converges to a single stationary point. As in the Euclidean setting, a local convergence rate can be established if the objective function satisfies the Riemannian KL property with an exponent. Moreover, we show that the restriction of a semialgebraic function onto the Stiefel manifold satisfies the Riemannian KL property, which covers, for example, the well-known sparse PCA problem. Numerical experiments on random and synthetic data are conducted to test the performance of the proposed RPG and ARPG.
Notes
The commonly-used update expression is \(x_{k+1}=\arg \min _x\langle \nabla f(x_k),x-x_k\rangle _2+\frac{L}{2}\Vert x-x_k\Vert _2^2+g(x)\). We reformulate it equivalently for the convenience of the Riemannian formulation given later.
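For reference, completing the square shows that this reformulation recovers the classical proximal mapping form: \(\langle \nabla f(x_k),x-x_k\rangle _2+\frac{L}{2}\Vert x-x_k\Vert _2^2+g(x) = \frac{L}{2}\big \Vert x-\big (x_k-\frac{1}{L}\nabla f(x_k)\big )\big \Vert _2^2+g(x)-\frac{1}{2L}\Vert \nabla f(x_k)\Vert _2^2\), so that \(x_{k+1}={{\,\mathrm{prox}\,}}_{g/L}\big (x_k-\frac{1}{L}\nabla f(x_k)\big )\); for instance, when \(g = \lambda \Vert \cdot \Vert _1\) this is the soft-thresholding step.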
Such a result can be obtained by noting (i) \({{\,\mathrm{D}\,}}f(x)[\eta ] = {\left\langle P_{{{\,\mathrm{T}\,}}_x {\mathcal {M}}} \nabla f(x),\eta \right\rangle _{{{\,\mathrm{F}\,}}}} = {\left\langle {{\,\mathrm{grad}\,}}f(x),\eta \right\rangle _{x}}\) by [15, (B.2)], and (ii) there exists a constant \(\alpha > 0\) such that \(\Vert \eta \Vert _{{{\,\mathrm{F}\,}}} \le \alpha \Vert \eta \Vert _x\) for all \(x \in {\mathcal {M}}\) and \(\eta \in {{\,\mathrm{T}\,}}_x {\mathcal {M}}\), by smoothness of the Riemannian metric and compactness of \({\mathcal {M}}\).
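As a numerical illustration of (i) for an embedded submanifold, the following minimal NumPy sketch compares a finite-difference directional derivative of a test function on the Stiefel manifold with the inner product against the projected Euclidean gradient. The matrix A, the test function \(f(X) = -\frac{1}{2}{{\,\mathrm{trace}\,}}(X^T A X)\), and the use of the metric inherited from the embedding space are illustrative assumptions, not taken from the paper.

import numpy as np

# Minimal sketch: Riemannian gradient as the tangent-space projection of the
# Euclidean gradient on the Stiefel manifold St(p, n) with the embedded metric.
rng = np.random.default_rng(0)
n, p = 8, 3
A = rng.standard_normal((n, n)); A = A + A.T                 # symmetric test matrix (assumption)
X, _ = np.linalg.qr(rng.standard_normal((n, p)))             # a point on St(p, n)

def proj_tangent(X, Z):
    # orthogonal projection onto T_X St(p, n): Z - X * sym(X^T Z)
    return Z - X @ (0.5 * (X.T @ Z + Z.T @ X))

f = lambda X: -0.5 * np.trace(X.T @ A @ X)                   # test function (assumption)
egrad = -A @ X                                               # Euclidean gradient of f at X
rgrad = proj_tangent(X, egrad)                               # Riemannian gradient grad f(X)

eta = proj_tangent(X, rng.standard_normal((n, p)))           # a tangent direction
t = 1e-6
print((f(X + t * eta) - f(X)) / t)   # finite-difference approximation of D f(X)[eta]
print(np.sum(rgrad * eta))           # <grad f(X), eta>_F; the two values agree up to O(t)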
When the desingularising function has the form \(\varsigma (t) = \frac{C}{\theta } t^{\theta }\) for some \(C > 0\), \(\theta \in (0, 1]\), we say that F satisfies the Riemannian KL property with an exponent \(\theta \), as in the Euclidean case.
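As a simple Euclidean illustration (not taken from the paper), \(F(x) = x^2\) satisfies the KL property at \(x^* = 0\) with exponent \(\theta = 1/2\): choosing \(\varsigma (t) = 2\sqrt{t}\) (i.e., \(C = 1\), \(\theta = 1/2\)) gives \(\varsigma '(|F(x)-F(x^*)|)\, \Vert \nabla F(x)\Vert = |x|^{-1}\cdot 2|x| = 2 \ge 1\) for all \(x \ne 0\), which is the standard form of the KL inequality.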
References
Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2008)
Absil, P.-A., Mahony, R., Trumpf, J.: An extrinsic look at the Riemannian Hessian. In: Geometric Science of Information (2013)
Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka–Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)
Attouch, H., Bolte, J., Svaiter, B. F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program. 137, 91–129 (2013)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009). https://doi.org/10.1137/080716542
Beck, A.: First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia (2017)
Bento, G.C., da Cruz Neto, J. X., Oliveira, P.R.: Convergence of inexact descent methods for nonconvex optimization on Riemannian manifold (2011). arXiv preprint arXiv:1103.4828
Bento, G.C., Ferreira, O.P., Melo, J.G.: Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds. J. Optim. Theory Appl. 173(2), 548–562 (2017)
Bochnak, J., Coste, M., Roy, M.-F.: Real Algebraic Geometry. Springer, Berlin (1998)
Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)
Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions. SIAM J. Optim. 18(2), 556–572 (2007)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)
Boothby, W.M.: An Introduction to Differentiable Manifolds and Riemannian Geometry, 2nd edn. Academic Press, London (1986)
Boumal, N.: An introduction to optimization on smooth manifolds (2020)
Boumal, N., Absil, P.-A., Cartis, C.: Global rates of convergence for nonconvex optimization on manifolds. IMA J. Numer. Anal. 39(1), 1–33 (2018)
Chen, S., Ma, S., So, A.M.-C., Zhang, T.: Proximal gradient method for nonsmooth optimization over the Stiefel manifold. SIAM J. Optim. 30(1), 210–239 (2020)
Chen, W., Hui, J., You, Y.: An augmented Lagrangian method for \(\ell _{1}\)-regularized optimization problems with orthogonality constraints. SIAM J. Sci. Comput. 38(4), B570–B592 (2016)
Daniilidis, A., Deville, R., Durand-Cartagena, E., Rifford, L.: Self-contracted curves in Riemannian manifolds. J. Math. Anal. Appl. 457, 1333–1352 (2018)
Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
de Carvalho Bento, G., Bitar, S.D.B., da Cruz Neto, J.X., Oliveira, P.R., de Oliveira Souza, J.C.: Computing Riemannian center of mass on Hadamard manifolds. J. Optim. Theory Appl. 183, 977–992 (2019)
do Carmo, M.P.: Riemannian Geometry. Mathematics: Theory & Applications. Birkhäuser, Boston (1992)
Ferreira, O.P., Oliveira, P.R.: Proximal point algorithm on Riemannian manifolds. Optimization 51(2), 257–270 (2002)
Genicot, M., Huang, W., Trendafilov, N.T.: Weakly correlated sparse components with nearly orthonormal loadings. In: Geometric Science of Information, pp. 484–490 (2015)
Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156, 59–99 (2016)
Grohs, P., Hosseini, S.: \(\epsilon \)-subgradient algorithms for locally Lipschitz functions on Riemannian manifolds. Adv. Comput. Math. (2015). https://doi.org/10.1007/s10444-015-9426-z
Grohs, P., Hosseini, S.: Nonsmooth trust region algorithms for locally Lipschitz functions on Riemannian manifolds. IMA J. Numer. Anal. (2015). https://doi.org/10.1093/imanum/drv043
Hosseini, S.: Convergence of nonsmooth descent methods via Kurdyka–Łojasiewicz inequality on Riemannian manifolds (2017). INS Preprint No. 1523
Hosseini, S., Huang, W., Yousefpour, R.: Line search algorithms for locally Lipschitz functions on Riemannian manifolds. SIAM J. Optim. 28(1), 596–619 (2018)
Hosseini, S., Pouryayevali, M.R.: Generalized gradients and characterization of epi-Lipschitz sets in Riemannian manifolds. Nonlinear Anal. Theory Methods Appl. 74(12), 3884–3895 (2011)
Hosseini, S., Uschmajew, A.: A Riemannian gradient sampling algorithm for nonsmooth optimization on manifolds. SIAM J. Optim. 27(1), 173–189 (2017)
Huang, W.: Optimization algorithms on Riemannian manifolds with applications. PhD thesis, Florida State University, Department of Mathematics (2013)
Huang, W., Gallivan, K.A., Absil, P.-A.: A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM J. Optim. 25(3), 1660–1685 (2015)
Huang, W., Wei, K.: Extending FISTA to Riemannian optimization for sparse PCA (2019). arXiv:1909.05485
Huang, W., Wei, K.: Riemannian proximal gradient methods (extended version) (2019). arXiv:1909.06065
Jolliffe, I.T., Trendafilov, N.T., Uddin, M.: A modified principal component technique based on the Lasso. J. Comput. Graph. Stat. 12(3), 531–547 (2003)
Kurdyka, K., Mostowski, T., Parusiński, A.: Proof of the gradient conjecture of R. Thom. Ann. Math. 152, 763–792 (2000)
Lageman, C.: Convergence of gradient-like dynamical systems and optimization algorithms. PhD thesis, Universität Würzburg (2007)
Lai, R., Osher, S.: A splitting method for orthogonality constrained problems. J. Sci. Comput. 58(2), 431–449 (2014)
Lee, J.M.: Introduction to Riemannian Manifolds. Volume 176 of Graduate Texts in Mathematics, 2nd edn. Springer, Berlin (2018)
Li, H., Lin, Z.: Accelerated proximal gradient methods for nonconvex programming. In: International Conference on Neural Information Processing Systems (2015)
Liu, Y., Shang, F., Cheng, J., Cheng, H., Jiao, L.: Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 4868–4877. Curran Associates Inc, Red Hook (2017)
Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \({O}(1/k^{2})\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983). (in Russian)
Sjöstrand, K., Clemmensen, L., Larsen, R., Einarsson, G., Ersbøll, B.: SpaSM: a MATLAB toolbox for sparse statistical modeling. J. Stat. Softw. 84(10), 1–37 (2018)
Srivastava, A., Klassen, E.P.: Functional and Shape Data Analysis. Springer, New York (2016)
Tang, J., Liu, H.: Unsupervised feature selection for linked social media data. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 904–912 (2012)
Zhang, H., Sra, S.: First-order methods for geodesically convex optimization. In: Conference on Learning Theory (2016)
Zhang, Y., Lau, Y., Kuo, H.-W., Cheung, S., Pasupathy, A., Wright, J.: On the global geometry of sphere-constrained sparse blind deconvolution. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Acknowledgements
The authors would like to thank Zirui Zhou for fruitful discussions on the KL property, and thank Shiqian Ma for kindly sharing their codes with us.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Authors are listed alphabetically. WH was partially supported by the Fundamental Research Funds for the Central Universities (No. 20720190060) and the National Natural Science Foundation of China (No. 12001455). KW was partially supported by the NSFC Grant 11801088 and the Shanghai Sailing Program 18YF1401600.
Proofs of Lemmas 6 and 7
1.1 Proof of Lemma 6
Proof
Since R is smooth and therefore \(C^2\), the mapping \(m: {{\,\mathrm{T}\,}}{\mathcal {M}} \times {\mathbb {R}} \rightarrow {{\,\mathrm{T}\,}}{\mathcal {M}}: (\eta , t) \mapsto \frac{D}{d t} \frac{d}{d t} R\left( t \eta \right) \) is continuous, where \(\frac{D}{d t}\) denotes the covariant derivative along the curve \(t \mapsto R(t \eta )\); see, e.g., [21, Proposition 2.2] for the definition of the covariant derivative. In addition, since the set \({\mathcal {D}} = \{ (\eta _x, t) \mid x \in {\bar{\varOmega }}, \Vert \eta _x\Vert _x = 1, 0 \le t \le \delta _T \}\) is compact, there exists a positive constant \(b_2\) such that
for all \((\eta , t) \in {\mathcal {D}}\).
If \(\eta _x = 0_x\), then the conclusion holds trivially. Otherwise, let \({\tilde{\eta }}_x = \eta _x / \Vert \eta _x\Vert _x\). Since \({{\,\mathrm{dist}\,}}(x, y)\) is the infimum of the lengths of curves connecting x and y, we have
where the right-hand side is the length of the curve \(t \mapsto R_x(t \eta _x)\). Using the Cauchy–Schwarz inequality and the compatibility of the Riemannian metric with the affine connection, we have
It follows that
where \(b_3 = 1 + b_2 \delta _T / 2\). Combining (A.2) and (A.3) yields the result. \(\square \)
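In more detail, the length bound can be sketched as follows: writing \(c(t) = R_x(t {\tilde{\eta }}_x)\), the compatibility of the metric with the connection and the Cauchy–Schwarz inequality give \(\frac{d}{dt} \Vert c'(t)\Vert _{c(t)}^2 = 2 \left\langle \frac{D}{dt} c'(t), c'(t)\right\rangle _{c(t)} \le 2 b_2 \Vert c'(t)\Vert _{c(t)}\), hence \(\Vert c'(t)\Vert _{c(t)} \le \Vert c'(0)\Vert _x + b_2 t = 1 + b_2 t\). Integrating over \(t \in [0, \Vert \eta _x\Vert _x]\) yields \({{\,\mathrm{dist}\,}}(x, R_x(\eta _x)) \le \Vert \eta _x\Vert _x + \frac{b_2}{2} \Vert \eta _x\Vert _x^2 \le \left( 1 + \frac{b_2 \delta _T}{2}\right) \Vert \eta _x\Vert _x = b_3 \Vert \eta _x\Vert _x\) whenever \(\Vert \eta _x\Vert _x \le \delta _T\).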
1.2 Proof of Lemma 7
Proof
For any \(x \in {\bar{\varOmega }}\), there exist a positive constant \(\varrho _x\) and a neighborhood \({\mathcal {U}}_x\) of x such that \({\mathcal {U}}_x\) is a totally restrictive set with respect to \(\varrho _x\). Since \({\bar{\varOmega }}\) is compact, there exist finitely many points \(x_i\) whose totally restrictive sets cover \({\bar{\varOmega }}\), i.e., \(\cup _{i = 1}^t {\mathcal {U}}_{x_i} \supset {\bar{\varOmega }}\). Let \(\delta = \frac{1}{2} \min (\varrho _{x_i}, i = 1, \ldots , t)\). Then for any \(x \in {\bar{\varOmega }}\), the retraction R is a diffeomorphism on \({\mathbb {B}}(x, 2 \delta )\). Therefore, \({\mathcal {T}}_{R_{\eta _x}}^{\sharp }\) is invertible for any \(\eta _x\) satisfying \(\Vert \eta _x\Vert _x < 2 \delta \).
Since \({\mathcal {T}}_{R_{\eta _x}}^{-\sharp }\) is smooth with respect to \(\eta _x\) and the set \(\{\eta _x \mid x \in {{\bar{\varOmega }}}, \Vert \eta _x\Vert \le \delta \}\) is compact, there exists a constant \(L_t > 0\) such that
By Lemma 6, there exists a positive constant \(\kappa \) such that
for all \(x \in {\bar{\varOmega }}\) and for all \(\eta _x \in {\mathcal {B}}(0_x, \delta )\). Let \({\tilde{\delta }} = \min (\delta , i({\bar{\varOmega }}) / \kappa )\). For all \(\eta _x \in {\mathcal {B}}(0_x, {\tilde{\delta }})\) it holds that
By the definition of locally Lipschitz continuity of a vector field, we have \(\Vert {\mathcal {P}}_{\gamma }^{0 \leftarrow 1} \xi _y - \xi _x\Vert _x \le L_v {{\,\mathrm{dist}\,}}(x, y)\) for any \(x, y \in {{\bar{\varOmega }}}\) and \({{\,\mathrm{dist}\,}}(x, y) < i({\bar{\varOmega }})\). Since the parallel translation is isometric, it holds that \(\Vert \xi _y - {\mathcal {P}}_{\gamma }^{1 \leftarrow 0} \xi _x\Vert _y \le L_v {{\,\mathrm{dist}\,}}(x, y)\). Using (A.5) and (A.6) yields
for all \(\eta _x \in {\mathcal {B}}(0_x, {\tilde{\delta }})\), where \(y = R_x(\eta _x)\).
By [32, Lemma 3.5], for any \({\bar{x}} \in {\mathcal {M}}\), there exists a neighborhood \({\mathcal {U}}_{{\bar{x}}}\) of \({\bar{x}}\) and a positive number \(L_{{\bar{x}}}\) such that for all \(x, y \in {\mathcal {U}}_{{\bar{x}}}\) it holds that
Since \({{\bar{\varOmega }}}\) is compact, there exist finitely many points \({\bar{x}}_1, \ldots , {\bar{x}}_t\) such that \(\cup _{i=1}^t {\mathcal {U}}_{{\bar{x}}_i} \supset {{\bar{\varOmega }}}\). Let \(L_{cc} = \max (L_{{\bar{x}}_i}, i = 1, \ldots , t)\) and \(\sigma = \sup \left\{ r > 0 \mid \forall z \in {{\bar{\varOmega }}}, \; \exists i \hbox { such that } {\mathbb {B}}(z, r) \subseteq {\mathcal {U}}_{{\bar{x}}_i} \right\} \). Since there are finitely many \({\bar{x}}_i\), we have \(L_{cc} < \infty \), and \(\sigma > 0\) by the compactness of \({{\bar{\varOmega }}}\) (Lebesgue number lemma). Therefore, for any \(x, y \in {{\bar{\varOmega }}}\) satisfying \({{\,\mathrm{dist}\,}}(x, y) < \sigma \), it holds that
Note that \(\Vert \eta _x\Vert _x < \sigma / \kappa \) implies \({{\,\mathrm{dist}\,}}(x, y) < \sigma \) by (A.5). It follows from (A.4), (A.7) and (A.8) that for any \(x, y \in {{\bar{\varOmega }}}\) satisfying \(\Vert \eta _x\Vert _x < \min (\sigma / \kappa , {\tilde{\delta }})\),
where \(L_{c} = L_v \kappa + L_{cc} \sup _{x \in {{\bar{\varOmega }}}} \Vert \xi _x\Vert _x + a L_t\).\(\square \)
Cite this article
Huang, W., Wei, K. Riemannian proximal gradient methods. Math. Program. 194, 371–413 (2022). https://doi.org/10.1007/s10107-021-01632-3