Riemannian proximal gradient methods

  • Full Length Paper
  • Series A
  • Published in: Mathematical Programming

Abstract

In the Euclidean setting, the proximal gradient method and its accelerated variants are a class of efficient algorithms for optimization problems with a decomposable objective. In this paper, we develop a Riemannian proximal gradient method (RPG) and its accelerated variant (ARPG) for similar problems but constrained on a manifold. The global convergence of RPG is established under mild assumptions, and an O(1/k) convergence rate is also derived for RPG based on the notion of retraction convexity. If the objective function satisfies the Riemannian Kurdyka–Łojasiewicz (KL) property, it is further shown that the sequence generated by RPG converges to a single stationary point. As in the Euclidean setting, a local convergence rate can be established if the objective function satisfies the Riemannian KL property with an exponent. Moreover, we show that the restriction of a semialgebraic function onto the Stiefel manifold satisfies the Riemannian KL property, which covers, for example, the well-known sparse PCA problem. Numerical experiments on random and synthetic data are conducted to test the performance of the proposed RPG and ARPG.

Notes

  1. The commonly used update expression is \(x_{k+1}=\arg \min _x\langle \nabla f(x_k),x-x_k\rangle _2+\frac{L}{2}\Vert x-x_k\Vert _2^2+g(x)\). We reformulate it equivalently for the convenience of the Riemannian formulation given later; a minimal code sketch of this Euclidean step is given after these notes.

  2. Such a result can be obtained by noting that (i) \({{\,\mathrm{D}\,}}f(x)[\eta ] = {\left\langle P_{{{\,\mathrm{T}\,}}_x {\mathcal {M}}} \nabla f(x),\eta \right\rangle _{{{\,\mathrm{F}\,}}}} = {\left\langle {{\,\mathrm{grad}\,}}f(x),\eta \right\rangle _{x}}\) by [15, (B.2)], and (ii) there exists a constant \(\alpha > 0\) such that \(\Vert \eta \Vert _{{{\,\mathrm{F}\,}}} \le \alpha \Vert \eta \Vert _x\) for all \(x \in {\mathcal {M}}\) and \(\eta \in {{\,\mathrm{T}\,}}_x {\mathcal {M}}\), by the smoothness of the Riemannian metric and the compactness of \({\mathcal {M}}\).

  3. The right-hand side of (3.12) can be taken to be \(\kappa _{\varOmega } \min (\Vert \eta _x\Vert _x^2, \Vert \xi _x\Vert _x^2) \Vert \zeta _y\Vert _y^2\). We use the form in (3.12) for simplicity.

  4. When the desingularising function has the form \(\varsigma (t) = \frac{C}{\theta } t^{\theta }\) for some \(C > 0\), \(\theta \in (0, 1]\), we say that F satisfies the Riemannian KL property with an exponent \(\theta \), as in the Euclidean case.
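
As a concrete illustration of the update in Note 1, the following minimal Python sketch (not part of the paper; the least-squares choice of \(f\) and all function names are assumptions made for the example) implements the Euclidean proximal gradient step for \(g(x) = \lambda \Vert x\Vert _1\), in which case the minimizer is given by soft-thresholding.

```python
import numpy as np

def prox_l1(z, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_gradient_step(x, grad_f, L, lam):
    """One Euclidean proximal gradient step for f + lam*||.||_1:
    x_{k+1} = argmin_x <grad f(x_k), x - x_k> + (L/2)||x - x_k||^2 + lam*||x||_1,
    whose closed-form solution is prox_{(lam/L)||.||_1}(x_k - grad_f(x_k)/L)."""
    return prox_l1(x - grad_f(x) / L, lam / L)

# Toy instance: f(x) = 0.5*||A x - b||^2, so grad f(x) = A^T (A x - b) and L = ||A||_2^2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 50)), rng.standard_normal(20)
L = np.linalg.norm(A, 2) ** 2
x = np.zeros(50)
for _ in range(200):
    x = proximal_gradient_step(x, lambda v: A.T @ (A @ v - b), L, lam=0.1)
```

Roughly speaking, the Riemannian method developed in the paper poses an analogous subproblem over the tangent space at the current iterate and retracts the minimizer back to the manifold, which is why the reformulated expression in Note 1 is more convenient.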

References

  1. Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2008)

  2. Absil, P.-A., Mahony, R., Trumpf, J.: An extrinsic look at the Riemannian Hessian. In: Geometric Science of Information (2013)

  3. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka–Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)

  4. Attouch, H., Bolte, J., Svaiter, B. F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program. 137, 91–129 (2013)

  5. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009). https://doi.org/10.1137/080716542

  6. Beck, A.: First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia (2017)

  7. Bento, G.C., da Cruz Neto, J. X., Oliveira, P.R.: Convergence of inexact descent methods for nonconvex optimization on Riemannian manifold (2011). arXiv preprint arXiv:1103.4828

  8. Bento, G.C., Ferreira, O.P., Melo, J.G.: Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds. J. Optim. Theory Appl. 173(2), 548–562 (2017)

  9. Bochnak, J., Coste, M., Roy, M.-F.: Real Algebraic Geometry. Springer, Berlin (1998)

  10. Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)

  11. Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions. SIAM J. Optim. 18(2), 556–572 (2007)

  12. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)

  13. Boothby, W.M.: An Introduction to Differentiable Manifolds and Riemannian Geometry, 2nd edn. Academic Press, London (1986)

  14. Boumal, N.: An introduction to optimization on smooth manifolds (2020)

  15. Boumal, N., Absil, P.-A., Cartis, C.: Global rates of convergence for nonconvex optimization on manifolds. IMA J. Numer. Anal. 39(1), 1–33 (2018)

  16. Chen, S., Ma, S., So, A.M.-C., Zhang, T.: Proximal gradient method for nonsmooth optimization over the Stiefel manifold. SIAM J. Optim. 30(1), 210–239 (2020)

  17. Chen, W., Ji, H., You, Y.: An augmented Lagrangian method for \(\ell _{1}\)-regularized optimization problems with orthogonality constraints. SIAM J. Sci. Comput. 38(4), B570–B592 (2016)

  18. Daniilidis, A., Deville, R., Durand-Cartagena, E., Rifford, L.: Self-contracted curves in Riemannian manifolds. J. Math. Anal. Appl. 457, 1333–1352 (2018)

  19. Darzentas, J.: Problem Complexity and Method Efficiency in Optimization (1983)

  20. de Carvalho Bento, G., Bitar, S.D.B., da Cruz Neto, J.X., Oliveira, P.R., de Oliveira Souza, J.C.: Computing Riemannian center of mass on Hadamard manifolds. J. Optim. Theory Appl. 183, 977–992 (2019)

  21. do Carmo, M.P.: Riemannian Geometry. Mathematics: Theory & Applications. Birkhäuser, Boston (1992)

  22. Ferreira, O.P., Oliveira, P.R.: Proximal point algorithm on Riemannian manifolds. Optimization 51(2), 257–270 (2002)

  23. Genicot, M., Huang, W., Trendafilov, N.T.: Weakly correlated sparse components with nearly orthonormal loadings. In: Geometric Science of Information, pp. 484–490 (2015)

  24. Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156, 59–99 (2016)

  25. Grohs, P., Hosseini, S.: \(\epsilon \)-subgradient algorithms for locally Lipschitz functions on Riemannian manifolds. Adv. Comput. Math. (2015). https://doi.org/10.1007/s10444-015-9426-z

  26. Grohs, P., Hosseini, S.: Nonsmooth trust region algorithms for locally Lipschitz functions on Riemannian manifolds. IMA J. Numer. Anal. (2015). https://doi.org/10.1093/imanum/drv043

  27. Hosseini, S.: Convergence of nonsmooth descent methods via Kurdyka–Łojasiewicz inequality on Riemannian manifolds (2017). INS Preprint No. 1523

  28. Hosseini, S., Huang, W., Yousefpour, R.: Line search algorithms for locally Lipschitz functions on Riemannian manifolds. SIAM J. Optim. 28(1), 596–619 (2018)

  29. Hosseini, S., Pouryayevali, M.R.: Generalized gradients and characterization of epi-Lipschitz sets in Riemannian manifolds. Nonlinear Anal. Theory Methods Appl. 74(12), 3884–3895 (2011)

  30. Hosseini, S., Uschmajew, A.: A Riemannian gradient sampling algorithm for nonsmooth optimization on manifolds. SIAM J. Optim. 27(1), 173–189 (2017)

  31. Huang, W.: Optimization algorithms on Riemannian manifolds with applications. PhD thesis, Florida State University, Department of Mathematics (2013)

  32. Huang, W., Gallivan, K.A., Absil, P.-A.: A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM J. Optim. 25(3), 1660–1685 (2015)

  33. Huang, W., Wei, K.: Extending FISTA to Riemannian optimization for sparse PCA (2019). arXiv:1909.05485

  34. Huang, W., Wei, K.: Riemannian proximal gradient methods (extended version) (2019). arXiv:1909.06065

  35. Jolliffe, I.T., Trendafilov, N.T., Uddin, M.: A modified principal component technique based on the Lasso. J. Comput. Graph. Stat. 12(3), 531–547 (2003)

  36. Kurdyka, K., Mostowski, T., Parusiński, A.: Proof of the gradient conjecture of R. Thom. Ann. Math. 152, 763–792 (2000)

  37. Lageman, C.: Convergence of gradient-like dynamical systems and optimization algorithms. PhD thesis, Universität Würzburg (2007)

  38. Lai, R., Osher, S.: A splitting method for orthogonality constrained problems. J. Sci. Comput. 58(2), 431–449 (2014)

  39. Lee, J.M.: Introduction to Riemannian Manifolds. Volume 176 of Graduate Texts in Mathematics, 2nd edn. Springer, Berlin (2018)

  40. Li, H., Lin, Z.: Accelerated proximal gradient methods for nonconvex programming. In: International Conference on Neural Information Processing Systems (2015)

  41. Liu, Y., Shang, F., Cheng, J., Cheng, H., Jiao, L.: Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 4868–4877. Curran Associates Inc, Red Hook (2017)

  42. Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \({O}(1/k^{2})\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983). (in Russian)

  43. Sjöstrand, K., Clemmensen, L., Larsen, R., Einarsson, G., Ersbøll, B.: SpaSM: a MATLAB toolbox for sparse statistical modeling. J. Stat. Softw. 84(10), 1–37 (2018)

  44. Srivastava, A., Klassen, E.P.: Functional and Shape Data Analysis. Springer, New York (2016)

  45. Tang, J., Liu, H.: Unsupervised feature selection for linked social media data. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 904–912 (2012)

  46. Zhang, H., Sra, S.: First-order methods for geodesically convex optimization. In: Conference on Learning Theory (2016)

  47. Zhang, Y., Lau, Y., Kuo, H.-W., Cheung, S., Pasupathy, A., Wright, J.: On the global geometry of sphere-constrained sparse blind deconvolution. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

Acknowledgements

The authors would like to thank Zirui Zhou for fruitful discussions on the KL property, and thank Shiqian Ma for kindly sharing their codes with us.

Author information

Correspondence to Wen Huang or Ke Wei.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Authors are listed alphabetically. WH was partially supported by the Fundamental Research Funds for the Central Universities (No. 20720190060) and the National Natural Science Foundation of China (No. 12001455). KW was partially supported by the NSFC Grant 11801088 and the Shanghai Sailing Program 18YF1401600.

Proofs of Lemmas 6 and 7

1.1 Proof of Lemma 6

Proof

Since R is smooth and therefore \(C^2\), the mapping \(m: {{\,\mathrm{T}\,}}{\mathcal {M}} \times {\mathbb {R}} \rightarrow {{\,\mathrm{T}\,}}{\mathcal {M}}: (\eta , t) \mapsto \frac{D}{d t} \frac{d}{d t} R\left( t \eta \right) \) is continuous, where \(\frac{D}{d t}\) denotes the covariant derivative along the curve \(t \mapsto R(t \eta )\); see, e.g., [21, Proposition 2.2] for the definition of the covariant derivative. In addition, since the set \({\mathcal {D}} = \{ (\eta _x, t) \mid x \in {\bar{\varOmega }}, \Vert \eta _x\Vert _x = 1, 0 \le t \le \delta _T \}\) is compact, there exists a positive constant \(b_2\) such that

$$\begin{aligned} \Vert m(\eta , t)\Vert \le b_2 \end{aligned}$$
(A.1)

for all \((\eta , t) \in {\mathcal {D}}\).

If \(\eta _x = 0_x\), then the conclusion holds trivially. Otherwise, let \({\tilde{\eta }}_x = \eta _x / \Vert \eta _x\Vert _x\). Since \({{\,\mathrm{dist}\,}}(x, y)\) is the infimum of the lengths of curves connecting x and y, we have

$$\begin{aligned} {{\,\mathrm{dist}\,}}(x, y) \le \int _0^{\Vert \eta _x\Vert _x} \left\| \frac{d}{d t} R_{x} \left( t {\tilde{\eta }}_x\right) \right\| _{R_x(t {\tilde{\eta }}_x)} d t, \end{aligned}$$
(A.2)

where the right-hand side is the length of the curve \(t \mapsto R_x(t {\tilde{\eta }}_x)\), \(t \in [0, \Vert \eta _x\Vert _x]\). Using the Cauchy–Schwarz inequality and the compatibility of the Riemannian metric with the affine connection, we have

$$\begin{aligned} \left| \frac{d}{d t} \left\| \frac{d}{d t} R_{x} (t {\tilde{\eta }}_x)\right\| \right|&= \left| \frac{d}{d t} \sqrt{ {\left\langle \frac{d}{d t} R_{x} (t {\tilde{\eta }}_x),\frac{d}{d t} R_{x} (t {\tilde{\eta }}_x) \right\rangle _{}} } \right| = \left| \frac{ {\left\langle \frac{D}{d t} \frac{d}{d t} R_{x} (t {\tilde{\eta }}_x) , \frac{d}{d t} R_{x} (t {\tilde{\eta }}_x) \right\rangle _{}} }{ \left\| \frac{d}{d t} R_{x} (t {\tilde{\eta }}_x) \right\| } \right| \\&\le \left\| \frac{D}{d t} \frac{d}{d t} R_{x} (t {\tilde{\eta }}_x) \right\| \le b_2. \qquad ({\mathrm{by}}~(A.1)) \end{aligned}$$

Since \(\left\| \frac{d}{d t} R_{x} (t {\tilde{\eta }}_x)\right\| = \Vert {\tilde{\eta }}_x\Vert _x = 1\) at \(t = 0\), it follows that \(\left\| \frac{d}{d t} R_{x} (t {\tilde{\eta }}_x)\right\| \le 1 + b_2 t\), and therefore

$$\begin{aligned} \int _0^{\Vert \eta _x\Vert _x} \left\| \frac{d}{d t} R_{x} (t {\tilde{\eta }}_x)\right\| _{R_x(t {\tilde{\eta }}_x)} d t&\le \int _0^{\Vert \eta _x\Vert _x} (1 + b_2 t) \, d t \nonumber \\&= \Vert \eta _x\Vert _x + \frac{b_2}{2} \Vert \eta _x\Vert _x^2 \le b_3 \Vert \eta _x\Vert _x, \end{aligned}$$
(A.3)

where \(b_3 = 1 + b_2 \delta _T / 2\). Combining (A.2) and (A.3) yields the result. \(\square \)
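
As a sanity check of the bound just obtained, the following minimal Python sketch (an illustration under assumptions, not part of the paper) takes \({\mathcal {M}}\) to be the unit sphere with the projective retraction \(R_x(\eta ) = (x + \eta )/\Vert x + \eta \Vert \), for which the geodesic distance is \(\arccos \langle x, R_x(\eta )\rangle \), and verifies numerically that \({{\,\mathrm{dist}\,}}(x, R_x(\eta _x)) / \Vert \eta _x\Vert _x\) stays bounded (in fact by 1 in this special case, since the ratio equals \(\arctan (\Vert \eta _x\Vert _x)/\Vert \eta _x\Vert _x\)).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 1000
ratios = []
for _ in range(trials):
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)                                # point on the unit sphere
    eta = rng.standard_normal(n)
    eta -= (eta @ x) * x                                  # project onto the tangent space at x
    eta *= rng.uniform(1e-3, 0.5) / np.linalg.norm(eta)   # small nonzero tangent vector
    y = (x + eta) / np.linalg.norm(x + eta)               # projective retraction R_x(eta)
    dist = np.arccos(np.clip(x @ y, -1.0, 1.0))           # geodesic distance on the sphere
    ratios.append(dist / np.linalg.norm(eta))
print(max(ratios))  # stays <= 1, consistent with dist(x, R_x(eta)) <= b_3 ||eta||_x
```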

1.2 Proof of Lemma 7

Proof

For any \(x \in {\bar{\varOmega }}\), there exist a positive constant \(\varrho _x\) and a neighborhood \({\mathcal {U}}_x\) of x such that \({\mathcal {U}}_x\) is a totally restrictive set with respect to \(\varrho _x\). Since \({\bar{\varOmega }}\) is compact, there exist finitely many points \(x_i\) whose totally restrictive sets cover \({\bar{\varOmega }}\), i.e., \(\cup _{i = 1}^t {\mathcal {U}}_{x_i} \supset {\bar{\varOmega }}\). Let \(\delta = \frac{1}{2} \min (\varrho _{x_i}, i = 1, \ldots , t)\). It follows that for any \(x \in {\bar{\varOmega }}\), the retraction R is a diffeomorphism on \({\mathbb {B}}(x, 2 \delta )\). Therefore, \({\mathcal {T}}_{R_{\eta _x}}^{\sharp }\) is invertible for any \(\eta _x\) satisfying \(\Vert \eta _x\Vert _x < 2 \delta \).

Since \({\mathcal {T}}_{R_{\eta _x}}^{-\sharp }\) is smooth with respect to \(\eta _x\) and the set \(\{\eta _x \mid x \in {{\bar{\varOmega }}}, \Vert \eta _x\Vert \le \delta \}\) is compact, there exists a constant \(L_t > 0\) such that

$$\begin{aligned} \Vert {\mathcal {T}}_{R_{\eta _x}}^{-\sharp }\Vert \le L_t, \forall \eta _x \in \{\eta _x \mid x \in {{\bar{\varOmega }}}, \Vert \eta _x\Vert \le \delta \}. \end{aligned}$$
(A.4)

By Lemma 6, there exists a positive constant \(\kappa \) such that

$$\begin{aligned} {{\,\mathrm{dist}\,}}(x, R_x(\eta _x)) \le \kappa \Vert \eta _x\Vert _x \end{aligned}$$
(A.5)

for all \(x \in {\bar{\varOmega }}\) and for all \(\eta _x \in {\mathcal {B}}(0_x, \delta )\). Let \({\tilde{\delta }} = \min (\delta , i({\bar{\varOmega }}) / \kappa )\). For all \(\eta _x \in {\mathcal {B}}(0_x, {\tilde{\delta }})\) it holds that

$$\begin{aligned} {{\,\mathrm{dist}\,}}(x, R_x(\eta _x)) \le \kappa \Vert \eta _x\Vert _x \le i({\bar{\varOmega }}). \end{aligned}$$
(A.6)

By the definition of local Lipschitz continuity of a vector field, we have \(\Vert {\mathcal {P}}_{\gamma }^{0 \leftarrow 1} \xi _y - \xi _x\Vert _x \le L_v {{\,\mathrm{dist}\,}}(x, y)\) for any \(x, y \in {{\bar{\varOmega }}}\) with \({{\,\mathrm{dist}\,}}(x, y) < i({\bar{\varOmega }})\). Since parallel translation is an isometry, it holds that \(\Vert \xi _y - {\mathcal {P}}_{\gamma }^{1 \leftarrow 0} \xi _x\Vert _y \le L_v {{\,\mathrm{dist}\,}}(x, y)\). Using (A.5) and (A.6) yields

$$\begin{aligned} \Vert \xi _y - {\mathcal {P}}_{\gamma }^{1 \leftarrow 0} \xi _x\Vert _y \le L_v {{\,\mathrm{dist}\,}}(x, y) \le L_v \kappa \Vert \eta _x\Vert _x, \end{aligned}$$
(A.7)

for all \(\eta _x \in {\mathcal {B}}(0_x, {\tilde{\delta }})\), where \(y = R_x(\eta _x)\).

By [32, Lemma 3.5], for any \({\bar{x}} \in {\mathcal {M}}\), there exist a neighborhood \({\mathcal {U}}_{{\bar{x}}}\) of \({\bar{x}}\) and a positive number \(L_{{\bar{x}}}\) such that for all \(x, y \in {\mathcal {U}}_{{\bar{x}}}\) it holds that

$$\begin{aligned} \Vert {\mathcal {P}}_{\gamma }^{1 \leftarrow 0} \xi _x - {\mathcal {T}}_{\eta _x}^{-\sharp } \xi _x\Vert _y \le L_{{\bar{x}}} \Vert \xi _x\Vert _x \Vert \eta _x\Vert _x. \end{aligned}$$

Since \({{\bar{\varOmega }}}\) is compact, there exist finitely many points \({\bar{x}}_1, \ldots , {\bar{x}}_t\) such that \(\cup _{i=1}^t {\mathcal {U}}_{{\bar{x}}_i} \supset {{\bar{\varOmega }}}\). Let \(L_{cc} = \max (L_{{\bar{x}}_i}, i = 1, \ldots , t)\), and let \(\sigma = \sup \left\{ r > 0 \mid \forall z \in {{\bar{\varOmega }}},\ \exists i \hbox { such that } {\mathbb {B}}(z, r) \subseteq {\mathcal {U}}_{{\bar{x}}_i} \right\} \). Since the number of \({\bar{x}}_i\) is finite, we have \(L_{cc} < \infty \), and \(\sigma > 0\) by the Lebesgue number lemma. Therefore, for any \(x, y \in {{\bar{\varOmega }}}\) satisfying \({{\,\mathrm{dist}\,}}(x, y) < \sigma \), it holds that

$$\begin{aligned} \Vert {\mathcal {P}}_{\gamma }^{1 \rightarrow 0} \xi _x - {\mathcal {T}}_{\eta _x}^{-\sharp } \xi _x\Vert _y \le L_{cc} \Vert \xi _x\Vert _x \Vert \eta _x\Vert _x. \end{aligned}$$
(A.8)

Note that \(\Vert \eta _x\Vert _x < \sigma / \kappa \) implies \({{\,\mathrm{dist}\,}}(x, y) < \sigma \) by (A.5). It follows from (A.4), (A.7) and (A.8) that for any \(x, y \in {{\bar{\varOmega }}}\) with \(y = R_x(\eta _x)\) and \(\Vert \eta _x\Vert _x < \min (\sigma / \kappa , {\tilde{\delta }})\),

$$\begin{aligned} \Vert \xi _y - {\mathcal {T}}_{\eta _x}^{-\sharp } (\xi _x + a \eta _x)\Vert _y\le & {} \Vert \xi _y - {\mathcal {P}}_{\gamma }^{1 \leftarrow 0} \xi _x\Vert _y + \Vert {\mathcal {P}}_{\gamma }^{1 \leftarrow 0} \xi _x - {\mathcal {T}}_{\eta _x}^{-\sharp } \xi _x\Vert _y + \Vert {\mathcal {T}}_{\eta _x}^{-\sharp } a \eta _x\Vert _y \\\le & {} L_c \Vert \eta _x\Vert _x, \end{aligned}$$

where \(L_{c} = L_v \kappa + L_{cc} \sup _{x \in {{\bar{\varOmega }}}} \Vert \xi _x\Vert _x + a L_t\).\(\square \)
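
An inequality of the form (A.8) can likewise be probed numerically in a simple special case. The Python sketch below (an illustration under assumptions; it uses parallel translation and the projection-based vector transport on the unit sphere rather than the inverse adjoint transport \({\mathcal {T}}^{-\sharp }\) appearing in the lemma) transports \(\xi _x\) to \(y = \exp _x(\eta _x)\) in both ways and checks that the difference is bounded by a constant times \(\Vert \xi _x\Vert _x \Vert \eta _x\Vert _x\).

```python
import numpy as np

def parallel_transport_sphere(x, eta, xi):
    """Parallel translation of xi along the geodesic t -> exp_x(t*eta), from t = 0 to t = 1."""
    t = np.linalg.norm(eta)
    u = eta / t
    return xi + (xi @ u) * ((np.cos(t) - 1.0) * u - np.sin(t) * x)

def project(v, p):
    """Orthogonal projection of v onto the tangent space of the unit sphere at p."""
    return v - (v @ p) * p

rng = np.random.default_rng(1)
n, ratios = 10, []
for _ in range(2000):
    x = rng.standard_normal(n); x /= np.linalg.norm(x)
    eta = project(rng.standard_normal(n), x)
    eta *= rng.uniform(1e-3, 0.5) / np.linalg.norm(eta)
    xi = project(rng.standard_normal(n), x)
    t = np.linalg.norm(eta)
    y = x * np.cos(t) + (eta / t) * np.sin(t)        # y = exp_x(eta)
    diff = parallel_transport_sphere(x, eta, xi) - project(xi, y)
    ratios.append(np.linalg.norm(diff) / (np.linalg.norm(xi) * t))
print(max(ratios))  # bounded (and small), consistent with a bound L * ||xi||_x * ||eta||_x
```

In this example the difference is in fact of order \(\Vert \xi _x\Vert _x \Vert \eta _x\Vert _x^2\), so the ratio above shrinks with \(\Vert \eta _x\Vert _x\), which is stronger than what the lemma requires.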

About this article

Cite this article

Huang, W., Wei, K. Riemannian proximal gradient methods. Math. Program. 194, 371–413 (2022). https://doi.org/10.1007/s10107-021-01632-3
