
Linear convergence of an alternating polar decomposition method for low rank orthogonal tensor approximations

  • Full Length Paper
  • Series A

Mathematical Programming

Abstract

Low rank orthogonal tensor approximation (LROTA) is an important problem in tensor computations and their applications. A classical and widely used algorithm is the alternating polar decomposition method (APD). In this paper, an improved version of the classical APD, named iAPD, is proposed. For the first time, all of the following four fundamental properties are established for iAPD: (i) the algorithm converges globally and the whole sequence converges to a KKT point without any assumption; (ii) it exhibits an overall sublinear convergence with an explicit rate, which is sharper than the usual O(1/k) rate of first order methods in optimization; (iii) more importantly, it converges R-linearly for a generic tensor without any assumption; (iv) for almost all LROTA problems, iAPD reduces to APD after finitely many iterations if it converges to a local minimizer.


Notes

  1. Yang’s paper was posted during the final preparation of this paper. Both works employ the proximal technique; a difference is that the proximal correction in our algorithm is adaptive, and a theoretical investigation of its execution is also given (cf. Sect. 5).

  2. An algebraic variety here means a set of common roots of some polynomials. In other words, an algebraic variety is a closed subset of some \({\mathbb {R}}^N\) with respect to the Zariski topology.

  3. This is due to the fact that \({\text {dim}}(V(n-1,n))={\text {dim}}(V(n,n))\).

  4. In the general definition of the KL property, \(\partial p({\mathbf {x}})\) is actually the limiting subdifferential of p at \(\mathbf {x}\). However, for the functions considered in this paper, the two notions of subdifferential coincide, so we do not distinguish them.

  5. Since U is orthonormal, we must have

    $$\begin{aligned} \Vert UAU\Vert _F^2=\Vert AU\Vert _F^2=\langle AUU^{\scriptscriptstyle {\mathsf {T}}},A\rangle \le \Vert A\Vert _F^2. \end{aligned}$$

References

  1. Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, USA (2008)

  2. Anandkumar, A., Ge, R., Hsu, D., Kakade, S.M., Telgarsky, M.: Tensor decompositions for learning latent variable models. J. Mach. Learn. Res. 15, 2773–2832 (2014)

  3. Anderson, E., Bai, Z.Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide, 3rd edn. SIAM, Philadelphia, PA (1999)

  4. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35, 438–457 (2010)

  5. Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program. 137(1–2), 91–129 (2013)

  6. Beck, A.: First-Order Methods in Optimization. MOS-SIAM Series on Optimization, SIAM (2017)

  7. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont, USA (1999)

  8. Bochnak, J., Coste, M., Roy, M.-F.: Real Algebraic Geometry. Ergebnisse der Mathematik und ihrer Grenzgebiete, vol. 36. Springer, Berlin (1998)

  9. Bolte, J., Daniilidis, A., Lewis, A.S., Shiota, M.: Clarke subgradients of stratifiable manifolds. SIAM J. Optim. 18(2), 556–572 (2007)

  10. Bolte, J., Daniilidis, A., Ley, O., Mazet, L.: Characterizations of Łojasiewicz inequalities and applications: subgradient flows, talweg, convexity. Trans. Am. Math. Soc. 362, 3319–3363 (2010)

  11. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)

  12. Boralevi, A., Draisma, J., Horobet, E., Robeva, E.: Orthogonal and unitary tensor decomposition from an algebraic perspective. Isr. J. Math. 222, 223–260 (2017)

  13. Chen, J., Saad, Y.: On the tensor SVD and the optimal low rank orthogonal approximation of tensors. SIAM J. Matrix Anal. Appl. 30, 1709–1734 (2009)

  14. Comon, P.: Independent component analysis, a new concept? Signal Process. 36, 287–314 (1994)

  15. Comon, P.: Tensors: a brief introduction. IEEE Signal Proc. Mag. 31, 44–53 (2014)

  16. Comon, P., Jutten, C.: Handbook of Blind Source Separation. Academic Press, Oxford (2010)

  17. D’ Acunto, D., Kurdyka, K.: Explicit bounds for the Łojasiewicz exponent in the gradient inequality for polynomials. Ann. Polon. Math. 87, 51–61 (2005)

  18. De Lathauwer, L., De Moor, B., Vandewalle, J.: On the best rank-1 and rank-\((r_1, r_2,\ldots, r_N)\) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl. 21, 1324–1342 (2000)

  19. De Silva, V., Lim, L.-H.: Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM J. Matrix Anal. Appl. 29, 1084–1127 (2008)

  20. do Carmo, M.P.: Riemannian Geometry. Birkhäuser, Boston (1992)

  21. Eckart, C., Young, G.: The approximation of one matrix by another of lower rank. Psychometrika 1, 211–218 (1936)

  22. Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM. J. Matrix Anal. Appl. 20(2), 303–353 (1998)

  23. Franc, A.: Etude Algébrique des Multitableaux: Apports de l’Algébre Tensorielle. Thèse de Doctorat, Spécialité Statistiques, Univ. de Montpellier II, Montpellier, France (1992)

  24. Golub, G.H., Van Loan, C.F.: Matrix Computations, 4th edn. Johns Hopkins University Press, Baltimore, MD (2013)

  25. Grasedyck, L., Kressner, D., Tobler, C.: A literature survey of low-rank tensor approximation techniques. GAMM-Mitt. 36, 53–78 (2013)

  26. Guan, Y., Chu, D.: Numerical computation for orthogonal low-rank approximation of tensors. SIAM J. Matrix Anal. Appl. 40, 1047–1065 (2019)

  27. Hackbusch, W.: Tensor Spaces and Numerical Tensor Calculus. Springer, Berlin (2012)

  28. Hartshorne, R.: Algebraic Geometry. Graduate Texts in Mathematics 52. Springer, New York (1977)

  29. Higham, N.J.: Computing the polar decomposition-with applications. SIAM J. Sci. Statist. Comput. 7, 1160–1174 (1986)

  30. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, New York (1985)

  31. Hu, S., Li, G.: Convergence rate analysis for the higher order power method in best rank one approximations of tensors. Numer. Math. 140, 993–1031 (2018)

  32. Ishteva, M., Absil, P.-A., Van Dooren, P.: Jacobi algorithm for the best low multilinear rank approximation of symmetric tensors. SIAM J. Matrix Anal. Appl. 34, 651–672 (2013)

  33. Knyazev, A.V., Argentati, M.E.: Majorization for changes in angles between subspaces, Ritz values, and graph Laplacian spectra. SIAM J. Matrix Anal. Appl. 29, 15–32 (2006)

  34. Kolda, T.G.: Orthogonal tensor decompositions. SIAM J. Matrix Anal. Appl. 23, 243–255 (2001)

  35. Kolda, T.G.: A counterexample to the possibility of an extension of the Eckart-Young low rank approximation theorem for the orthogonal rank tensor decomposition. SIAM J. Matrix Anal. Appl. 24, 762–767 (2002)

  36. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51, 455–500 (2009)

  37. Kruskal, J.B.: Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebr. Appl. 18, 95–138 (1977)

  38. Kurdyka, K.: On gradients of functions definable in o-minimal structures. Annales de l’Institut Fourier 48(3), 769–783 (1998)

  39. Landsberg, J.M.: Tensors: Geometry and Applications. AMS, Providence, RI (2012)

  40. Lee, J.D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M.I., Recht, B.: First-order methods almost always avoid strict saddle points. Math. Program. 176, 311–337 (2019)

  41. Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. Proc. Mach. Learn. Res. 49, 1–12 (2016)

  42. Li, G., Pong, T.K.: Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. 18, 1199–1232 (2018)

  43. Li, J., Usevich, K., Comon, P.: Globally convergent Jacobi-type algorithms for simultaneous orthogonal symmetric tensor diagonalization. SIAM J. Matrix Anal. Appl. 39, 1–22 (2018)

  44. Li, J., Usevich, K., Comon, P.: On approximate diagonalization of third order symmetric tensors by orthogonal transformations. Linear Algebr. Appl. 576, 324–351 (2019)

  45. Li, J., Usevich, K., Comon, P.: Jacobi-type algorithm for low rank orthogonal approximation of symmetric tensors and its convergence analysis. arXiv:1911.00659

  46. Lim, L.-H.: Tensors and hypermatrices. In: Hogben, L. (ed.) Handbook of Linear Algebra, 2nd edn, Chapter 15. CRC Press, Boca Raton (2013)

  47. Liu, H., Wu, W., So, A.M.-C.: Quadratic optimization with orthogonality constraints: explicit Łojasiewicz exponent and linear convergence of line-search methods. In: Proceedings of the 33rd International Conference on Machine Learning, New York, USA (2016)

  48. Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels, Les Équations aux Dérivées Partielles. Éditions du centre National de la Recherche Scientifique, Paris, pp. 87–89 (1963)

  49. Martin, C.D.M., Van Loan, C.F.: A Jacobi-type method for computing orthogonal tensor decompositions. SIAM J. Matrix Anal. Appl. 30, 1219–1232 (2008)

  50. McCullagh, P.: Tensor Methods in Statistics. Chapman and Hall, London (1987)

  51. Milnor, J.: Morse Theory. Annals Math. Studies, 51. Princeton Univ. Press, Princeton, NJ (1963)

  52. Mohlenkamp, M.J.: Musings on multilinear fitting. Linear Algebr. Appl. 438, 834–852 (2013)

  53. Mu, C., Hsu, D., Goldfarb, D.: Successive rank-one approximations for nearly orthogonally decomposable symmetric tensors. SIAM J. Matrix Anal. Appl. 36, 1638–1659 (2015)

  54. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, London (2004)

  55. Qi, L., Luo, Z.: Tensor Analysis: Spectral Theory and Special Tensors. SIAM (2017)

  56. Rockafellar, R.T., Wets, R.: Variational Analysis. Grundlehren der Mathematischen Wissenschaften, vol. 317. Springer, Berlin (1998)

  57. Shafarevich, I.R.: Basic Algebraic Geometry. Springer-Verlag, Berlin (1977)

  58. Sørensen, M., De Lathauwer, L., Comon, P., Icart, S., Deneire, L.: Canonical polyadic decomposition with a columnwise orthonormal factor matrix. SIAM J. Matrix Anal. Appl. 33, 1190–1213 (2012)

  59. Sun, J., Chen, C.: Generalized polar decomposition. Math. Numer. Sinica 11, 262–273 (1989)

  60. Uschmajew, A.: Local convergence of the alternating least squares algorithm for canonical tensor approximation. SIAM J. Matrix Anal. Appl. 33, 639–652 (2012)

  61. Uschmajew, A.: A new convergence proof for the high-order power method and generalizations. Pacific J. Optim. 11, 309–321 (2015)

  62. Usevich, K., Li, J., Comon, P.: Approximate matrix and tensor diagonalization by unitary transformations: convergence of Jacobi-type algorithms. SIAM J. Optim. 30, 2998–3028 (2020)

  63. Wang, L., Chu, M.T.: On the global convergence of the alternating least squares method for rank-one tensor approximation. SIAM J. Matrix Anal. Appl. 35(3), 1058–1072 (2014)

  64. Wang, L., Chu, M.T., Yu, B.: Orthogonal low rank tensor approximation: alternating least squares method and its global convergence. SIAM J. Matrix Anal. Appl. 36, 1–19 (2015)

  65. Yang, Y.: The epsilon-alternating least squares for orthogonal low-rank tensor approximation and its global convergence. SIAM J. Matrix Anal. Appl. 41, 1797–1825 (2020)

  66. Zhang, T., Golub, G.H.: Rank-one approximation to high order tensors. SIAM J. Matrix Anal. Appl. 23, 534–550 (2001)


Acknowledgements

We would like to thank the anonymous associate editor and the two referees for their thoughtful suggestions and helpful comments on our manuscript.

Author information

Correspondence to Ke Ye.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

We declare that no part of this paper has been published or submitted elsewhere.

This work is partially supported by the National Natural Science Foundation of China (Grant No. 11771328). The first author is also partially supported by the National Natural Science Foundation of China (Grant No. 12171128), the Young Elite Scientists Sponsorship Program by Tianjin, and the Natural Science Foundation of Zhejiang Province, China (Grant Nos. LD19A010002 and LY22A010022). The second author is also partially supported by the National Natural Science Foundation of China (Grant Nos. 11801548 and 11688101), the National Key R&D Program of China (Grant Nos. 2018YFA0306702 and 2020YFA0712300) and the CAS Project for Young Scientists in Basic Research (Grant No. YSBR-008).

Appendices

A Preliminaries

This section consists of preliminaries on differential geometry and optimization theory that are needed to follow the details of the proofs in this paper. More precisely, we discuss Stiefel manifolds and their basic properties in Appendix A.1; Morse functions and the Kurdyka-Łojasiewicz property are introduced in Appendices A.2 and A.3, respectively.

A.1 Stiefel manifolds

Let \(m\le n\) be two positive integers and let \(V(m,n)\subseteq {\mathbb {R}}^{n \times m}\) be the set of all \(n \times m\) orthonormal matrices, i.e.,

$$\begin{aligned} V(m,n):=\{U\in {\mathbb {R}}^{n \times m}:U^{\scriptscriptstyle {\mathsf {T}}}U=I\}, \end{aligned}$$

where I is the identity matrix of matching size. Indeed, \(V(m,n)\) admits a smooth manifold structure and is called the Stiefel manifold of orthonormal m-frames in \({\mathbb {R}}^n\). In particular, if \(m = n\), then \(V(n,n)\) simply reduces to the orthogonal group \({{\,\mathrm{O}\,}}(n)\).

In general, if M is a Riemannian submanifold of \({\mathbb {R}}^k\), then the normal space \({\text {N}}_M({\mathbf {x}})\) of M at a point \({\mathbf {x}}\in M\) is defined to be the orthogonal complement of its tangent space \({\text {T}}_M({\mathbf {x}})\) in \({\mathbb {R}}^k\), i.e., \({\text {N}}_M ({\mathbf {x}}) {:}{=}{\text {T}}_M({\mathbf {x}})^{\perp }\). By definition, \(V(m,n)\) is naturally a Riemannian submanifold of \({\mathbb {R}}^{n \times m}\) and hence its normal space is well defined. Indeed, it follows from [56, Chapter 6.C] and [1, 22] that

$$\begin{aligned} {\text {N}}_{V(m,n)}(A)=\{AX: X\in {\text {S}}^{m \times m}\}, \end{aligned}$$
(65)

where \({\text {S}}^{m\times m}\subseteq {\mathbb {R}}^{m\times m}\) is the subspace of \(m \times m\) symmetric matrices.

Given a matrix \(B\in {\mathbb {R}}^{n \times m}\), the projection of B onto the normal space of \(V(m,n)\) at a point \(A\in V(m,n)\) is

$$\begin{aligned} \pi _{{\text {N}}_{V(m,n)}(A)}(B)=A \left( \frac{A^{\scriptscriptstyle {\mathsf {T}}}B+B^{\scriptscriptstyle {\mathsf {T}}}A}{2} \right) . \end{aligned}$$

Likewise, given a matrix \(B\in {\mathbb {R}}^{n \times m}\), the projection of B onto the tangent space of \(V(m,n)\) at A is given by

$$\begin{aligned} \pi _{{\text {T}}_{V(m,n)}(A)}(B)=A{\text {skew}}(A^{\scriptscriptstyle {\mathsf {T}}}B)+(I-AA^{\scriptscriptstyle {\mathsf {T}}})B, \end{aligned}$$
(66)

where \({\text {skew}}(C) {:}{=}\frac{C-C^{\scriptscriptstyle {\mathsf {T}}}}{2} \) for a square matrix \(C\in {\mathbb {R}}^{m \times m}\). A more explicit formula is given as

$$\begin{aligned} \pi _{{\text {T}}_{V(m,n)}(A)}(B)= (I-\frac{1}{2}AA^{\scriptscriptstyle {\mathsf {T}}})(B-AB^{\scriptscriptstyle {\mathsf {T}}}A). \end{aligned}$$
(67)
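
As a quick sanity check, formulas (65)-(67) are straightforward to verify numerically. Below is a minimal NumPy sketch (ours, not from the paper; the helper names proj_normal and proj_tangent and the dimensions are ours): it confirms that (66) and (67) agree and that the two projections decompose an arbitrary matrix B.

```python
import numpy as np

def proj_normal(A, B):
    # Projection of B onto the normal space N_{V(m,n)}(A), cf. (65).
    return A @ ((A.T @ B + B.T @ A) / 2)

def proj_tangent(A, B):
    # Projection of B onto the tangent space T_{V(m,n)}(A), cf. (66).
    skew = (A.T @ B - B.T @ A) / 2
    return A @ skew + (np.eye(A.shape[0]) - A @ A.T) @ B

rng = np.random.default_rng(0)
n, m = 7, 3
A = np.linalg.qr(rng.standard_normal((n, m)))[0]  # a point on V(m,n)
B = rng.standard_normal((n, m))

alt = (np.eye(n) - 0.5 * A @ A.T) @ (B - A @ B.T @ A)  # formula (67)
assert np.allclose(proj_tangent(A, B), alt)
assert np.allclose(proj_tangent(A, B) + proj_normal(A, B), B)
```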

Given a function \(f : {\mathbb {R}}^k\rightarrow {\mathbb {R}}\cup \{\infty \}\), the subdifferential of f at \({\mathbf {x}}\in {\mathbb {R}}^k\) is defined as (cf. [56])

$$\begin{aligned} \partial f({\mathbf {x}}) {:}{=}\Bigg \{{\mathbf {v}}\in {\mathbb {R}}^k:\liminf _{{\mathbf {x}}\ne {\mathbf {y}}\rightarrow {\mathbf {x}}}\frac{f({\mathbf {y}})-f({\mathbf {x}})-\langle {\mathbf {v}},{\mathbf {y}}-{\mathbf {x}}\rangle }{\Vert {\mathbf {y}}-{\mathbf {x}}\Vert }\ge 0\Bigg \}. \end{aligned}$$

If \({\mathbf {0}}\in \partial f({\mathbf {x}})\), then \({\mathbf {x}}\) is a critical point of f.

The indicator function \(\delta _{V(m,n)}:{\mathbb {R}}^{n\times m} \rightarrow {\mathbb {R}} \cup \{\infty \}\) of \(V(m,n)\) is defined by

$$\begin{aligned} \delta _{V(m,n)}(X) {:}{=}{\left\{ \begin{array}{ll}0&\text {if }X\in V(m,n),\\ +\infty &\text {otherwise}.\end{array}\right. } \end{aligned}$$
(68)

An important fact about the normal space \({\text {N}}_{V(m,n)}(A)\) and the subdifferential of the indicator function \(\delta _{V(m,n)}\) of \(V(m,n)\) at A is (cf. [56])

$$\begin{aligned} \partial \delta _{V(m,n)}(A)={\text {N}}_{V(m,n)}(A). \end{aligned}$$
(69)

A.2 Morse functions

In the following, we introduce the notion of Morse functions and recall some of their basic properties. On a smooth manifold M, a smooth function \(f : M\rightarrow {\mathbb {R}}\) is called a Morse function if each critical point of f on M is nondegenerate, i.e., the Hessian of f at each critical point is non-singular. The following result is well known; see, for example, [51, Theorem 6.6].

Lemma 11

(Projection is Generically Morse) Let M be a submanifold of \({\mathbb {R}}^n\). For a generic \({\mathbf {a}} = (a_1,\dots ,a_n)^{\scriptscriptstyle {\mathsf {T}}}\in {\mathbb {R}}^n\), the Euclidean distance function

$$\begin{aligned} f({\mathbf {x}})= \Vert {\mathbf {x}}-{\mathbf {a}}\Vert ^2 \end{aligned}$$

is a Morse function on M.

We also need the following property (cf. [51, Corollary 2.3]) of nondegenerate critical points in the sequel.

Lemma 12

Let M be a manifold and let \(f:M \rightarrow {\mathbb {R}}\) be a smooth function. Nondegenerate critical points of f are isolated.

To conclude this subsection, we briefly discuss how critical points behave under a local diffeomorphism. For this purpose, we recall that a smooth manifold \(M_1\) is called locally diffeomorphic to a smooth manifold \(M_2\) if there is a smooth map \(\varphi : M_1\rightarrow M_2\) such that for each point \({\mathbf {x}}\in M_1\) there exists a neighborhood \(U\subseteq M_1\) of \({\mathbf {x}}\) and a neighborhood \(V\subseteq M_2\) of \(\varphi ({\mathbf {x}})\) such that the restriction \(\varphi |_U : U\rightarrow V\) is a diffeomorphism [20]. In this case, the corresponding \(\varphi \) is called a local diffeomorphism from \(M_1\) to \(M_2\). It is clear from the definition that whenever \(M_1\) is locally diffeomorphic to \(M_2\) then the two manifolds must have the same dimension. Moreover, we have the following result, whose proof can be found in [31, Proposition 5.2].

Proposition 11

Let a smooth manifold \(M_1\) be locally diffeomorphic to another smooth manifold \(M_2\) and let \(\varphi : M_1\rightarrow M_2\) be the corresponding local diffeomorphism. Let \(f : M_2\rightarrow {\mathbb {R}}\) be a smooth function. Then \({\mathbf {x}}\in M_1\) is a (nondegenerate) critical point of \(f\circ \varphi \) on \(M_1\) if and only if \(\varphi ({\mathbf {x}})\) is a (nondegenerate) critical point of f on \(M_2\).

When f is a smooth function on \({\mathbb {R}}^n\) and M is a submanifold of \({\mathbb {R}}^n\), we denote by \(\nabla f\) the gradient of f as a vector field on \({\mathbb {R}}^n\), while we denote by \({\text {grad}}(f)\) the Riemannian gradient of f as a vector field on M.

A.3 Kurdyka-Łojasiewicz property

In this subsection, we review some basic facts about the Kurdyka-Łojasiewicz property, which was first established for smooth definable functions by Kurdyka [38] and later generalized to the non-smooth definable case by Bolte, Daniilidis, Lewis and Shiota [9]. Interested readers are referred to [4, 5, 10, 42] for more discussions and applications of the Kurdyka-Łojasiewicz property.

Let p be an extended real-valued function and let \(\partial p({\mathbf {x}})\) be the set of subgradients of p at \({\mathbf {x}}\) (cf. [56]). We define \({\text {dom}}(\partial p) {:}{=}\{{\mathbf {x}}:\partial p({\mathbf {x}} )\ne \emptyset \}\) and take \({\mathbf {x}}^*\in {\text {dom}}(\partial p)\). If there exist some \(\eta \in (0,+\infty ]\), a neighborhood U of \({\mathbf {x}}^*\), and a continuous concave function \(\varphi :[0,\eta )\rightarrow {\mathbb {R}}_+\), such that

  1. \(\varphi (0)=0\) and \(\varphi \) is continuously differentiable on \((0,\eta )\),

  2. for all \(s\in (0,\eta )\), \(\varphi ^{\prime }(s)>0\), and

  3. for all \({\mathbf {x}}\in U\cap \{ {\mathbf {y}}:p({\mathbf {x}}^*)<p({\mathbf {y}})<p({\mathbf {x}}^*)+\eta \}\), the Kurdyka-Łojasiewicz inequality holds

    $$\begin{aligned} \varphi ^{\prime }(p({\mathbf {x}}) - p({\mathbf {x}}^*)) {\text {dist}}({\mathbf {0}},\partial p({\mathbf {x}}))\ge 1, \end{aligned}$$

then we say that p has the Kurdyka-Łojasiewicz (abbreviated as KL) property at \({\mathbf {x}}^*\). Here \({\text {dist}}({\mathbf {0}},\partial p({\mathbf {x}}))\) denotes the distance from \({\mathbf {0}}\) to the set \(\partial p({\mathbf {x}})\) (see footnote 4). If p is proper, lower semicontinuous, and has the KL property at each point of \({\text {dom}}(\partial p)\), then p is said to be a KL function. Examples of KL functions include real subanalytic functions and semi-algebraic functions [11]; the functions involved in this paper are semi-algebraic, and we refer to [8] and the references therein for more details on such functions. In particular, polynomial functions are semi-algebraic and hence KL functions. Another important fact is that the indicator function of a semi-algebraic set is also a semi-algebraic function [8, 11]; likewise, a finite sum of semi-algebraic functions is again semi-algebraic. We assemble these facts into the following lemma, which is crucial to the analysis of the global convergence of our algorithm.

Lemma 13

A finite sum of polynomial functions and indicator functions of semi-algebraic sets is a KL function.
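
For a concrete one-dimensional illustration of the KL inequality (ours, for orientation only): the polynomial \(p(x)=x^2\) has the KL property at \(x^*=0\) with \(\eta =+\infty \) and \(\varphi (s)=2\sqrt{s}\), since for every \(x\ne 0\)

$$\begin{aligned} \varphi ^{\prime }(p(x)-p(x^*)){\text {dist}}(0,\partial p(x))=\frac{1}{\sqrt{x^2}}\cdot |2x|=2\ge 1. \end{aligned}$$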

While the KL property is used for the global convergence analysis, the Łojasiewicz inequality discussed in the rest of this subsection is used for the convergence rate analysis. The classical Łojasiewicz inequality for analytic functions is stated as follows (cf. [48]):

  • (Classical Łojasiewicz’s Gradient Inequality) If f is an analytic function with \(f({\mathbf {0}})=0\) and \(\nabla f({\mathbf {0}})={\mathbf {0}}\), then there exist positive constants \(\mu , \kappa ,\) and \(\epsilon \) such that

    $$\begin{aligned} \Vert \nabla f({\mathbf {x}})\Vert \ge \mu |f({\mathbf {x}})|^{\kappa }\; \text{ for } \text{ all } \;\Vert {\mathbf {x}}\Vert \le \epsilon . \end{aligned}$$
    (70)

As pointed out in [4, 10], it is often difficult to determine the exponent \(\kappa \) in Łojasiewicz's gradient inequality, and it is unknown for general functions. However, an estimate of the exponent \(\kappa \) in the gradient inequality was obtained by D'Acunto and Kurdyka in [17, Theorem 4.2] when f is a polynomial function. We record this fundamental result in the next lemma; it plays a key role in our sublinear convergence rate analysis.

Lemma 14

(Łojasiewicz’s Gradient Inequality for Polynomials) Let f be a real polynomial of degree d in n variables. Suppose that \(f({\mathbf {0}})=0\) and \(\nabla f({\mathbf {0}})={\mathbf {0}}\). There exist constants \(\mu , \epsilon > 0\) such that for all \(\Vert {\mathbf {x}}\Vert \le \epsilon \), we have

$$\begin{aligned} \Vert \nabla f({\mathbf {x}})\Vert \ge \mu |f({\mathbf {x}})|^{\kappa }\; \text{ with } \;\kappa = 1-\frac{1}{d(3d-3)^{n-1}}. \end{aligned}$$
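
To get a feeling for how fast this exponent approaches 1 as the degree d and the number of variables n grow, the following one-line computation (ours) evaluates the formula of Lemma 14 for a few sample values.

```python
# kappa = 1 - 1/(d(3d-3)^(n-1)) from Lemma 14, for a few degrees d and dimensions n.
for d, n in [(2, 2), (3, 3), (4, 6), (6, 10)]:
    kappa = 1 - 1 / (d * (3 * d - 3) ** (n - 1))
    print(f"d={d:2d}, n={n:2d}: kappa = {kappa:.15f}")
```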

Below is a more precise version of the classical Łojasiewicz gradient inequality (70), in which we can take the exponent \(\kappa \) to be 1/2 near a nondegenerate critical point. For the sake of subsequent analysis, it is stated for a smooth manifold [20].

Proposition 12

(Łojasiewicz’s Gradient Inequality) Let M be a smooth manifold and let \(f: M \rightarrow {\mathbb {R}}\) be a smooth function for which \({\mathbf {z}}^*\) is a nondegenerate critical point. Then there exist a neighborhood U in M of \({\mathbf {z}}^*\) and some \(\mu >0\) such that for all \({\mathbf {z}}\in {U}\)

$$\begin{aligned} \Vert {\text {grad}}(f)({\mathbf {z}})\Vert \ge \mu |f({\mathbf {z}})-f({\mathbf {z}}^*)|^{\frac{1}{2}}. \end{aligned}$$

B Properties of orthonormal matrices

B.1 Polar decomposition

In this section, we establish an error bound analysis for the polar decomposition. For a positive semidefinite matrix \(H\in {\text {S}}^{n}_+\), there exists a unique positive semidefinite matrix \(P\in {\text {S}}^{n}_+\) such that \(P^2=H\); this matrix P is called the square root of H and is denoted by \(P=\sqrt{H}\). If \(H=U\varSigma U^{\scriptscriptstyle {\mathsf {T}}}\) is an eigenvalue decomposition of H, then \(\sqrt{H}=U\sqrt{\varSigma }U^{\scriptscriptstyle {\mathsf {T}}}\), where \(\sqrt{\varSigma }\) is the diagonal matrix whose diagonal entries are the square roots of those of \(\varSigma \). The next result is classical and can be found in [24, 29].

Lemma 15

(Polar Decomposition) Let \(A\in {\mathbb {R}}^{m\times n}\) with \(m\ge n\). Then there exist an orthonormal matrix \(U\in V(n,m)\) and a unique symmetric positive semidefinite matrix \(H\in {\text {S}}^{n}_+\) such that \(A=UH\) and

$$\begin{aligned} U\in {\text {argmax}}\{\langle Q,A\rangle :Q\in V(n,m)\}. \end{aligned}$$
(71)

Moreover, if A is of full rank, then the matrix U is uniquely determined and H is positive definite.
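
Numerically, the polar decomposition in Lemma 15 is conveniently obtained from a thin SVD. The sketch below (ours; the helper name polar and the dimensions are ours) also spot-checks the variational characterization (71) against random feasible points.

```python
import numpy as np

def polar(A):
    # Polar decomposition A = U H via the thin SVD A = W diag(s) Vt:
    # U = W @ Vt is orthonormal and H = Vt.T @ diag(s) @ Vt is symmetric PSD.
    W, s, Vt = np.linalg.svd(A, full_matrices=False)
    return W @ Vt, Vt.T @ np.diag(s) @ Vt

rng = np.random.default_rng(1)
m, n = 6, 4
A = rng.standard_normal((m, n))
U, H = polar(A)
assert np.allclose(U.T @ U, np.eye(n)) and np.allclose(U @ H, A)

# U attains the maximum in (71): <U, A> >= <Q, A> for random Q in V(n, m).
for _ in range(200):
    Q = np.linalg.qr(rng.standard_normal((m, n)))[0]
    assert np.trace(U.T @ A) >= np.trace(Q.T @ A) - 1e-12
```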

The matrix decomposition \(A=UH\) as in Lemma 15 is called the polar decomposition of the matrix A [24]. For convenience, the matrix U is referred to as a polar orthonormal factor matrix and the matrix H as the polar positive semidefinite factor matrix. The optimization reformulation (71) arises from the approximation problem

$$\begin{aligned} \min _{Q\in V(n,m)}\ \Vert B-QC\Vert ^2 \end{aligned}$$

for two given matrices B and C of proper sizes. In the following, we give a global error bound for this problem. To this end, the next lemma is useful.

Lemma 16

(Error Reformulation) Let pmn be positive integers with \(m\ge n\) and let \(B\in {\mathbb {R}}^{m\times p}\), \(C\in {\mathbb {R}}^{n\times p}\) be two given matrices. We set \(A {:}{=}BC^{\scriptscriptstyle {\mathsf {T}}}\in {\mathbb {R}}^{m\times n}\) and suppose that \(A=WH\) is a polar decomposition of A. We have

$$\begin{aligned} \Vert B-QC\Vert _F^2-\Vert B-WC\Vert _F^2= \Vert W\sqrt{H}-Q\sqrt{H}\Vert ^2_F \end{aligned}$$
(72)

for any orthonormal matrix \(Q\in V(n,m)\).

Proof

We have

$$\begin{aligned} \Vert B-QC\Vert _F^2-\Vert B-WC\Vert _F^2&=2\langle B,WC-QC\rangle \\&=2\langle A,W-Q\rangle \\&=2\langle WH,W-Q\rangle \\&= \langle WH,W\rangle -2\langle WH,Q\rangle +\langle QH,Q\rangle \\&= \Vert W\sqrt{H}-Q\sqrt{H}\Vert ^2_F, \end{aligned}$$

where the first and the fourth equalities follow from the fact that both Q and W are in \(V(n,m)\), and the last one is derived from the fact that H is symmetric and positive semidefinite by Lemma 15. \(\square \)
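
The identity (72) is also easy to confirm numerically; the following sketch (ours) reuses the polar helper from the previous code block and computes \(\sqrt{H}\) by the eigenvalue decomposition described at the beginning of this section.

```python
# Numerical check of the identity (72); assumes polar() from the sketch above.
rng = np.random.default_rng(2)
p, m, n = 5, 6, 4
B, C = rng.standard_normal((m, p)), rng.standard_normal((n, p))
W, H = polar(B @ C.T)

w, V = np.linalg.eigh(H)                               # H = V diag(w) V^T
sqrtH = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

for _ in range(200):
    Q = np.linalg.qr(rng.standard_normal((m, n)))[0]
    lhs = np.linalg.norm(B - Q @ C)**2 - np.linalg.norm(B - W @ C)**2
    assert np.isclose(lhs, np.linalg.norm((W - Q) @ sqrtH)**2)
```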

Given an \(n\times n\) symmetric positive semidefinite matrix H, we can define a symmetric bilinear form on \({\mathbb {R}}^{m\times n}\) by

$$\begin{aligned} \langle P,Q\rangle _H:=\langle PH,Q\rangle \end{aligned}$$
(73)

for all \(m\times n\) matrices P and Q. This bilinear form also induces a seminorm

$$\begin{aligned} \Vert A\Vert _H:=\sqrt{\langle A,A\rangle _H}=\Vert A\sqrt{H}\Vert _F. \end{aligned}$$
(74)

In particular, if H is positive definite, then \(\Vert \cdot \Vert _H\) is a norm on \({\mathbb {R}}^{m\times n}\). Thus, the identity (72) can be viewed as an estimate of the distance between W and Q with respect to the metric induced by this norm. Moreover, if H is the identity matrix, then \(\Vert \cdot \Vert _H\) is simply the Frobenius norm, which induces the Euclidean distance on \({\mathbb {R}}^{m\times n}\). By Lemma 16, it is easy to see that the maximizer in (71) is unique whenever A is of full rank.

The following result establishes an error bound with respect to the Euclidean distance. Given a matrix \(A\in {\mathbb {R}}^{m\times n}\), let \(\sigma _{\min }(A)\) denote the smallest singular value of A; if A is of full rank, then \(\sigma _{\min }(A)>0\).

Theorem 9

(Global Error Bound in Frobenius Norm) Let pmn be positive integers with \(m\ge n\) and let \(B\in {\mathbb {R}}^{m\times p}\) and \(C\in {\mathbb {R}}^{n\times p}\) be two given matrices. We set \(A:=BC^{\scriptscriptstyle {\mathsf {T}}}\in {\mathbb {R}}^{m\times n}\) and suppose that A is of full rank with the polar decomposition \(A=WH\). We have that

$$\begin{aligned} \Vert B-QC\Vert _F^2-\Vert B-WC\Vert _F^2\ge \sigma _{\min }(A)\Vert W-Q\Vert ^2_F \end{aligned}$$
(75)

for any orthonormal matrix \(Q\in V(n,m)\).

Proof

We know that in this case

$$\begin{aligned} H-\sigma _{\min }(A)I\in {\text {S}}^{n}_+, \end{aligned}$$

since the eigenvalues of H are exactly the singular values of A. Therefore, using the seminorm defined in (74), we may conclude that

$$\begin{aligned} \Vert W\sqrt{H}-Q\sqrt{H}\Vert ^2_F=\Vert W-Q\Vert ^2_{H}\ge \Vert W-Q\Vert ^2_{\sigma _{\min }(A)I}=\sigma _{\min }(A)\Vert W-Q\Vert _F^2, \end{aligned}$$

where the inequality holds because \(H-\sigma _{\min }(A)I\) is positive semidefinite.

According to Lemma 16, we can derive the desired inequality. \(\square \)

Theorem 9 is a refinement of Sun and Chen’s result (cf. [59, Theorem 4.1]), in which the right hand side of (75) has an extra factor \(\frac{1}{4}\).
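
The bound (75) can be spot-checked in the same way (again our sketch, reusing the polar helper from the code block in Appendix B.1 above):

```python
# Randomized check of the global error bound (75); assumes polar() from above.
rng = np.random.default_rng(3)
p, m, n = 5, 6, 4
B, C = rng.standard_normal((m, p)), rng.standard_normal((n, p))
A = B @ C.T                                            # generically of full rank
W, H = polar(A)
sigma_min = np.linalg.svd(A, compute_uv=False)[-1]

for _ in range(200):
    Q = np.linalg.qr(rng.standard_normal((m, n)))[0]
    lhs = np.linalg.norm(B - Q @ C)**2 - np.linalg.norm(B - W @ C)**2
    assert lhs >= sigma_min * np.linalg.norm(W - Q)**2 - 1e-9
```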

B.2 Principal angles between subspaces

Given two linear subspaces \({\mathbb {U}}\) and \({\mathbb {V}}\) of dimension r in \({\mathbb {R}}^n\), the principal angles \(\{\theta _i:i=1,\dots ,r\}\) between \({\mathbb {U}}\) and \({\mathbb {V}}\) and the associated principal vectors \(\{({\mathbf {u}}_i,{\mathbf {v}}_i):i=1,\dots ,r\}\) are defined recursively by

$$\begin{aligned} \cos (\theta _i)=\langle {\mathbf {u}}_i,{\mathbf {v}}_i\rangle = \max _{{\mathbf {u}}\in {\mathbb {U}},\ [{\mathbf {u}},{\mathbf {u}}_1,\dots ,{\mathbf {u}}_{i-1}]\in V(i,n)}{\left\{ \max _{{\mathbf {v}}\in {\mathbb {V}},\ [{\mathbf {v}},{\mathbf {v}}_1,\dots ,{\mathbf {v}}_{i-1}]\in V(i,n)} \langle {\mathbf {u}},{\mathbf {v}}\rangle \right\} }. \end{aligned}$$
(76)

The following result is standard; its proof can be found in [24, Section 6.4.3].

Lemma 17

For any orthonormal matrices \(U,V\in V(r,n)\) whose columns span \({\mathbb {U}}\) and \({\mathbb {V}}\) respectively, we have

$$\begin{aligned} \sigma _i(U^{\scriptscriptstyle {\mathsf {T}}}V)=\cos (\theta _i)\ \text {for all }i=1,\dots ,r, \end{aligned}$$
(77)

where \(\theta _i\)’s are the principal angles between \({\mathbb {U}}\) and \({\mathbb {V}}\), and \(\sigma _i(U^{\scriptscriptstyle {\mathsf {T}}}V)\) is the i-th largest singular value of the matrix \(U^{\scriptscriptstyle {\mathsf {T}}}V\).

Lemma 18

For any orthonormal matrices \(U,V\in V(r,n)\), we have

$$\begin{aligned} \langle U, V \rangle \le \sum _{j=1}^r\sigma _j(U^{{\scriptscriptstyle {\mathsf {T}}}}V). \end{aligned}$$
(78)

Proof

We recall from [24, p. 331] that

$$\begin{aligned} \min _{Q\in {{\,\mathrm{O}\,}}(r)} \Vert A - BQ \Vert ^2_F = \sum _{j=1}^r (\sigma _j(A)^2 - 2 \sigma _j(B^{{\scriptscriptstyle {\mathsf {T}}}}A) + \sigma _j(B)^2), \end{aligned}$$

for any \(n\times r\) matrices AB. In particular, if \(U,V\in {{\,\mathrm{V}\,}}(r,n)\) then

$$\begin{aligned} \sigma _j(U) = \sigma _j(V) = 1\ \text {for all }j=1,\dots ,r, \quad \Vert U \Vert ^2_F = \Vert V \Vert ^2_F = r. \end{aligned}$$

This implies that

$$\begin{aligned} 2 r - 2 \sum _{j=1}^r \sigma _j(U^{\scriptscriptstyle {\mathsf {T}}}V) = \min _{Q\in {{\,\mathrm{O}\,}}(r)} \Vert U - VQ \Vert ^2_F \le \Vert U - V \Vert ^2_F = 2r - 2 \langle U, V \rangle , \end{aligned}$$

and the desired inequality follows immediately. \(\square \)
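
In floating-point practice, (77) and (78) translate directly into a few lines of NumPy; the sketch below (ours, with arbitrary stand-in dimensions) computes the principal angles from the singular values of \(U^{\scriptscriptstyle {\mathsf {T}}}V\) and spot-checks (78).

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 8, 3
U = np.linalg.qr(rng.standard_normal((n, r)))[0]
V = np.linalg.qr(rng.standard_normal((n, r)))[0]

cosines = np.linalg.svd(U.T @ V, compute_uv=False)  # cos(theta_1) >= ... >= cos(theta_r), cf. (77)
theta = np.arccos(np.clip(cosines, 0.0, 1.0))       # principal angles in [0, pi/2]

assert np.trace(U.T @ V) <= cosines.sum() + 1e-12   # inequality (78)
```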

Lemma 19

For any orthonormal matrices \(U,V\in V(r,n)\), we have

$$\begin{aligned} \Vert U^{\scriptscriptstyle {\mathsf {T}}}V-I\Vert _F^2 \le \Vert U-V\Vert ^2_F. \end{aligned}$$
(79)

Proof

We have

$$\begin{aligned} \Vert U^{\scriptscriptstyle {\mathsf {T}}}V-I\Vert _F^2 = r + \sum _{i=1}^r \cos ^2 \theta _i - 2 {{\,\mathrm{tr}\,}}(U^{\scriptscriptstyle {\mathsf {T}}}V) \le 2r-2{{\,\mathrm{tr}\,}}(U^{\scriptscriptstyle {\mathsf {T}}}V)=\Vert U-V\Vert _F^2, \end{aligned}$$

where the first equality follows from Lemma 17. \(\square \)
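
A quick randomized check of (79) (ours):

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 8, 3
for _ in range(200):
    U = np.linalg.qr(rng.standard_normal((n, r)))[0]
    V = np.linalg.qr(rng.standard_normal((n, r)))[0]
    assert np.linalg.norm(U.T @ V - np.eye(r))**2 <= np.linalg.norm(U - V)**2 + 1e-12
```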

Let \({\mathbb {U}}^\perp \) denote the orthogonal complement of a given linear subspace \({\mathbb {U}}\) in \({\mathbb {R}}^n\). The following useful fact relates the principal angles between two linear subspaces \({\mathbb {U}}\) and \({\mathbb {V}}\) to those between \({\mathbb {U}}^\perp \) and \({\mathbb {V}}^\perp \); its proof can be found in [33, Theorem 2.7].

Lemma 20

Let \({\mathbb {U}}\) and \({\mathbb {V}}\) be two linear subspaces of the same dimension and let \(\frac{\pi }{2}\ge \theta _s \ge \dots \ge \theta _1 > 0\) be the nonzero principal angles between \({\mathbb {U}}\) and \({\mathbb {V}}\). Then the nonzero principal angles between \({\mathbb {U}}^\perp \) and \({\mathbb {V}}^\perp \) are \(\frac{\pi }{2}\ge \theta _s \ge \dots \ge \theta _1 > 0\).

The following result treats the general case and might be of independent interest.

Lemma 21

Let \(m\ge n\) be positive integers and let \(V:=[V_1\ V_2]\in {{\,\mathrm{O}\,}}(m)\) with \(V_1\in V(n,m)\) and \(U\in V(n,m)\) be two given orthonormal matrices. Then, there exists an orthonormal matrix \(W\in V(m-n,m)\) such that \(P:=[U\ W]\in {{\,\mathrm{O}\,}}(m)\) and

$$\begin{aligned} \Vert P-V\Vert _F^2\le 2\Vert U-V_1\Vert _F^2. \end{aligned}$$
(80)

Proof

By a simple computation, it is straightforward to verify that (80) is equivalent to

$$\begin{aligned} \Vert W-V_2\Vert _F^2\le \Vert U-V_1\Vert _F^2. \end{aligned}$$
(81)

To that end, we let \(U_2\in V(m-n,m)\) be an orthonormal matrix such that \([U\ U_2]\in {{\,\mathrm{O}\,}}(m)\). Then, we have that the linear subspace \({\mathbb {U}}_2\) spanned by column vectors of \(U_2\) is the orthogonal complement of \({\mathbb {U}}_1\), which is spanned by column vectors of U. Likewise, let \({\mathbb {V}}_1\) and \({\mathbb {V}}_2\) be linear subspaces spanned by column vectors of \(V_1\) and \(V_2\) respectively.

Let \(\frac{\pi }{2}\ge \theta _s\ge \dots \ge \theta _1 > 0\) be the nonzero principal angles between \({\mathbb {U}}_2\) and \({\mathbb {V}}_2\) for some nonnegative integer \(s\le m-n\). We have by Lemmas 17 and 18 that

$$\begin{aligned} \langle U_2,V_2\rangle \le \sum _{i=1}^{m-n}\sigma _i(U_2^{\scriptscriptstyle {\mathsf {T}}}V_2)=\sum _{i=1}^s\cos (\theta _i)+(m-n)-s. \end{aligned}$$
(82)

Let \(Q\in {{\,\mathrm{O}\,}}(m-n)\) be a polar orthogonal factor matrix of the matrix \(U_2^{\scriptscriptstyle {\mathsf {T}}}V_2\). It follows from the property of polar decomposition that

$$\begin{aligned} \langle U_2Q,V_2\rangle = \sum _{i=1}^{m-n}\sigma _i(U_2^{\scriptscriptstyle {\mathsf {T}}}V_2). \end{aligned}$$
(83)

On the other hand, the nonzero principal angles between \({\mathbb {U}}_1\) and \({\mathbb {V}}_1\) are \(\frac{\pi }{2}\ge \theta _s\ge \dots \ge \theta _1>0\) by Lemma 20. Therefore, by Lemmas 17 and 18, we have that

$$\begin{aligned} \langle U,V_1\rangle \le \sum _{i=1}^{n}\sigma _i(U^{\scriptscriptstyle {\mathsf {T}}}V_1)=\sum _{i=1}^s\cos (\theta _i)+n-s. \end{aligned}$$
(84)

In conclusion, if we set \(W {:}{=}U_2Q\), then we have the following:

$$\begin{aligned} \Vert W-V_2\Vert _F^2&=2(m-n)-2\langle U_2Q,V_2\rangle \\&=2(m-n)-2\big (\sum _{i=1}^s\cos (\theta _i)+m-n-s\big )\\&=2n - 2\big (\sum _{i=1}^s\cos (\theta _i)+n-s\big )\\&\le 2n -2\langle U,V_1\rangle \\&=\Vert U-V_1\Vert ^2_F, \end{aligned}$$

where the second equality follows from (82) and (83) and the inequality follows from (84). \(\square \)
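
The construction in this proof is easy to carry out numerically. In the sketch below (ours, with stand-in dimensions), the orthonormal complement U2 of U is read off from the projector \(I-UU^{\scriptscriptstyle {\mathsf {T}}}\), and Q is the polar orthogonal factor of \(U_2^{\scriptscriptstyle {\mathsf {T}}}V_2\), as in the proof.

```python
# Completing U to P = [U, U2 @ Q] in O(m) with ||P - V||_F^2 <= 2||U - V1||_F^2, cf. (80).
import numpy as np

rng = np.random.default_rng(6)
m, n = 7, 3
V = np.linalg.qr(rng.standard_normal((m, m)))[0]       # V = [V1 V2] in O(m)
V1, V2 = V[:, :n], V[:, n:]
U = np.linalg.qr(rng.standard_normal((m, n)))[0]       # U in V(n, m)

U2 = np.linalg.svd(np.eye(m) - U @ U.T)[0][:, :m - n]  # orthonormal complement of U
Wf, _, Vt = np.linalg.svd(U2.T @ V2)
P = np.hstack([U, U2 @ (Wf @ Vt)])                     # Q = Wf @ Vt is the polar factor

assert np.allclose(P.T @ P, np.eye(m))
assert np.linalg.norm(P - V)**2 <= 2 * np.linalg.norm(U - V1)**2 + 1e-9
```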

C Proofs of technical lemmas in Section 4

C.1 Proof of Lemma 6

Proof

For each \(i\in \{0,\dots ,k-1\}\), we have (cf. (23), (36) and (41))

$$\begin{aligned} f({\mathbb {U}}_{i+1,[p]})-f({\mathbb {U}}_{i,[p]})&=\sum _{j=1}^r(\lambda ^{i+1}_{j,[p]})^2-\sum _{j=1}^r(\lambda ^{i}_{j,[p]})^2\nonumber \\&=\sum _{j=1}^r(\lambda ^{i+1}_{j,[p]}+\lambda ^{i}_{j,[p]})(\lambda ^{i+1}_{j,[p]}-\lambda ^{i}_{j,[p]})\nonumber \\&=\sum _{j=1}^r\lambda ^{i+1}_{j,[p]}(\lambda ^{i+1}_{j,[p]}-\lambda ^{i}_{j,[p]})+\sum _{j=1}^r\lambda ^{i}_{j,[p]}(\lambda ^{i+1}_{j,[p]}-\lambda ^{i}_{j,[p]}). \end{aligned}$$
(85)

We first analyze the second summand in (85) by considering the following two cases:

  1. If \(\sigma ^{(i+1)}_{r,[p]}\ge \epsilon \), then there is no proximal step in Algorithm 1 and we have that \(\sigma _{\min }(S^{(i+1)}_{[p]})=\sigma ^{(i+1)}_{r,[p]}\ge \epsilon \), where \(S^{(i+1)}_{[p]}\) is the polar positive semidefinite factor matrix of \(V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]}\) obtained in (39). From (35), (36) and (37) we notice that

    $$\begin{aligned} ((U^{(i+1)}_{[p]})^{\scriptscriptstyle {\mathsf {T}}}V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]})_{jj}= & {} (({\mathbf {u}}^{(i+1)}_{j,[p]})^{\scriptscriptstyle {\mathsf {T}}}{\mathbf {v}}^{(i+1)}_{j,[p]})\lambda ^{i}_{j,[p]} =(({\mathbf {u}}^{(i+1)}_{j,[p]})^{\scriptscriptstyle {\mathsf {T}}}{\mathcal {A}}\tau _{i+1}{\mathbf {x}}^{(i+1)}_{j,[p]}) \lambda ^{i}_{j,[p]}\\= & {} {\mathcal {A}}\tau ({\mathbf {x}}^{(i+2)}_{j,[p]})\lambda ^{i}_{j,[p]} =\lambda ^{i+1}_{j,[p]}\lambda ^{i}_{j,[p]}, \end{aligned}$$

    and similarly \(((U^{(i+1)}_{[p-1]})^{\scriptscriptstyle {\mathsf {T}}}V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]})_{jj} = \lambda ^{i}_{j,[p]}\lambda ^{i}_{j,[p]}\). Hence by Lemma 16, we obtain

    $$\begin{aligned} \sum _{j=1}^r\lambda ^{i}_{j,[p]}(\lambda ^{i+1}_{j,[p]}-\lambda ^{i}_{j,[p]})&={\text {Tr}}((U^{(i+1)}_{[p]})^{\scriptscriptstyle {\mathsf {T}}}V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]}) -{\text {Tr}}((U^{(i+1)}_{[p-1]})^{\scriptscriptstyle {\mathsf {T}}}V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]})\nonumber \\&=\frac{1}{2}\big \Vert (U^{(i+1)}_{[p]}-U^{(i+1)}_{[p-1]})\sqrt{S^{(i+1)}_{[p]}}\big \Vert _F^2\nonumber \\&\ge \frac{\epsilon }{2}\Vert U^{(i+1)}_{[p]}-U^{(i+1)}_{[p-1]}\Vert _F^2\nonumber \\&\ge 0. \end{aligned}$$
    (86)
  2. If \(\sigma ^{(i+1)}_{r,[p]}<\epsilon \), we consider the following matrix optimization problem

    $$\begin{aligned} \begin{array}{rl} \max & \langle V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]},U\rangle -\frac{\epsilon }{2}\Vert U-U^{(i+1)}_{[p-1]}\Vert _F^2\\ \text {s.t.}& U\in V(r,n_{i+1}). \end{array} \end{aligned}$$
    (87)

    Since \(U,U^{(i+1)}_{[p-1]} \in {V(r,n_{i+1})}\), we must have

    $$\begin{aligned} \frac{\epsilon }{2}\Vert U- U^{(i+1)}_{[p-1]}\Vert _F^2=\epsilon r-\epsilon \langle U^{(i+1)}_{[p-1]},U\rangle . \end{aligned}$$

    Thus, by Lemma 15, a global maximizer of (87) is given by a polar orthonormal factor matrix of \(V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]}+\epsilon U^{(i+1)}_{[p-1]}\). By Substep 2 of Algorithm 1, \(U^{(i+1)}_{[p]}\) is exactly such a polar orthonormal factor matrix, and hence a global maximizer of (87). By this optimality, we have

    $$\begin{aligned} \langle V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]},U^{(i+1)}_{[p]}\rangle -\frac{\epsilon }{2}\Vert U^{(i+1)}_{[p]}-U^{(i+1)}_{[p-1]}\Vert _F^2\ge \langle V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]},U^{(i+1)}_{[p-1]}\rangle . \end{aligned}$$

    Therefore, the inequality (86) in case (1) also holds in this case.

Consequently, we have

$$\begin{aligned} 0\le \sum _{j=1}^r\lambda ^{i}_{j,[p]}(\lambda ^{i+1}_{j,[p]}-\lambda ^{i}_{j,[p]})=\sum _{j=1}^r\lambda ^{i+1}_{j,[p]}\lambda ^{i}_{j,[p]}-\sum _{j=1}^r(\lambda ^{i}_{j,[p]})^2, \end{aligned}$$
(88)

which together with the Cauchy-Schwarz inequality implies that

$$\begin{aligned} \big (\sum _{j=1}^r(\lambda ^{i}_{j,[p]})^2\big )^2\le \big (\sum _{j=1}^r\lambda ^{i}_{j,[p]}\lambda ^{i+1}_{j,[p]} \big )^2\le \sum _{j=1}^r(\lambda ^{i}_{j,[p]})^2\sum _{j=1}^r(\lambda ^{i+1}_{j,[p]})^2. \end{aligned}$$
(89)

Since \(f({\mathbb {U}}_{[0]})>0\), we conclude that \(f({\mathbb {U}}_{0,[1]})=\sum _{j=1}^r(\lambda ^{0}_{j,[1]})^2>0\) and hence \(\sum _{j=1}^r\lambda ^{1}_{j,[1]}\lambda ^{0}_{j,[1]}>0\) by (88). Thus, we conclude that

$$\begin{aligned} \sum _{j=1}^r\lambda ^{1}_{j,[1]}\lambda ^{0}_{j,[1]} \le \sum _{j=1}^r(\lambda ^{1}_{j,[1]})^2\frac{\sum _{j=1}^r(\lambda ^{0}_{j,[1]})^2}{\sum _{j=1}^r\lambda ^{1}_{j,[1]}\lambda ^{0}_{j,[1]}}\le \sum _{j=1}^r(\lambda ^{1}_{j,[1]})^2, \end{aligned}$$
(90)

where the first inequality follows from (89) and the second from (88). Combining (90) with (85) and (86), we obtain (44) for \(i=0\) and \(p=1\).

On the one hand, from (89) we obtain

$$\begin{aligned} 0 \le \sum _{j=1}^r (\lambda _{j,[p]}^i)^2 (\sum _{j=1}^r (\lambda _{j,[p]}^{i+1})^2 - \sum _{j=1}^r (\lambda _{j,[p]}^i)^2). \end{aligned}$$

Since \(\sum _{j=1}^r (\lambda _{j,[p]}^i)^2 > 0\) if there is no truncation, we must have

$$\begin{aligned} 0 \le \sum _{j=1}^r (\lambda _{j,[p]}^{i+1})^2 - \sum _{j=1}^r (\lambda _{j,[p]}^i)^2 = f({\mathbb {U}}_{i+1,[p]})-f({\mathbb {U}}_{i,[p]}), \end{aligned}$$

i.e., the objective function f is monotonically increasing during the APD iteration as long as there is no truncation. On the other hand, there are at most r truncations, and hence the total loss of f due to truncation is at most \(r\kappa ^2<f({\mathbb {U}}_{[0]})\). Therefore, f remains positive and, according to (88), we may conclude that \(\sum _{j=1}^r\lambda ^{i+1}_{j,[p]}\lambda ^{i}_{j,[p]}>0\) along the iterations. By induction on p, we obtain

$$\begin{aligned} \sum _{j=1}^r\lambda ^{i+1}_{j,[p]}\lambda ^{i}_{j,[p]} \le \sum _{j=1}^r(\lambda ^{i+1}_{j,[p]})^2\frac{\sum _{j=1}^r(\lambda ^{i}_{j,[p]})^2}{\sum _{j=1}^r\lambda ^{i+1}_{j,[p]}\lambda ^{i}_{j,[p]}}\le \sum _{j=1}^r(\lambda ^{i+1}_{j,[p]})^2, \end{aligned}$$

which, together with (85) and (86), implies (44) for an arbitrary nonnegative integer p. \(\square \)
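
The key step in case 2 above, namely that the proximal subproblem (87) is solved by a polar decomposition, can also be verified numerically. Below is a sketch (ours), where M is stand-in data playing the role of \(V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]}\) and the dimensions are arbitrary.

```python
# Check for (87): over V(r, n), U -> <M, U> - (eps/2)*||U - U_prev||_F^2 is
# maximized by the polar orthonormal factor of M + eps*U_prev (cf. Lemma 15).
import numpy as np

rng = np.random.default_rng(7)
n, r, eps = 6, 3, 0.1
M = rng.standard_normal((n, r))
U_prev = np.linalg.qr(rng.standard_normal((n, r)))[0]

Wf, _, Vt = np.linalg.svd(M + eps * U_prev, full_matrices=False)
U_star = Wf @ Vt

def objective(U):
    return np.trace(M.T @ U) - eps / 2 * np.linalg.norm(U - U_prev)**2

for _ in range(200):
    Q = np.linalg.qr(rng.standard_normal((n, r)))[0]
    assert objective(U_star) >= objective(Q) - 1e-12
```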

C.2 Proof of Lemma 7

Proof

The set of subgradients of h can be written as the following product:

$$\begin{aligned} \partial h({\mathbb {U}})=(-\nabla _1 f({\mathbb {U}})+\partial \delta _{V(r,n_1)}(U^{(1)}))\times \dots \times (-\nabla _k f({\mathbb {U}})+\partial \delta _{V(r,n_k)}(U^{(k)})). \end{aligned}$$
(91)

Following the notation of Algorithm 1, we set

$$\begin{aligned} {\mathbf {x}}_{j}:=({\mathbf {u}}^{(1)}_{j,[p+1]},\dots ,{\mathbf {u}}^{(k)}_{j,[p+1]})\ \text {for all }j=1,\dots ,r, \end{aligned}$$

where \({\mathbf {u}}^{(i)}_{j,[p+1]}\) is the j-th column of the matrix \(U^{(i)}_{[p+1]}\) for all \(i\in \{1,\dots ,k\}\),

$$\begin{aligned} V^{(i)}:=\begin{bmatrix}{\mathbf {v}}^{(i)}_1&\dots&{\mathbf {v}}^{(i)}_r\end{bmatrix}\ \text {with }{\mathbf {v}}^{(i)}_j:={\mathcal {A}}\tau _i({\mathbf {x}}_{j})\ \text {for all }j=1,\dots ,r, \end{aligned}$$
(92)

for all \(i\in \{1,\dots ,k\}\) and

$$\begin{aligned} \varLambda :={\text {diag}}(\lambda _1,\dots ,\lambda _r)\ \text {with }\lambda _j:={\mathcal {A}}\tau ({\mathbf {x}}_{j}). \end{aligned}$$
(93)

By (39) and (87), we have

$$\begin{aligned} V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha U^{(i)}_{[p]}=U^{(i)}_{[p+1]}S^{(i)}_{[p+1]}, \end{aligned}$$
(94)

where \(\alpha \in \{0,\epsilon \}\), depending on whether or not a proximal correction is performed. According to (69) and (94), we have

$$\begin{aligned} -U^{(i)}_{[p+1]} \in \partial \delta _{V(r,n_i)}(U^{(i)}_{[p+1]}), \quad V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha U^{(i)}_{[p]} \in \partial \delta _{V(r,n_i)}(U^{(i)}_{[p+1]}), \end{aligned}$$

which implies that \(V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha \big (U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\big )\in \partial \delta _{V(r,n_i)}(U^{(i)}_{[p+1]})\). If we take

$$\begin{aligned} W^{(i)}_{[p+1]}:=-2V^{(i)}\varLambda +2V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+2\alpha \big (U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\big ), \end{aligned}$$
(95)

then we have

$$\begin{aligned} W^{(i)}_{[p+1]}\in -2V^{(i)}\varLambda +\partial \delta _{V(r,n_i)}(U^{(i)}_{[p+1]}). \end{aligned}$$

On the other hand,

$$\begin{aligned} \frac{1}{2}\Vert W^{(i)}_{[p+1]}\Vert _F&=\Vert V^{(i)}\varLambda -V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}-\alpha \big (U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\big )\Vert _F\\&\le \Vert V^{(i)}\varLambda -V^{(i)}_{[p+1]}\varLambda \Vert _F+\Vert V^{(i)}_{[p+1]}\varLambda -V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}\Vert _F+\alpha \Vert U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\Vert _F\\&\le \Vert V^{(i)}-V^{(i)}_{[p+1]}\Vert _F\Vert \varLambda \Vert _F+\Vert V^{(i)}_{[p+1]}\Vert _F\Vert \varLambda -\varLambda ^{(i)}_{[p+1]}\Vert _F+\alpha \Vert U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\Vert _F\\&\le {\Vert {\mathcal {A}}\Vert }\Vert \varLambda \Vert _F\big (\sum _{j=1}^r\Vert \tau _i({\mathbf {x}}_j)-\tau _i({\mathbf {x}}^i_{j,[p+1]})\Vert \big )\\&\quad +\Vert V^{(i)}_{[p+1]}\Vert _F\Vert {\mathcal {A}}\Vert \big (\sum _{j=1}^r\Vert \tau ({\mathbf {x}}_j)-\tau ({\mathbf {x}}^i_{j,[p+1]})\Vert \big )+\alpha \Vert U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\Vert _F\\&\le \sqrt{r}\Vert {\mathcal {A}}\Vert ^2\big (\sum _{j=1}^r\sum _{s=i+1}^{k}\Vert {\mathbf {u}}^{(s)}_{j,[p+1]}-{\mathbf {u}}^{(s)}_{j,[p]}\Vert \big )\\&\quad +\sqrt{r}\Vert {\mathcal {A}}\Vert ^2\big (\sum _{j=1}^r\sum _{s=i}^{k}\Vert {\mathbf {u}}^{(s)}_{j,[p+1]}-{\mathbf {u}}^{(s)}_{j,[p]}\Vert \big )+\alpha \Vert U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\Vert _F\\&\le (2r\sqrt{k}\Vert {\mathcal {A}}\Vert ^2 +\epsilon )\Vert {\mathbb {U}}_{[p+1]}-{\mathbb {U}}_{[p]}\Vert _F, \end{aligned}$$

where the third inequality follows from the fact that

$$\begin{aligned} V^{(i)}-V^{(i)}_{[p+1]}=\begin{bmatrix}{\mathcal {A}}(\tau _i({\mathbf {x}}_1)-\tau _i({\mathbf {x}}^i_{1,[p+1]}))&\dots&{\mathcal {A}}(\tau _i({\mathbf {x}}_r)-\tau _i({\mathbf {x}}^i_{r,[p+1]}))\end{bmatrix}, \end{aligned}$$

and a similar formula for \(\varLambda -\varLambda ^{(i)}_{[p+1]}\), the fourth follows from the fact that

$$\begin{aligned} |{\mathcal {A}}\tau ({\mathbf {x}})|\le \Vert {\mathcal {A}}\Vert \end{aligned}$$

for any vector \({\mathbf {x}}:=({\mathbf {x}}_1,\dots ,{\mathbf {x}}_k)\) with \(\Vert {\mathbf {x}}_i\Vert =1\) for all \(i=1,\dots ,k\) and the last one follows from \(\alpha \le \epsilon \). This, together with (91), implies (48). \(\square \)

C.3 Proof of Lemma 9

Proof

We let

$$\begin{aligned} W^{(i)}_{[p+1]}:=V^{(i)}\varLambda -V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}-\alpha \big (U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\big ), \end{aligned}$$

where \(\alpha \in \{0,\epsilon \}\), depending on whether or not a proximal correction is performed (cf. the proof of Lemma 7). It follows from Lemma 7 that

$$\begin{aligned} {\Vert W^{(i)}_{[p+1]}\Vert _F}\le \gamma _0 \Vert {\mathbb {U}}_{[p]}-{\mathbb {U}}_{[p+1]}\Vert _F \end{aligned}$$

for some constant \(\gamma _0>0\). By Algorithm 1, we have

$$\begin{aligned} V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha U^{(i)}_{[p]}=U^{(i)}_{[p+1]}S^{(i)}_{[p+1]} \end{aligned}$$
(96)

where \(S^{(i)}_{[p+1]}\) is a symmetric positive semidefinite matrix. Since \(U^{(i)}_{[p+1]}\in V(r,n_i)\) is an orthonormal matrix, we have

$$\begin{aligned} S^{(i)}_{[p+1]}=(U^{(i)}_{[p+1]})^{\scriptscriptstyle {\mathsf {T}}}\big (V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha U^{(i)}_{[p]}\big )=\big (V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha U^{(i)}_{[p]}\big )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}, \end{aligned}$$
(97)

where the second equality follows from the symmetry of the matrix \(S^{(i)}_{[p+1]}\).

Consequently, we have

$$\begin{aligned}&\frac{1}{2}\Vert \nabla _i f({\mathbb {U}}_{[p+1]})-U^{(i)}_{[p+1]}(\nabla _i f({\mathbb {U}}_{[p+1]}))^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F\nonumber \\&\quad =\Vert V^{(i)}\varLambda -U^{(i)}_{[p+1]}(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F\nonumber \\&\quad =\Vert W^{(i)}_{[p+1]}+V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha \big (U^{(i)}_{[p]} -U^{(i)}_{[p+1]}\big )-U^{(i)}_{[p+1]}(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F\nonumber \\&\quad \le \Vert W^{(i)}_{[p+1]}\Vert _F+\alpha {\Vert U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\Vert _F}+\Vert V^{(i)}_{[p+1]} \varLambda ^{(i)}_{[p+1]}-U^{(i)}_{[p+1]}(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F\nonumber \\&\quad \le \gamma _1 \Vert {\mathbb {U}}_{[p]}-{\mathbb {U}}_{[p+1]}\Vert _F+\Vert V^{(i)}_{[p+1]} \varLambda ^{(i)}_{[p+1]}-U^{(i)}_{[p+1]}(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F, \end{aligned}$$
(98)

where \(\gamma _1 = \gamma _0+\epsilon \). Next, we derive an estimate for the second summand on the right hand side of (98). To do this, we notice that

$$\begin{aligned}&\Vert V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}-U^{(i)}_{[p+1]}(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F\nonumber \\&\quad =\Vert U^{(i)}_{[p+1]}S^{(i)}_{[p+1]}-\alpha U^{(i)}_{[p]} -U^{(i)}_{[p+1]}(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F\nonumber \\&\quad =\Vert U^{(i)}_{[p+1]}\big (V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha U^{(i)}_{[p]}\big )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}-\alpha U^{(i)}_{[p]}-U^{(i)}_{[p+1]}(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F\nonumber \\&\quad \le \Vert U^{(i)}_{[p+1]}\big ((V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]})^{\scriptscriptstyle {\mathsf {T}}}-(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}\big )U^{(i)}_{[p+1]}\Vert _F+\alpha \Vert U^{(i)}_{[p+1]}(U^{(i)}_{[p]})^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}-U^{(i)}_{[p]}\Vert _F\nonumber \\&\quad \le \Vert V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}-V^{(i)}\varLambda \Vert _F+\alpha \Vert U^{(i)}_{[p+1]}(U^{(i)}_{[p]})^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}-U^{(i)}_{[p]}\Vert _F\nonumber \\&\quad \le \gamma _2 \Vert {\mathbb {U}}_{[p+1]}-{\mathbb {U}}_{[p]}\Vert _F, \end{aligned}$$
(99)

where \(\gamma _2>0\) is some constant, the first equality follows from (96), the second from (97), the second inequality (see footnote 5) follows from the fact that \(U^{(i)}_{[p+1]}\in V(r,n_i)\), and the last inequality from the relation

$$\begin{aligned} \Vert (U^{(i)}_{[p]})^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}-I\Vert _F\le \Vert U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\Vert _F, \end{aligned}$$

which is obtained by Lemma 19. The desired inequality can be derived easily from (98) and (99). \(\square \)

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Hu, S., Ye, K. Linear convergence of an alternating polar decomposition method for low rank orthogonal tensor approximations. Math. Program. 199, 1305–1364 (2023). https://doi.org/10.1007/s10107-022-01867-8
