
Linear convergence of an alternating polar decomposition method for low rank orthogonal tensor approximations

  • Full Length Paper
  • Series A

Mathematical Programming

Abstract

Low rank orthogonal tensor approximation (LROTA) is an important problem in tensor computations and their applications. A classical and widely used algorithm is the alternating polar decomposition method (APD). In this paper, an improved version of the classical APD, named iAPD, is proposed. For the first time, all of the following four fundamental properties are established for iAPD: (i) the algorithm converges globally and the whole sequence converges to a KKT point without any assumption; (ii) it exhibits an overall sublinear convergence with an explicit rate, which is sharper than the usual O(1/k) rate of first order methods in optimization; (iii) more importantly, it converges R-linearly for a generic tensor without any assumption; (iv) for almost all LROTA problems, iAPD reduces to APD after finitely many iterations if it converges to a local minimizer.


Notes

  1. Yang’s paper was posted during the final preparation of this paper. Both works employ the proximal technique; a difference is that the proximal correction in our algorithm is adaptive, and a theoretical investigation of its execution is also given (cf. Sect. 5).

  2. An algebraic variety here means a set of common roots of some polynomials. In other words, an algebraic variety is a closed subset of some \({\mathbb {R}}^N\) with respect to the Zariski topology.

  3. This is due to the fact that \({\text {dim}}(V(n-1,n))={\text {dim}}(V(n,n))\).

  4. In the general definition of the KL property, \(\partial p({\mathbf {x}})\) is actually the limiting subdifferential of p at \(\mathbf {x}\). However, for the functions considered in this paper, the two notions of subdifferential coincide, so we do not distinguish them.

  5. Since U is orthonormal, we must have

    $$\begin{aligned} \Vert UAU\Vert _F^2=\Vert AU\Vert _F^2=\langle AUU^{\scriptscriptstyle {\mathsf {T}}},A\rangle \le \Vert A\Vert _F^2. \end{aligned}$$

References

  1. Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, USA (2008)

  2. Anandkumar, A., Ge, R., Hsu, D., Kakade, S.M., Telgarsky, M.: Tensor decompositions for learning latent variable models. J. Mach. Learn. Res. 15, 2773–2832 (2014)

  3. Anderson, E., Bai, Z.Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide, 3rd edn. SIAM, Philadelphia, PA (1999)

  4. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35, 438–457 (2010)

  5. Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program. 137(1–2), 91–129 (2013)

  6. Beck, A.: First-Order Methods in Optimization. MOS-SIAM Series on Optimization, SIAM (2017)

  7. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont, USA (1999)

  8. Bochnak, J., Coste, M., Roy, M.-F.: Real Algebraic Geometry. Ergebnisse der Mathematik und ihrer Grenzgebiete, vol. 36. Springer, Berlin (1998)

  9. Bolte, J., Daniilidis, A., Lewis, A.S., Shiota, M.: Clarke subgradients of stratifiable manifolds. SIAM J. Optim. 18(2), 556–572 (2007)

  10. Bolte, J., Daniilidis, A., Ley, O., Mazet, L.: Characterizations of Łojasiewicz inequalities and applications: subgradient flows, talweg, convexity. Trans. Am. Math. Soc. 362, 3319–3363 (2010)

  11. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)

  12. Boralevi, A., Draisma, J., Horobet, E., Robeva, E.: Orthogonal and unitary tensor decomposition from an algebraic perspective. Isr. J. Math. 222, 223–260 (2017)

  13. Chen, J., Saad, Y.: On the tensor SVD and the optimal low rank orthogonal approximation of tensors. SIAM J. Matrix Anal. Appl. 30, 1709–1734 (2009)

  14. Comon, P.: Independent component analysis, a new concept? Signal Process. 36, 287–314 (1994)

  15. Comon, P.: Tensors: a brief introduction. IEEE Signal Proc. Mag. 31, 44–53 (2014)

  16. Comon, P., Jutten, C.: Handbook of Blind Source Separation. Academic Press, Oxford (2010)

  17. D’ Acunto, D., Kurdyka, K.: Explicit bounds for the Łojasiewicz exponent in the gradient inequality for polynomials. Ann. Polon. Math. 87, 51–61 (2005)

  18. De Lathauwer, L., De Moor, B., Vandewalle, J.: On the best rank-1 and rank-\((r_1, r_2,\ldots, r_N)\) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl. 21, 1324–1342 (2000)

  19. De Silva, V., Lim, L.-H.: Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM J. Matrix Anal. Appl. 29, 1084–1127 (2008)

  20. do Carmo, M.P.: Riemannian Geometry. Birkhäuser, Boston (1992)

  21. Eckart, C., Young, G.: The approximation of one matrix by another of lower rank. Psychometrika 1, 211–218 (1936)

  22. Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM. J. Matrix Anal. Appl. 20(2), 303–353 (1998)

  23. Franc, A.: Etude Algébrique des Multitableaux: Apports de l’Algébre Tensorielle. Thèse de Doctorat, Spécialité Statistiques, Univ. de Montpellier II, Montpellier, France (1992)

  24. Golub, G.H., Van Loan, C.F.: Matrix Computations, 4th edn. Johns Hopkins University Press, Baltimore, MD (2013)

  25. Grasedyck, L., Kressner, D., Tobler, C.: A literature survey of low-rank tensor approximation techniques. GAMM-Mitt. 36, 53–78 (2013)

  26. Guan, Y., Chu, D.: Numerical computation for orthogonal low-rank approximation of tensors. SIAM J. Matrix Anal. Appl. 40, 1047–1065 (2019)

  27. Hackbusch, W.: Tensor Spaces and Numerical Tensor Calculus. Springer, Berlin (2012)

  28. Hartshorne, R.: Algebraic Geometry. Graduate Texts in Mathematics 52. Springer, New York (1977)

  29. Higham, N.J.: Computing the polar decomposition-with applications. SIAM J. Sci. Statist. Comput. 7, 1160–1174 (1986)

  30. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, New York (1985)

  31. Hu, S., Li, G.: Convergence rate analysis for the higher order power method in best rank one approximations of tensors. Numer. Math. 140, 993–1031 (2018)

  32. Ishteva, M., Absil, P.-A., Van Dooren, P.: Jacobi algorithm for the best low multilinear rank approximation of symmetric tensors. SIAM J. Matrix Anal. Appl. 34, 651–672 (2013)

  33. Knyazev, A.V., Argentati, M.E.: Majorization for changes in angles between subspaces, Ritz values, and graph Laplacian spectra. SIAM J. Matrix Anal. Appl. 29, 15–32 (2006)

  34. Kolda, T.G.: Orthogonal tensor decompositions. SIAM J. Matrix Anal. Appl. 23, 243–255 (2001)

  35. Kolda, T.G.: A counterexample to the possibility of an extension of the Eckart-Young low rank approximation theorem for the orthogonal rank tensor decomposition. SIAM J. Matrix Anal. Appl. 24, 762–767 (2002)

  36. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51, 455–500 (2009)

  37. Kruskal, J.B.: Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebr. Appl. 18, 95–138 (1977)

  38. Kurdyka, K.: On gradients of functions definable in o-minimal structures. Annales de l’Institut Fourier 48(3), 769–783 (1998)

  39. Landsberg, J.M.: Tensors: Geometry and Applications. AMS, Providence, RI (2012)

  40. Lee, J.D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M.I., Recht, B.: First-order methods almost always avoid strict saddle points. Math. Program. 176, 311–337 (2019)

  41. Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. Proc. Mach. Learn. Res. 49, 1–12 (2016)

  42. Li, G., Pong, T.K.: Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. 18, 1199–1232 (2018)

  43. Li, J., Usevich, K., Comon, P.: Globally convergent Jacobi-type algorithms for simultaneous orthogonal symmetric tensor diagonalization. SIAM J. Matrix Anal. Appl. 39, 1–22 (2018)

  44. Li, J., Usevich, K., Comon, P.: On approximate diagonalization of third order symmetric tensors by orthogonal transformations. Linear Algebr. Appl. 576, 324–351 (2019)

  45. Li, J., Usevich, K., Comon, P.: Jacobi-type algorithm for low rank orthogonal approximation of symmetric tensors and its convergence analysis. arXiv:1911.00659

  46. Lim, L.-H.: Tensors and hypermatrices. In: Hogben, L. (ed.) Handbook of Linear Algebra, 2nd edn, Chapter 15. CRC Press, Boca Raton (2013)

  47. Liu, H., Wu, W., So, A.M.-C.: Quadratic optimization with orthogonality constraints: explicit Łojasiewicz exponent and linear convergence of line-search methods. In: Proceedings of the 33rd International Conference on Machine Learning, New York, USA (2016)

  48. Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels, Les Équations aux Dérivées Partielles. Éditions du centre National de la Recherche Scientifique, Paris, pp. 87–89 (1963)

  49. Martin, C.D.M., Van Loan, C.F.: A Jacobi-type method for computing orthogonal tensor decompositions. SIAM J. Matrix Anal. Appl. 30, 1219–1232 (2008)

  50. McCullagh, P.: Tensor Methods in Statistics. Chapman and Hall, London (1987)

  51. Milnor, J.: Morse Theory. Annals Math. Studies, 51. Princeton Univ. Press, Princeton, NJ (1963)

  52. Mohlenkamp, M.J.: Musings on multilinear fitting. Linear Algebr. Appl. 438, 834–852 (2013)

  53. Mu, C., Hsu, D., Goldfarb, D.: Successive rank-one approximations for nearly orthogonally decomposable symmetric tensors. SIAM J. Matrix Anal. Appl. 36, 1638–1659 (2015)

  54. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, London (2004)

  55. Qi, L., Luo, Z.: Tensor Analysis: Spectral Theory and Special Tensors. SIAM (2017)

  56. Rockafellar, R.T., Wets, R.: Variational Analysis. Grundlehren der Mathematischen Wissenschaften, vol. 317. Springer, Berlin (1998)

  57. Shafarevich, I.R.: Basic Algebraic Geometry. Springer-Verlag, Berlin (1977)

  58. Sørensen, M., De Lathauwer, L., Comon, P., Icart, S., Deneire, L.: Canonical polyadic decomposition with a columnwise orthonormal factor matrix. SIAM J. Matrix Anal. Appl. 33, 1190–1213 (2012)

  59. Sun, J., Chen, C.: Generalized polar decomposition. Math. Numer. Sinica 11, 262–273 (1989)

  60. Uschmajew, A.: Local convergence of the alternating least squares algorithm for canonical tensor approximation. SIAM J. Matrix Anal. Appl. 33, 639–652 (2012)

  61. Uschmajew, A.: A new convergence proof for the high-order power method and generalizations. Pacific J. Optim. 11, 309–321 (2015)

  62. Usevich, K., Li, J., Comon, P.: Approximate matrix and tensor diagonalization by unitary transformations: convergence of Jacobi-type algorithms. SIAM J. Optim. 30, 2998–3028 (2020)

  63. Wang, L., Chu, M.T.: On the global convergence of the alternating least squares method for rank-one tensor approximation. SIAM J. Matrix Anal. Appl. 35(3), 1058–1072 (2014)

  64. Wang, L., Chu, M.T., Yu, B.: Orthogonal low rank tensor approximation: alternating least squares method and its global convergence. SIAM J. Matrix Anal. Appl. 36, 1–19 (2015)

  65. Yang, Y.: The epsilon-alternating least squares for orthogonal low-rank tensor approximation and its global convergence. SIAM J. Matrix Anal. Appl. 41, 1797–1825 (2020)

  66. Zhang, T., Golub, G.H.: Rank-one approximation to high order tensors. SIAM J. Matrix Anal. Appl. 23, 534–550 (2001)


Acknowledgements

We would like to thank the anonymous associate editor and the two referees for their thoughtful suggestions and helpful comments on our manuscript.

Author information

Correspondence to Ke Ye.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

We declare that no part of this paper has been published or submitted elsewhere.

This work is partially supported by the National Natural Science Foundation of China (Grant No. 11771328). The first author is also partially supported by the National Natural Science Foundation of China (Grant No. 12171128), the Young Elite Scientists Sponsorship Program by Tianjin, and the Natural Science Foundation of Zhejiang Province, China (Grant Nos. LD19A010002 and LY22A010022). The second author is also partially supported by the National Natural Science Foundation of China (Grant Nos. 11801548 and 11688101), the National Key R&D Program of China (Grant Nos. 2018YFA0306702 and 2020YFA0712300) and the CAS Project for Young Scientists in Basic Research (Grant No. YSBR-008).

Appendices

A Preliminaries

This section consists of preliminaries on differential geometry and optimization theory that are needed to follow the details of the proofs in this paper. More precisely, we discuss Stiefel manifolds and their basic properties in Appendix A.1; Morse functions and the Kurdyka-Łojasiewicz property are introduced in Appendices A.2 and A.3, respectively.

A.1 Stiefel manifolds

Let \(m\le n\) be two positive integers and let \(V(m,n)\subseteq {\mathbb {R}}^{n \times m}\) be the set of all \(n \times m\) orthonormal matrices, i.e.,

$$\begin{aligned} V(m,n):=\{U\in {\mathbb {R}}^{n \times m}:U^{\scriptscriptstyle {\mathsf {T}}}U=I\}, \end{aligned}$$

where I is the identity matrix of matching size. Indeed, \(V(m,n)\) admits a smooth manifold structure and is called the Stiefel manifold of orthonormal m-frames in \({\mathbb {R}}^n\). In particular, if \(m = n\), then \(V(n,n)\) simply reduces to the orthogonal group \({{\,\mathrm{O}\,}}(n)\).

In general, if M is a Riemannian submanifold of \({\mathbb {R}}^k\), then the normal space \({\text {N}}_M({\mathbf {x}})\) of M at a point \({\mathbf {x}}\in M\) is defined to be the orthogonal complement of its tangent space \({\text {T}}_M({\mathbf {x}})\) in \({\mathbb {R}}^k\), i.e., \({\text {N}}_M ({\mathbf {x}}) {:}{=}{\text {T}}_M({\mathbf {x}})^{\perp }\). By definition, \(V(m,n)\) is naturally a Riemannian submanifold of \({\mathbb {R}}^{n \times m}\) and hence its normal space is well defined. Indeed, it follows from [56, Chapter 6.C] and [1, 22] that

$$\begin{aligned} {\text {N}}_{V(m,n)}(A)=\{AX: X\in {\text {S}}^{m \times m}\}, \end{aligned}$$
(65)

where \({\text {S}}^{m\times m}\subseteq {\mathbb {R}}^{m\times m}\) is the subspace of \(m \times m\) symmetric matrices.

Given a matrix \(B\in {\mathbb {R}}^{n \times m}\), the projection of B onto the normal space of \(V(m,n)\) at a point \(A\in V(m,n)\) is

$$\begin{aligned} \pi _{{\text {N}}_{V(m,n)}(A)}(B)=A \left( \frac{A^{\scriptscriptstyle {\mathsf {T}}}B+B^{\scriptscriptstyle {\mathsf {T}}}A}{2} \right) . \end{aligned}$$

Likewise, given a matrix \(B\in {\mathbb {R}}^{n \times m}\), the projection of B onto the tangent space of \(V(m,n)\) at A is given by

$$\begin{aligned} \pi _{{\text {T}}_{V(m,n)}(A)}(B)=A{\text {skew}}(A^{\scriptscriptstyle {\mathsf {T}}}B)+(I-AA^{\scriptscriptstyle {\mathsf {T}}})B, \end{aligned}$$
(66)

where \({\text {skew}}(C) {:}{=}\frac{C-C^{\scriptscriptstyle {\mathsf {T}}}}{2} \) for a square matrix \(C\in {\mathbb {R}}^{m \times m}\). A more explicit formula is given as

$$\begin{aligned} \pi _{{\text {T}}_{V(m,n)}(A)}(B)= (I-\frac{1}{2}AA^{\scriptscriptstyle {\mathsf {T}}})(B-AB^{\scriptscriptstyle {\mathsf {T}}}A). \end{aligned}$$
(67)
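
As a quick sanity check, formulas (65)-(67) are straightforward to verify numerically. Below is a minimal NumPy sketch (ours, not from the paper; the helper names proj_normal and proj_tangent and the dimensions are ours): it confirms that (66) and (67) agree and that the two projections decompose an arbitrary matrix B.

```python
import numpy as np

def proj_normal(A, B):
    # Projection of B onto the normal space N_{V(m,n)}(A), cf. (65).
    return A @ ((A.T @ B + B.T @ A) / 2)

def proj_tangent(A, B):
    # Projection of B onto the tangent space T_{V(m,n)}(A), cf. (66).
    skew = (A.T @ B - B.T @ A) / 2
    return A @ skew + (np.eye(A.shape[0]) - A @ A.T) @ B

rng = np.random.default_rng(0)
n, m = 7, 3
A = np.linalg.qr(rng.standard_normal((n, m)))[0]  # a point on V(m,n)
B = rng.standard_normal((n, m))

alt = (np.eye(n) - 0.5 * A @ A.T) @ (B - A @ B.T @ A)  # formula (67)
assert np.allclose(proj_tangent(A, B), alt)
assert np.allclose(proj_tangent(A, B) + proj_normal(A, B), B)
```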

Given a function \(f : {\mathbb {R}}^k\rightarrow {\mathbb {R}}\cup \{\infty \}\), the subdifferential of f at \({\mathbf {x}}\in {\mathbb {R}}^k\) is defined as (cf. [56])

$$\begin{aligned} \partial f({\mathbf {x}}) {:}{=}\Bigg \{{\mathbf {v}}\in {\mathbb {R}}^k:\liminf _{{\mathbf {x}}\ne {\mathbf {y}}\rightarrow {\mathbf {x}}}\frac{f({\mathbf {y}})-f({\mathbf {x}})-\langle {\mathbf {v}},{\mathbf {y}}-{\mathbf {x}}\rangle }{\Vert {\mathbf {y}}-{\mathbf {x}}\Vert }\ge 0\Bigg \}. \end{aligned}$$

If \({\mathbf {0}}\in \partial f({\mathbf {x}})\), then \({\mathbf {x}}\) is a critical point of f.

The indicator function \(\delta _{V(m,n)}:{\mathbb {R}}^{n\times m} \rightarrow {\mathbb {R}} \cup \{\infty \}\) of \(V(m,n)\) is defined by

$$\begin{aligned} \delta _{V(m,n)}(X) {:}{=}{\left\{ \begin{array}{ll}0&\text {if }X\in V(m,n),\\ +\infty &\text {otherwise}.\end{array}\right. } \end{aligned}$$
(68)

An important fact about the normal space \({\text {N}}_{V(m,n)}(A)\) and the subdifferential of the indicator function \(\delta _{V(m,n)}\) of \(V(m,n)\) at A is (cf. [56])

$$\begin{aligned} \partial \delta _{V(m,n)}(A)={\text {N}}_{V(m,n)}(A). \end{aligned}$$
(69)

A.2 Morse functions

In the following, we introduce the notion of Morse functions and recall some of their basic properties. On a smooth manifold M, a smooth function \(f : M\rightarrow {\mathbb {R}}\) is called a Morse function if each critical point of f on M is nondegenerate, i.e., the Hessian of f at each critical point is non-singular. The following result is well known; see, for example, [51, Theorem 6.6].

Lemma 11

(Projection is Generically Morse) Let M be a submanifold of \({\mathbb {R}}^n\). For a generic \({\mathbf {a}} = (a_1,\dots ,a_n)^{\scriptscriptstyle {\mathsf {T}}}\in {\mathbb {R}}^n\), the Euclidean distance function

$$\begin{aligned} f({\mathbf {x}})= \Vert {\mathbf {x}}-{\mathbf {a}}\Vert ^2 \end{aligned}$$

is a Morse function on M.

We also need the following property (cf. [51, Corollary 2.3]) of nondegenerate critical points in the sequel.

Lemma 12

Let M be a manifold and let \(f:M \rightarrow {\mathbb {R}}\) be a smooth function. Nondegenerate critical points of f are isolated.

To conclude this subsection, we briefly discuss how critical points behave under a local diffeomorphism. For this purpose, we recall that a smooth manifold \(M_1\) is called locally diffeomorphic to a smooth manifold \(M_2\) if there is a smooth map \(\varphi : M_1\rightarrow M_2\) such that for each point \({\mathbf {x}}\in M_1\) there exists a neighborhood \(U\subseteq M_1\) of \({\mathbf {x}}\) and a neighborhood \(V\subseteq M_2\) of \(\varphi ({\mathbf {x}})\) such that the restriction \(\varphi |_U : U\rightarrow V\) is a diffeomorphism [20]. In this case, the corresponding \(\varphi \) is called a local diffeomorphism from \(M_1\) to \(M_2\). It is clear from the definition that whenever \(M_1\) is locally diffeomorphic to \(M_2\) then the two manifolds must have the same dimension. Moreover, we have the following result, whose proof can be found in [31, Proposition 5.2].

Proposition 11

Let a smooth manifold \(M_1\) be locally diffeomorphic to another smooth manifold \(M_2\) and let \(\varphi : M_1\rightarrow M_2\) be the corresponding local diffeomorphism. Let \(f : M_2\rightarrow {\mathbb {R}}\) be a smooth function. Then \({\mathbf {x}}\in M_1\) is a (nondegenerate) critical point of \(f\circ \varphi \) on \(M_1\) if and only if \(\varphi ({\mathbf {x}})\) is a (nondegenerate) critical point of f on \(M_2\).

When f is a smooth function on \({\mathbb {R}}^n\) and M is a submanifold of \({\mathbb {R}}^n\), we denote by \(\nabla f\) the gradient of f as a vector field on \({\mathbb {R}}^n\), while we denote by \({\text {grad}}(f)\) the Riemannian gradient of f as a vector field on M.

A.3 Kurdyka-Łojasiewicz property

In this subsection, we review some basic facts about the Kurdyka-Łojasiewicz property, which was first established for smooth definable functions by Kurdyka [38] and later generalized to the non-smooth definable case by Bolte, Daniilidis, Lewis and Shiota [9]. Interested readers are referred to [4, 5, 10, 42] for more discussions and applications of the Kurdyka-Łojasiewicz property.

Let p be an extended real-valued function and let \(\partial p({\mathbf {x}})\) be the set of subgradients of p at \({\mathbf {x}}\) (cf. [56]). We define \({\text {dom}}(\partial p) {:}{=}\{{\mathbf {x}}:\partial p({\mathbf {x}} )\ne \emptyset \}\) and take \({\mathbf {x}}^*\in {\text {dom}}(\partial p)\). If there exist some \(\eta \in (0,+\infty ]\), a neighborhood U of \({\mathbf {x}}^*\), and a continuous concave function \(\varphi :[0,\eta )\rightarrow {\mathbb {R}}_+\), such that

  1. \(\varphi (0)=0\) and \(\varphi \) is continuously differentiable on \((0,\eta )\),

  2. for all \(s\in (0,\eta )\), \(\varphi ^{\prime }(s)>0\), and

  3. for all \({\mathbf {x}}\in U\cap \{ {\mathbf {y}}:p({\mathbf {x}}^*)<p({\mathbf {y}})<p({\mathbf {x}}^*)+\eta \}\), the Kurdyka-Łojasiewicz inequality holds

    $$\begin{aligned} \varphi ^{\prime }(p({\mathbf {x}}) - p({\mathbf {x}}^*)) {\text {dist}}({\mathbf {0}},\partial p({\mathbf {x}}))\ge 1, \end{aligned}$$

then we say that p has the Kurdyka-Łojasiewicz (abbreviated as KL) property at \({\mathbf {x}}^*\). Here \({\text {dist}}({\mathbf {0}},\partial p({\mathbf {x}}))\) denotes the distance from \({\mathbf {0}}\) to the set \(\partial p({\mathbf {x}})\) (see footnote 4). If p is proper, lower semicontinuous, and has the KL property at each point of \({\text {dom}}(\partial p)\), then p is said to be a KL function. Examples of KL functions include real subanalytic functions and semi-algebraic functions [11]; the functions involved in this paper are semi-algebraic, and we refer to [8] and the references therein for more details on such functions. In particular, polynomial functions are semi-algebraic and hence KL functions. Another important fact is that the indicator function of a semi-algebraic set is also a semi-algebraic function [8, 11]; likewise, a finite sum of semi-algebraic functions is again semi-algebraic. We assemble these facts into the following lemma, which is crucial to the analysis of the global convergence of our algorithm.

Lemma 13

A finite sum of polynomial functions and indicator functions of semi-algebraic sets is a KL function.
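
For a concrete one-dimensional illustration of the KL inequality (ours, for orientation only): the polynomial \(p(x)=x^2\) has the KL property at \(x^*=0\) with \(\eta =+\infty \) and \(\varphi (s)=2\sqrt{s}\), since for every \(x\ne 0\)

$$\begin{aligned} \varphi ^{\prime }(p(x)-p(x^*)){\text {dist}}(0,\partial p(x))=\frac{1}{\sqrt{x^2}}\cdot |2x|=2\ge 1. \end{aligned}$$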

While the KL property is used for the global convergence analysis, the Łojasiewicz inequality discussed in the rest of this subsection is used for the convergence rate analysis. The classical Łojasiewicz inequality for analytic functions is stated as follows (cf. [48]):

  • (Classical Łojasiewicz’s Gradient Inequality) If f is an analytic function with \(f({\mathbf {0}})=0\) and \(\nabla f({\mathbf {0}})={\mathbf {0}}\), then there exist positive constants \(\mu , \kappa ,\) and \(\epsilon \) such that

    $$\begin{aligned} \Vert \nabla f({\mathbf {x}})\Vert \ge \mu |f({\mathbf {x}})|^{\kappa }\; \text{ for } \text{ all } \;\Vert {\mathbf {x}}\Vert \le \epsilon . \end{aligned}$$
    (70)

As pointed out in [4, 10], it is often difficult to determine the exponent \(\kappa \) in Łojasiewicz's gradient inequality, and it is unknown for general functions. However, an estimate of the exponent \(\kappa \) in the gradient inequality was obtained by D'Acunto and Kurdyka in [17, Theorem 4.2] when f is a polynomial function. We record this fundamental result in the next lemma; it plays a key role in our sublinear convergence rate analysis.

Lemma 14

(Łojasiewicz’s Gradient Inequality for Polynomials) Let f be a real polynomial of degree d in n variables. Suppose that \(f({\mathbf {0}})=0\) and \(\nabla f({\mathbf {0}})={\mathbf {0}}\). There exist constants \(\mu , \epsilon > 0\) such that for all \(\Vert {\mathbf {x}}\Vert \le \epsilon \), we have

$$\begin{aligned} \Vert \nabla f({\mathbf {x}})\Vert \ge \mu |f({\mathbf {x}})|^{\kappa }\; \text{ with } \;\kappa = 1-\frac{1}{d(3d-3)^{n-1}}. \end{aligned}$$
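
To get a feeling for how fast this exponent approaches 1 as the degree d and the number of variables n grow, the following one-line computation (ours) evaluates the formula of Lemma 14 for a few sample values.

```python
# kappa = 1 - 1/(d(3d-3)^(n-1)) from Lemma 14, for a few degrees d and dimensions n.
for d, n in [(2, 2), (3, 3), (4, 6), (6, 10)]:
    kappa = 1 - 1 / (d * (3 * d - 3) ** (n - 1))
    print(f"d={d:2d}, n={n:2d}: kappa = {kappa:.15f}")
```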

Below is a more precise version of the classical Łojasiewicz gradient inequality (70), in which we can take the exponent \(\kappa \) to be 1/2 near a nondegenerate critical point. For the sake of subsequent analysis, it is stated for a smooth manifold [20].

Proposition 12

(Łojasiewicz’s Gradient Inequality) Let M be a smooth manifold and let \(f: M \rightarrow {\mathbb {R}}\) be a smooth function for which \({\mathbf {z}}^*\) is a nondegenerate critical point. Then there exist a neighborhood U in M of \({\mathbf {z}}^*\) and some \(\mu >0\) such that for all \({\mathbf {z}}\in {U}\)

$$\begin{aligned} \Vert {\text {grad}}(f)({\mathbf {z}})\Vert \ge \mu |f({\mathbf {z}})-f({\mathbf {z}}^*)|^{\frac{1}{2}}. \end{aligned}$$

B Properties of orthonormal matrices

B.1 Polar decomposition

In this section, we establish an error bound analysis for the polar decomposition. For a positive semidefinite matrix \(H\in {\text {S}}^{n}_+\), there exists a unique positive semidefinite matrix \(P\in {\text {S}}^{n}_+\) such that \(P^2=H\); this matrix P is called the square root of H and is denoted by \(P=\sqrt{H}\). If \(H=U\varSigma U^{\scriptscriptstyle {\mathsf {T}}}\) is an eigenvalue decomposition of H, then \(\sqrt{H}=U\sqrt{\varSigma }U^{\scriptscriptstyle {\mathsf {T}}}\), where \(\sqrt{\varSigma }\) is the diagonal matrix whose diagonal entries are the square roots of those of \(\varSigma \). The next result is classical and can be found in [24, 29].

Lemma 15

(Polar Decomposition) Let \(A\in {\mathbb {R}}^{m\times n}\) with \(m\ge n\). Then there exist an orthonormal matrix \(U\in V(n,m)\) and a unique symmetric positive semidefinite matrix \(H\in {\text {S}}^{n}_+\) such that \(A=UH\) and

$$\begin{aligned} U\in {\text {argmax}}\{\langle Q,A\rangle :Q\in V(n,m)\}. \end{aligned}$$
(71)

Moreover, if A is of full rank, then the matrix U is uniquely determined and H is positive definite.
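
Numerically, the polar decomposition in Lemma 15 is conveniently obtained from a thin SVD. The sketch below (ours; the helper name polar and the dimensions are ours) also spot-checks the variational characterization (71) against random feasible points.

```python
import numpy as np

def polar(A):
    # Polar decomposition A = U H via the thin SVD A = W diag(s) Vt:
    # U = W @ Vt is orthonormal and H = Vt.T @ diag(s) @ Vt is symmetric PSD.
    W, s, Vt = np.linalg.svd(A, full_matrices=False)
    return W @ Vt, Vt.T @ np.diag(s) @ Vt

rng = np.random.default_rng(1)
m, n = 6, 4
A = rng.standard_normal((m, n))
U, H = polar(A)
assert np.allclose(U.T @ U, np.eye(n)) and np.allclose(U @ H, A)

# U attains the maximum in (71): <U, A> >= <Q, A> for random Q in V(n, m).
for _ in range(200):
    Q = np.linalg.qr(rng.standard_normal((m, n)))[0]
    assert np.trace(U.T @ A) >= np.trace(Q.T @ A) - 1e-12
```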

The matrix decomposition \(A=UH\) as in Lemma 15 is called the polar decomposition of the matrix A [24]. For convenience, the matrix U is referred to as a polar orthonormal factor matrix and the matrix H as the polar positive semidefinite factor matrix. The optimization reformulation (71) arises from the approximation problem

$$\begin{aligned} \min _{Q\in V(n,m)}\ \Vert B-QC\Vert ^2 \end{aligned}$$

for two given matrices B and C of proper sizes. In the following, we give a global error bound for this problem. To this end, the next lemma is useful.

Lemma 16

(Error Reformulation) Let pmn be positive integers with \(m\ge n\) and let \(B\in {\mathbb {R}}^{m\times p}\), \(C\in {\mathbb {R}}^{n\times p}\) be two given matrices. We set \(A {:}{=}BC^{\scriptscriptstyle {\mathsf {T}}}\in {\mathbb {R}}^{m\times n}\) and suppose that \(A=WH\) is a polar decomposition of A. We have

$$\begin{aligned} \Vert B-QC\Vert _F^2-\Vert B-WC\Vert _F^2= \Vert W\sqrt{H}-Q\sqrt{H}\Vert ^2_F \end{aligned}$$
(72)

for any orthonormal matrix \(Q\in V(n,m)\).

Proof

We have

$$\begin{aligned} \Vert B-QC\Vert _F^2-\Vert B-WC\Vert _F^2&=2\langle B,WC-QC\rangle \\&=2\langle A,W-Q\rangle \\&=2\langle WH,W-Q\rangle \\&= \langle WH,W\rangle -2\langle WH,Q\rangle +\langle QH,Q\rangle \\&= \Vert W\sqrt{H}-Q\sqrt{H}\Vert ^2_F, \end{aligned}$$

where the first and the fourth equalities follow from the fact that both Q and W are in \(V(n,m)\), and the last one is derived from the fact that H is symmetric and positive semidefinite by Lemma 15. \(\square \)
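
The identity (72) is also easy to confirm numerically; the following sketch (ours) reuses the polar helper from the previous code block and computes \(\sqrt{H}\) by the eigenvalue decomposition described at the beginning of this section.

```python
# Numerical check of the identity (72); assumes polar() from the sketch above.
rng = np.random.default_rng(2)
p, m, n = 5, 6, 4
B, C = rng.standard_normal((m, p)), rng.standard_normal((n, p))
W, H = polar(B @ C.T)

w, V = np.linalg.eigh(H)                               # H = V diag(w) V^T
sqrtH = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

for _ in range(200):
    Q = np.linalg.qr(rng.standard_normal((m, n)))[0]
    lhs = np.linalg.norm(B - Q @ C)**2 - np.linalg.norm(B - W @ C)**2
    assert np.isclose(lhs, np.linalg.norm((W - Q) @ sqrtH)**2)
```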

Given an \(n\times n\) symmetric positive semidefinite matrix H, we can define a symmetric bilinear form on \({\mathbb {R}}^{m\times n}\) by

$$\begin{aligned} \langle P,Q\rangle _H:=\langle PH,Q\rangle \end{aligned}$$
(73)

for all \(m\times n\) matrices P and Q. This bilinear form also induces a seminorm

$$\begin{aligned} \Vert A\Vert _H:=\sqrt{\langle A,A\rangle _H}=\Vert A\sqrt{H}\Vert _F. \end{aligned}$$
(74)

In particular, if H is positive definite, then \(\Vert \cdot \Vert _H\) is a norm on \({\mathbb {R}}^{m\times n}\). Thus, the identity (72) can be viewed as an estimate of the distance between W and Q with respect to the metric induced by this norm. Moreover, if H is the identity matrix, then \(\Vert \cdot \Vert _H\) is simply the Frobenius norm, which induces the Euclidean distance on \({\mathbb {R}}^{m\times n}\). By Lemma 16, it is easy to see that the maximizer in (71) is unique whenever A is of full rank.

The following result establishes an error bound with respect to the Euclidean distance. Given a matrix \(A\in {\mathbb {R}}^{m\times n}\), let \(\sigma _{\min }(A)\) denote the smallest singular value of A; if A is of full rank, then \(\sigma _{\min }(A)>0\).

Theorem 9

(Global Error Bound in Frobenius Norm) Let pmn be positive integers with \(m\ge n\) and let \(B\in {\mathbb {R}}^{m\times p}\) and \(C\in {\mathbb {R}}^{n\times p}\) be two given matrices. We set \(A:=BC^{\scriptscriptstyle {\mathsf {T}}}\in {\mathbb {R}}^{m\times n}\) and suppose that A is of full rank with the polar decomposition \(A=WH\). We have that

$$\begin{aligned} \Vert B-QC\Vert _F^2-\Vert B-WC\Vert _F^2\ge \sigma _{\min }(A)\Vert W-Q\Vert ^2_F \end{aligned}$$
(75)

for any orthonormal matrix \(Q\in V(n,m)\).

Proof

We know that in this case

$$\begin{aligned} H-\sigma _{\min }(A)I\in {\text {S}}^{n}_+, \end{aligned}$$

since the eigenvalues of H are exactly the singular values of A. Therefore, using the seminorm defined in (74), we may conclude that

$$\begin{aligned} \Vert W\sqrt{H}-Q\sqrt{H}\Vert ^2_F=\Vert W-Q\Vert ^2_{H}\ge \Vert W-Q\Vert ^2_{\sigma _{\min }(A)I}=\sigma _{\min }(A)\Vert W-Q\Vert _F^2, \end{aligned}$$

where the inequality holds because \(H-\sigma _{\min }(A)I\) is positive semidefinite.

According to Lemma 16, we can derive the desired inequality. \(\square \)

Theorem 9 is a refinement of Sun and Chen’s result (cf. [59, Theorem 4.1]), in which the right hand side of (75) has an extra factor \(\frac{1}{4}\).
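
The bound (75) can be spot-checked in the same way (again our sketch, reusing the polar helper from the code block in Appendix B.1 above):

```python
# Randomized check of the global error bound (75); assumes polar() from above.
rng = np.random.default_rng(3)
p, m, n = 5, 6, 4
B, C = rng.standard_normal((m, p)), rng.standard_normal((n, p))
A = B @ C.T                                            # generically of full rank
W, H = polar(A)
sigma_min = np.linalg.svd(A, compute_uv=False)[-1]

for _ in range(200):
    Q = np.linalg.qr(rng.standard_normal((m, n)))[0]
    lhs = np.linalg.norm(B - Q @ C)**2 - np.linalg.norm(B - W @ C)**2
    assert lhs >= sigma_min * np.linalg.norm(W - Q)**2 - 1e-9
```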

B.2 Principal angles between subspaces

Given two linear subspaces \({\mathbb {U}}\) and \({\mathbb {V}}\) of dimension r in \({\mathbb {R}}^n\), the principal angles \(\{\theta _i:i=1,\dots ,r\}\) between \({\mathbb {U}}\) and \({\mathbb {V}}\) and the associated principal vectors \(\{({\mathbf {u}}_i,{\mathbf {v}}_i):i=1,\dots ,r\}\) are defined recursively by

$$\begin{aligned} \cos (\theta _i)=\langle {\mathbf {u}}_i,{\mathbf {v}}_i\rangle = \max _{{\mathbf {u}}\in {\mathbb {U}},\ [{\mathbf {u}},{\mathbf {u}}_1,\dots ,{\mathbf {u}}_{i-1}]\in V(i,n)}{\left\{ \max _{{\mathbf {v}}\in {\mathbb {V}},\ [{\mathbf {v}},{\mathbf {v}}_1,\dots ,{\mathbf {v}}_{i-1}]\in V(i,n)} \langle {\mathbf {u}},{\mathbf {v}}\rangle \right\} }. \end{aligned}$$
(76)

The following result is standard; its proof can be found in [24, Section 6.4.3].

Lemma 17

For any orthonormal matrices \(U,V\in V(r,n)\) whose columns span \({\mathbb {U}}\) and \({\mathbb {V}}\) respectively, we have

$$\begin{aligned} \sigma _i(U^{\scriptscriptstyle {\mathsf {T}}}V)=\cos (\theta _i)\ \text {for all }i=1,\dots ,r, \end{aligned}$$
(77)

where \(\theta _i\)’s are the principal angles between \({\mathbb {U}}\) and \({\mathbb {V}}\), and \(\sigma _i(U^{\scriptscriptstyle {\mathsf {T}}}V)\) is the i-th largest singular value of the matrix \(U^{\scriptscriptstyle {\mathsf {T}}}V\).

Lemma 18

For any orthonormal matrices \(U,V\in V(r,n)\), we have

$$\begin{aligned} \langle U, V \rangle \le \sum _{j=1}^r\sigma _j(U^{{\scriptscriptstyle {\mathsf {T}}}}V). \end{aligned}$$
(78)

Proof

We recall from [24, p. 331] that

$$\begin{aligned} \min _{Q\in {{\,\mathrm{O}\,}}(r)} \Vert A - BQ \Vert ^2_F = \sum _{j=1}^r (\sigma _j(A)^2 - 2 \sigma _j(B^{{\scriptscriptstyle {\mathsf {T}}}}A) + \sigma _j(B)^2), \end{aligned}$$

for any \(n\times r\) matrices AB. In particular, if \(U,V\in {{\,\mathrm{V}\,}}(r,n)\) then

$$\begin{aligned} \sigma _j(U) = \sigma _j(V) = 1\ \text {for all }j=1,\dots ,r, \quad \Vert U \Vert ^2_F = \Vert V \Vert ^2_F = r. \end{aligned}$$

This implies that

$$\begin{aligned} 2 r - 2 \sum _{j=1}^r \sigma _j(U^{\scriptscriptstyle {\mathsf {T}}}V) = \min _{Q\in {{\,\mathrm{O}\,}}(r)} \Vert U - VQ \Vert ^2_F \le \Vert U - V \Vert ^2_F = 2r - 2 \langle U, V \rangle , \end{aligned}$$

and the desired inequality follows immediately. \(\square \)
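
In floating-point practice, (77) and (78) translate directly into a few lines of NumPy; the sketch below (ours, with arbitrary stand-in dimensions) computes the principal angles from the singular values of \(U^{\scriptscriptstyle {\mathsf {T}}}V\) and spot-checks (78).

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 8, 3
U = np.linalg.qr(rng.standard_normal((n, r)))[0]
V = np.linalg.qr(rng.standard_normal((n, r)))[0]

cosines = np.linalg.svd(U.T @ V, compute_uv=False)  # cos(theta_1) >= ... >= cos(theta_r), cf. (77)
theta = np.arccos(np.clip(cosines, 0.0, 1.0))       # principal angles in [0, pi/2]

assert np.trace(U.T @ V) <= cosines.sum() + 1e-12   # inequality (78)
```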

Lemma 19

For any orthonormal matrices \(U,V\in V(r,n)\), we have

$$\begin{aligned} \Vert U^{\scriptscriptstyle {\mathsf {T}}}V-I\Vert _F^2 \le \Vert U-V\Vert ^2_F. \end{aligned}$$
(79)

Proof

We have

$$\begin{aligned} \Vert U^{\scriptscriptstyle {\mathsf {T}}}V-I\Vert _F^2 = r + \sum _{i=1}^r \cos ^2 \theta _i - 2 {{\,\mathrm{tr}\,}}(U^{\scriptscriptstyle {\mathsf {T}}}V) \le 2r-2{{\,\mathrm{tr}\,}}(U^{\scriptscriptstyle {\mathsf {T}}}V)=\Vert U-V\Vert _F^2, \end{aligned}$$

where the first equality follows from Lemma 17. \(\square \)
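
A quick randomized check of (79) (ours):

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 8, 3
for _ in range(200):
    U = np.linalg.qr(rng.standard_normal((n, r)))[0]
    V = np.linalg.qr(rng.standard_normal((n, r)))[0]
    assert np.linalg.norm(U.T @ V - np.eye(r))**2 <= np.linalg.norm(U - V)**2 + 1e-12
```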

Let \({\mathbb {U}}^\perp \) denote the orthogonal complement of a given linear subspace \({\mathbb {U}}\) in \({\mathbb {R}}^n\). The following useful fact relates the principal angles between two linear subspaces \({\mathbb {U}}\) and \({\mathbb {V}}\) to those between \({\mathbb {U}}^\perp \) and \({\mathbb {V}}^\perp \); its proof can be found in [33, Theorem 2.7].

Lemma 20

Let \({\mathbb {U}}\) and \({\mathbb {V}}\) be two linear subspaces of the same dimension and let \(\frac{\pi }{2}\ge \theta _s \ge \dots \ge \theta _1 > 0\) be the nonzero principal angles between \({\mathbb {U}}\) and \({\mathbb {V}}\). Then the nonzero principal angles between \({\mathbb {U}}^\perp \) and \({\mathbb {V}}^\perp \) are \(\frac{\pi }{2}\ge \theta _s \ge \dots \ge \theta _1 > 0\).

The following result treats the general case and might be of independent interest.

Lemma 21

Let \(m\ge n\) be positive integers and let \(V:=[V_1\ V_2]\in {{\,\mathrm{O}\,}}(m)\) with \(V_1\in V(n,m)\) and \(U\in V(n,m)\) be two given orthonormal matrices. Then, there exists an orthonormal matrix \(W\in V(m-n,m)\) such that \(P:=[U\ W]\in {{\,\mathrm{O}\,}}(m)\) and

$$\begin{aligned} \Vert P-V\Vert _F^2\le 2\Vert U-V_1\Vert _F^2. \end{aligned}$$
(80)

Proof

By a simple computation, it is straightforward to verify that (80) is equivalent to

$$\begin{aligned} \Vert W-V_2\Vert _F^2\le \Vert U-V_1\Vert _F^2. \end{aligned}$$
(81)

To that end, we let \(U_2\in V(m-n,m)\) be an orthonormal matrix such that \([U\ U_2]\in {{\,\mathrm{O}\,}}(m)\). Then, we have that the linear subspace \({\mathbb {U}}_2\) spanned by column vectors of \(U_2\) is the orthogonal complement of \({\mathbb {U}}_1\), which is spanned by column vectors of U. Likewise, let \({\mathbb {V}}_1\) and \({\mathbb {V}}_2\) be linear subspaces spanned by column vectors of \(V_1\) and \(V_2\) respectively.

Let \(\frac{\pi }{2}\ge \theta _s\ge \dots \ge \theta _1 > 0\) be the nonzero principal angles between \({\mathbb {U}}_2\) and \({\mathbb {V}}_2\) for some nonnegative integer \(s\le m-n\). We have by Lemmas 17 and 18 that

$$\begin{aligned} \langle U_2,V_2\rangle \le \sum _{i=1}^{m-n}\sigma _i(U_2^{\scriptscriptstyle {\mathsf {T}}}V_2)=\sum _{i=1}^s\cos (\theta _i)+(m-n)-s. \end{aligned}$$
(82)

Let \(Q\in {{\,\mathrm{O}\,}}(m-n)\) be a polar orthogonal factor matrix of the matrix \(U_2^{\scriptscriptstyle {\mathsf {T}}}V_2\). It follows from the property of polar decomposition that

$$\begin{aligned} \langle U_2Q,V_2\rangle = \sum _{i=1}^{m-n}\sigma _i(U_2^{\scriptscriptstyle {\mathsf {T}}}V_2). \end{aligned}$$
(83)

On the other hand, the nonzero principal angles between \({\mathbb {U}}_1\) and \({\mathbb {V}}_1\) are \(\frac{\pi }{2}\ge \theta _s\ge \dots \ge \theta _1>0\) by Lemma 20. Therefore, by Lemmas 17 and 18, we have that

$$\begin{aligned} \langle U,V_1\rangle \le \sum _{i=1}^{n}\sigma _i(U^{\scriptscriptstyle {\mathsf {T}}}V_1)=\sum _{i=1}^s\cos (\theta _i)+n-s. \end{aligned}$$
(84)

In conclusion, if we set \(W {:}{=}U_2Q\), then we have the following:

$$\begin{aligned} \Vert W-V_2\Vert _F^2&=2(m-n)-2\langle U_2Q,V_2\rangle \\&=2(m-n)-2\big (\sum _{i=1}^s\cos (\theta _i)+m-n-s\big )\\&=2n - 2\big (\sum _{i=1}^s\cos (\theta _i)+n-s\big )\\&\le 2n -2\langle U,V_1\rangle \\&=\Vert U-V_1\Vert ^2_F, \end{aligned}$$

where the second equality follows from (82) and (83) and the inequality follows from (84). \(\square \)
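
The construction in this proof is easy to carry out numerically. In the sketch below (ours, with stand-in dimensions), the orthonormal complement U2 of U is read off from the projector \(I-UU^{\scriptscriptstyle {\mathsf {T}}}\), and Q is the polar orthogonal factor of \(U_2^{\scriptscriptstyle {\mathsf {T}}}V_2\), as in the proof.

```python
# Completing U to P = [U, U2 @ Q] in O(m) with ||P - V||_F^2 <= 2||U - V1||_F^2, cf. (80).
import numpy as np

rng = np.random.default_rng(6)
m, n = 7, 3
V = np.linalg.qr(rng.standard_normal((m, m)))[0]       # V = [V1 V2] in O(m)
V1, V2 = V[:, :n], V[:, n:]
U = np.linalg.qr(rng.standard_normal((m, n)))[0]       # U in V(n, m)

U2 = np.linalg.svd(np.eye(m) - U @ U.T)[0][:, :m - n]  # orthonormal complement of U
Wf, _, Vt = np.linalg.svd(U2.T @ V2)
P = np.hstack([U, U2 @ (Wf @ Vt)])                     # Q = Wf @ Vt is the polar factor

assert np.allclose(P.T @ P, np.eye(m))
assert np.linalg.norm(P - V)**2 <= 2 * np.linalg.norm(U - V1)**2 + 1e-9
```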

C Proofs of technical lemmas in Section 4

C.1 Proof of Lemma 6

Proof

For each \(i\in \{0,\dots ,k-1\}\), we have (cf. (23), (36) and (41))

$$\begin{aligned} f({\mathbb {U}}_{i+1,[p]})-f({\mathbb {U}}_{i,[p]})&=\sum _{j=1}^r(\lambda ^{i+1}_{j,[p]})^2-\sum _{j=1}^r(\lambda ^{i}_{j,[p]})^2\nonumber \\&=\sum _{j=1}^r(\lambda ^{i+1}_{j,[p]}+\lambda ^{i}_{j,[p]})(\lambda ^{i+1}_{j,[p]}-\lambda ^{i}_{j,[p]})\nonumber \\&=\sum _{j=1}^r\lambda ^{i+1}_{j,[p]}(\lambda ^{i+1}_{j,[p]}-\lambda ^{i}_{j,[p]})+\sum _{j=1}^r\lambda ^{i}_{j,[p]}(\lambda ^{i+1}_{j,[p]}-\lambda ^{i}_{j,[p]}). \end{aligned}$$
(85)

We first analyze the second summand in (85) by considering the following two cases:

  1. If \(\sigma ^{(i+1)}_{r,[p]}\ge \epsilon \), then there is no proximal step in Algorithm 1 and we have that \(\sigma _{\min }(S^{(i+1)}_{[p]})=\sigma ^{(i+1)}_{r,[p]}\ge \epsilon \), where \(S^{(i+1)}_{[p]}\) is the polar positive semidefinite factor matrix of \(V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]}\) obtained in (39). From (35), (36) and (37) we notice that

    $$\begin{aligned} ((U^{(i+1)}_{[p]})^{\scriptscriptstyle {\mathsf {T}}}V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]})_{jj}= & {} (({\mathbf {u}}^{(i+1)}_{j,[p]})^{\scriptscriptstyle {\mathsf {T}}}{\mathbf {v}}^{(i+1)}_{j,[p]})\lambda ^{i}_{j,[p]} =(({\mathbf {u}}^{(i+1)}_{j,[p]})^{\scriptscriptstyle {\mathsf {T}}}{\mathcal {A}}\tau _{i+1}{\mathbf {x}}^{(i+1)}_{j,[p]}) \lambda ^{i}_{j,[p]}\\= & {} {\mathcal {A}}\tau ({\mathbf {x}}^{(i+2)}_{j,[p]})\lambda ^{i}_{j,[p]} =\lambda ^{i+1}_{j,[p]}\lambda ^{i}_{j,[p]}, \end{aligned}$$

    and similarly \(((U^{(i+1)}_{[p-1]})^{\scriptscriptstyle {\mathsf {T}}}V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]})_{jj} = \lambda ^{i}_{j,[p]}\lambda ^{i}_{j,[p]}\). Hence by Lemma 16, we obtain

    $$\begin{aligned} \sum _{j=1}^r\lambda ^{i}_{j,[p]}(\lambda ^{i+1}_{j,[p]}-\lambda ^{i}_{j,[p]})&={\text {Tr}}((U^{(i+1)}_{[p]})^{\scriptscriptstyle {\mathsf {T}}}V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]}) -{\text {Tr}}((U^{(i+1)}_{[p-1]})^{\scriptscriptstyle {\mathsf {T}}}V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]})\nonumber \\&=\frac{1}{2}\big \Vert (U^{(i+1)}_{[p]}-U^{(i+1)}_{[p-1]})\sqrt{S^{(i+1)}_{[p]}}\big \Vert _F^2\nonumber \\&\ge \frac{\epsilon }{2}\Vert U^{(i+1)}_{[p]}-U^{(i+1)}_{[p-1]}\Vert _F^2\nonumber \\&\ge 0. \end{aligned}$$
    (86)
  2. If \(\sigma ^{(i+1)}_{r,[p]}<\epsilon \), we consider the following matrix optimization problem

    $$\begin{aligned} \begin{array}{rl} \max & \langle V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]},U\rangle -\frac{\epsilon }{2}\Vert U-U^{(i+1)}_{[p-1]}\Vert _F^2\\ \text {s.t.}& U\in V(r,n_{i+1}). \end{array} \end{aligned}$$
    (87)

    Since \(U,U^{(i+1)}_{[p-1]} \in {V(r,n_{i+1})}\), we must have

    $$\begin{aligned} \frac{\epsilon }{2}\Vert U- U^{(i+1)}_{[p-1]}\Vert _F^2=\epsilon r-\epsilon \langle U^{(i+1)}_{[p-1]},U\rangle . \end{aligned}$$

    Thus, by Lemma 15, a global maximizer of (87) is given by a polar orthonormal factor matrix of \(V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]}+\epsilon U^{(i+1)}_{[p-1]}\). By Substep 2 of Algorithm 1, \(U^{(i+1)}_{[p]}\) is exactly such a polar orthonormal factor matrix, and hence a global maximizer of (87). By this optimality, we have

    $$\begin{aligned} \langle V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]},U^{(i+1)}_{[p]}\rangle -\frac{\epsilon }{2}\Vert U^{(i+1)}_{[p]}-U^{(i+1)}_{[p-1]}\Vert _F^2\ge \langle V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]},U^{(i+1)}_{[p-1]}\rangle . \end{aligned}$$

    Therefore, the inequality (86) in case (1) also holds in this case.

Consequently, we have

$$\begin{aligned} 0\le \sum _{j=1}^r\lambda ^{i}_{j,[p]}(\lambda ^{i+1}_{j,[p]}-\lambda ^{i}_{j,[p]})=\sum _{j=1}^r\lambda ^{i+1}_{j,[p]}\lambda ^{i}_{j,[p]}-\sum _{j=1}^r(\lambda ^{i}_{j,[p]})^2, \end{aligned}$$
(88)

which together with the Cauchy-Schwarz inequality implies that

$$\begin{aligned} \big (\sum _{j=1}^r(\lambda ^{i}_{j,[p]})^2\big )^2\le \big (\sum _{j=1}^r\lambda ^{i}_{j,[p]}\lambda ^{i+1}_{j,[p]} \big )^2\le \sum _{j=1}^r(\lambda ^{i}_{j,[p]})^2\sum _{j=1}^r(\lambda ^{i+1}_{j,[p]})^2. \end{aligned}$$
(89)

Since \(f({\mathbb {U}}_{[0]})>0\), we conclude that \(f({\mathbb {U}}_{0,[1]})=\sum _{j=1}^r(\lambda ^{0}_{j,[1]})^2>0\) and hence \(\sum _{j=1}^r\lambda ^{1}_{j,[1]}\lambda ^{0}_{j,[1]}>0\) by (88). Thus, we conclude that

$$\begin{aligned} \sum _{j=1}^r\lambda ^{1}_{j,[1]}\lambda ^{0}_{j,[1]} \le \sum _{j=1}^r(\lambda ^{1}_{j,[1]})^2\frac{\sum _{j=1}^r(\lambda ^{0}_{j,[1]})^2}{\sum _{j=1}^r\lambda ^{1}_{j,[1]}\lambda ^{0}_{j,[1]}}\le \sum _{j=1}^r(\lambda ^{1}_{j,[1]})^2, \end{aligned}$$
(90)

where the first inequality follows from (89) and the second from (88). Combining (90) with (85) and (86), we obtain (44) for \(i=0\) and \(p=1\).

On the one hand, from (89) we obtain

$$\begin{aligned} 0 \le \sum _{j=1}^r (\lambda _{j,[p]}^i)^2 (\sum _{j=1}^r (\lambda _{j,[p]}^{i+1})^2 - \sum _{j=1}^r (\lambda _{j,[p]}^i)^2). \end{aligned}$$

Since \(\sum _{j=1}^r (\lambda _{j,[p]}^i)^2 > 0\) if there is no truncation, we must have

$$\begin{aligned} 0 \le \sum _{j=1}^r (\lambda _{j,[p]}^{i+1})^2 - \sum _{j=1}^r (\lambda _{j,[p]}^i)^2 = f({\mathbb {U}}_{i+1,[p]})-f({\mathbb {U}}_{i,[p]}), \end{aligned}$$

i.e., the objective function f is monotonically increasing during the APD iteration as long as there is no truncation. On the other hand, there are at most r truncations, and hence the total loss of f due to truncation is at most \(r\kappa ^2<f({\mathbb {U}}_{[0]})\). Therefore, f remains positive and, according to (88), we may conclude that \(\sum _{j=1}^r\lambda ^{i+1}_{j,[p]}\lambda ^{i}_{j,[p]}>0\) along the iterations. By induction on p, we obtain

$$\begin{aligned} \sum _{j=1}^r\lambda ^{i+1}_{j,[p]}\lambda ^{i}_{j,[p]} \le \sum _{j=1}^r(\lambda ^{i+1}_{j,[p]})^2\frac{\sum _{j=1}^r(\lambda ^{i}_{j,[p]})^2}{\sum _{j=1}^r\lambda ^{i+1}_{j,[p]}\lambda ^{i}_{j,[p]}}\le \sum _{j=1}^r(\lambda ^{i+1}_{j,[p]})^2, \end{aligned}$$

which, together with (85) and (86), implies (44) for an arbitrary nonnegative integer p. \(\square \)
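
The key step in case 2 above, namely that the proximal subproblem (87) is solved by a polar decomposition, can also be verified numerically. Below is a sketch (ours), where M is stand-in data playing the role of \(V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]}\) and the dimensions are arbitrary.

```python
# Check for (87): over V(r, n), U -> <M, U> - (eps/2)*||U - U_prev||_F^2 is
# maximized by the polar orthonormal factor of M + eps*U_prev (cf. Lemma 15).
import numpy as np

rng = np.random.default_rng(7)
n, r, eps = 6, 3, 0.1
M = rng.standard_normal((n, r))
U_prev = np.linalg.qr(rng.standard_normal((n, r)))[0]

Wf, _, Vt = np.linalg.svd(M + eps * U_prev, full_matrices=False)
U_star = Wf @ Vt

def objective(U):
    return np.trace(M.T @ U) - eps / 2 * np.linalg.norm(U - U_prev)**2

for _ in range(200):
    Q = np.linalg.qr(rng.standard_normal((n, r)))[0]
    assert objective(U_star) >= objective(Q) - 1e-12
```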

C.2 Proof of Lemma 7

Proof

The set of subgradients of h can be written as the following product:

$$\begin{aligned} \partial h({\mathbb {U}})=(-\nabla _1 f({\mathbb {U}})+\partial \delta _{V(r,n_1)}(U^{(1)}))\times \dots \times (-\nabla _k f({\mathbb {U}})+\partial \delta _{V(r,n_k)}(U^{(k)})). \end{aligned}$$
(91)

Following the notation of Algorithm 1, we set

$$\begin{aligned} {\mathbf {x}}_{j}:=({\mathbf {u}}^{(1)}_{j,[p+1]},\dots ,{\mathbf {u}}^{(k)}_{j,[p+1]})\ \text {for all }j=1,\dots ,r, \end{aligned}$$

where \({\mathbf {u}}^{(i)}_{j,[p+1]}\) is the j-th column of the matrix \(U^{(i)}_{[p+1]}\) for all \(i\in \{1,\dots ,k\}\),

$$\begin{aligned} V^{(i)}:=\begin{bmatrix}{\mathbf {v}}^{(i)}_1&\dots&{\mathbf {v}}^{(i)}_r\end{bmatrix}\ \text {with }{\mathbf {v}}^{(i)}_j:={\mathcal {A}}\tau _i({\mathbf {x}}_{j})\ \text {for all }j=1,\dots ,r, \end{aligned}$$
(92)

for all \(i\in \{1,\dots ,k\}\) and

$$\begin{aligned} \varLambda :={\text {diag}}(\lambda _1,\dots ,\lambda _r)\ \text {with }\lambda _j:={\mathcal {A}}\tau ({\mathbf {x}}_{j}). \end{aligned}$$
(93)

By (39) and (87), we have

$$\begin{aligned} V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha U^{(i)}_{[p]}=U^{(i)}_{[p+1]}S^{(i)}_{[p+1]}, \end{aligned}$$
(94)

where \(\alpha \in \{0,\epsilon \}\), depending on whether or not a proximal correction is performed. According to (69) and (94), we have

$$\begin{aligned} -U^{(i)}_{[p+1]} \in \partial \delta _{V(r,n_i)}(U^{(i)}_{[p+1]}), \quad V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha U^{(i)}_{[p]} \in \partial \delta _{V(r,n_i)}(U^{(i)}_{[p+1]}), \end{aligned}$$

which implies that \(V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha \big (U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\big )\in \partial \delta _{V(r,n_i)}(U^{(i)}_{[p+1]})\). If we take

$$\begin{aligned} W^{(i)}_{[p+1]}:=-2V^{(i)}\varLambda +2V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+2\alpha \big (U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\big ), \end{aligned}$$
(95)

then we have

$$\begin{aligned} W^{(i)}_{[p+1]}\in -2V^{(i)}\varLambda +\partial \delta _{V(r,n_i)}(U^{(i)}_{[p+1]}). \end{aligned}$$

On the other hand,

$$\begin{aligned} \frac{1}{2}\Vert W^{(i)}_{[p+1]}\Vert _F&=\Vert V^{(i)}\varLambda -V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}-\alpha \big (U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\big )\Vert _F\\&\le \Vert V^{(i)}\varLambda -V^{(i)}_{[p+1]}\varLambda \Vert _F+\Vert V^{(i)}_{[p+1]}\varLambda -V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}\Vert _F+\alpha \Vert U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\Vert _F\\&\le \Vert V^{(i)}-V^{(i)}_{[p+1]}\Vert _F\Vert \varLambda \Vert _F+\Vert V^{(i)}_{[p+1]}\Vert _F\Vert \varLambda -\varLambda ^{(i)}_{[p+1]}\Vert _F+\alpha \Vert U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\Vert _F\\&\le {\Vert {\mathcal {A}}\Vert }\Vert \varLambda \Vert _F\big (\sum _{j=1}^r\Vert \tau _i({\mathbf {x}}_j)-\tau _i({\mathbf {x}}^i_{j,[p+1]})\Vert \big )\\&\quad +\Vert V^{(i)}_{[p+1]}\Vert _F\Vert {\mathcal {A}}\Vert \big (\sum _{j=1}^r\Vert \tau ({\mathbf {x}}_j)-\tau ({\mathbf {x}}^i_{j,[p+1]})\Vert \big )+\alpha \Vert U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\Vert _F\\&\le \sqrt{r}\Vert {\mathcal {A}}\Vert ^2\big (\sum _{j=1}^r\sum _{s=i+1}^{k}\Vert {\mathbf {u}}^{(s)}_{j,[p+1]}-{\mathbf {u}}^{(s)}_{j,[p]}\Vert \big )\\&\quad +\sqrt{r}\Vert {\mathcal {A}}\Vert ^2\big (\sum _{j=1}^r\sum _{s=i}^{k}\Vert {\mathbf {u}}^{(s)}_{j,[p+1]}-{\mathbf {u}}^{(s)}_{j,[p]}\Vert \big )+\alpha \Vert U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\Vert _F\\&\le (2r\sqrt{k}\Vert {\mathcal {A}}\Vert ^2 +\epsilon )\Vert {\mathbb {U}}_{[p+1]}-{\mathbb {U}}_{[p]}\Vert _F, \end{aligned}$$

where the third inequality follows from the fact that

$$\begin{aligned} V^{(i)}-V^{(i)}_{[p+1]}=\begin{bmatrix}{\mathcal {A}}(\tau _i({\mathbf {x}}_1)-\tau _i({\mathbf {x}}^i_{1,[p+1]}))&\dots&{\mathcal {A}}(\tau _i({\mathbf {x}}_r)-\tau _i({\mathbf {x}}^i_{r,[p+1]}))\end{bmatrix}, \end{aligned}$$

and a similar formula for \(\varLambda -\varLambda ^{(i)}_{[p+1]}\), the fourth follows from the fact that

$$\begin{aligned} |{\mathcal {A}}\tau ({\mathbf {x}})|\le \Vert {\mathcal {A}}\Vert \end{aligned}$$

for any vector \({\mathbf {x}}:=({\mathbf {x}}_1,\dots ,{\mathbf {x}}_k)\) with \(\Vert {\mathbf {x}}_i\Vert =1\) for all \(i=1,\dots ,k\) and the last one follows from \(\alpha \le \epsilon \). This, together with (91), implies (48). \(\square \)

C.3 Proof of Lemma 9

Proof

We let

$$\begin{aligned} W^{(i)}_{[p+1]}:=V^{(i)}\varLambda -V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}-\alpha \big (U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\big ), \end{aligned}$$

where \(\alpha \in \{0,\epsilon \}\), depending on whether or not a proximal correction is performed (cf. the proof of Lemma 7). It follows from Lemma 7 that

$$\begin{aligned} {\Vert W^{(i)}_{[p+1]}\Vert _F}\le \gamma _0 \Vert {\mathbb {U}}_{[p]}-{\mathbb {U}}_{[p+1]}\Vert _F \end{aligned}$$

for some constant \(\gamma _0>0\). By Algorithm 1, we have

$$\begin{aligned} V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha U^{(i)}_{[p]}=U^{(i)}_{[p+1]}S^{(i)}_{[p+1]} \end{aligned}$$
(96)

where \(S^{(i)}_{[p+1]}\) is a symmetric positive semidefinite matrix. Since \(U^{(i)}_{[p+1]}\in V(r,n_i)\) is an orthonormal matrix, we have

$$\begin{aligned} S^{(i)}_{[p+1]}=(U^{(i)}_{[p+1]})^{\scriptscriptstyle {\mathsf {T}}}\big (V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha U^{(i)}_{[p]}\big )=\big (V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha U^{(i)}_{[p]}\big )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}, \end{aligned}$$
(97)

where the second equality follows from the symmetry of the matrix \(S^{(i)}_{[p+1]}\).

Consequently, we have

$$\begin{aligned}&\frac{1}{2}\Vert \nabla _i f({\mathbb {U}}_{[p+1]})-U^{(i)}_{[p+1]}(\nabla _i f({\mathbb {U}}_{[p+1]}))^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F\nonumber \\&\quad =\Vert V^{(i)}\varLambda -U^{(i)}_{[p+1]}(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F\nonumber \\&\quad =\Vert W^{(i)}_{[p+1]}+V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha \big (U^{(i)}_{[p]} -U^{(i)}_{[p+1]}\big )-U^{(i)}_{[p+1]}(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F\nonumber \\&\quad \le \Vert W^{(i)}_{[p+1]}\Vert _F+\alpha {\Vert U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\Vert _F}+\Vert V^{(i)}_{[p+1]} \varLambda ^{(i)}_{[p+1]}-U^{(i)}_{[p+1]}(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F\nonumber \\&\quad \le \gamma _1 \Vert {\mathbb {U}}_{[p]}-{\mathbb {U}}_{[p+1]}\Vert _F+\Vert V^{(i)}_{[p+1]} \varLambda ^{(i)}_{[p+1]}-U^{(i)}_{[p+1]}(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F, \end{aligned}$$
(98)

where \(\gamma _1 = \gamma _0+\epsilon \). Next, we derive an estimate for the second summand on the right hand side of (98). To do this, we notice that

$$\begin{aligned}&\Vert V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}-U^{(i)}_{[p+1]}(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F\nonumber \\&\quad =\Vert U^{(i)}_{[p+1]}S^{(i)}_{[p+1]}-\alpha U^{(i)}_{[p]} -U^{(i)}_{[p+1]}(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F\nonumber \\&\quad =\Vert U^{(i)}_{[p+1]}\big (V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha U^{(i)}_{[p]}\big )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}-\alpha U^{(i)}_{[p]}-U^{(i)}_{[p+1]}(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}\Vert _F\nonumber \\&\quad \le \Vert U^{(i)}_{[p+1]}\big ((V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]})^{\scriptscriptstyle {\mathsf {T}}}-(V^{(i)}\varLambda )^{\scriptscriptstyle {\mathsf {T}}}\big )U^{(i)}_{[p+1]}\Vert _F+\alpha \Vert U^{(i)}_{[p+1]}(U^{(i)}_{[p]})^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}-U^{(i)}_{[p]}\Vert _F\nonumber \\&\quad \le \Vert V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}-V^{(i)}\varLambda \Vert _F+\alpha \Vert U^{(i)}_{[p+1]}(U^{(i)}_{[p]})^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}-U^{(i)}_{[p]}\Vert _F\nonumber \\&\quad \le \gamma _2 \Vert {\mathbb {U}}_{[p+1]}-{\mathbb {U}}_{[p]}\Vert _F, \end{aligned}$$
(99)

where \(\gamma _2>0\) is some constant, the first equality follows from (96), the second from (97), the second inequality (see footnote 5) follows from the fact that \(U^{(i)}_{[p+1]}\in V(r,n_i)\), and the last inequality from the relation

$$\begin{aligned} \Vert (U^{(i)}_{[p]})^{\scriptscriptstyle {\mathsf {T}}}U^{(i)}_{[p+1]}-I\Vert _F\le \Vert U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\Vert _F, \end{aligned}$$

which is obtained by Lemma 19. The desired inequality can be derived easily from (98) and (99). \(\square \)

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Hu, S., Ye, K. Linear convergence of an alternating polar decomposition method for low rank orthogonal tensor approximations. Math. Program. 199, 1305–1364 (2023). https://doi.org/10.1007/s10107-022-01867-8
