Abstract
Low rank orthogonal tensor approximation (LROTA) is an important problem in tensor computations and their applications. A classical and widely used algorithm is the alternating polar decomposition method (APD). In this paper, an improved version of the classical APD, named iAPD, is proposed. For the first time, all of the following four fundamental properties are established for iAPD: (i) the algorithm converges globally and the whole sequence converges to a KKT point without any assumption; (ii) it exhibits overall sublinear convergence with an explicit rate that is sharper than the usual O(1/k) for first order methods in optimization; (iii) more importantly, it converges R-linearly for a generic tensor without any assumption; (iv) for almost all LROTA problems, iAPD reduces to APD after finitely many iterations if it converges to a local minimizer.



Notes
Yang’s paper was posted during the final preparation of this paper. Both works employ the proximal technique. A difference is that the proximal correction in our algorithm is adaptive, and a theoretical investigation of the execution of the proximal correction is also given (cf. Sect. 5).
An algebraic variety here means a set of common roots of some polynomials. In other words, an algebraic variety is a closed subset of some \({\mathbb {R}}^N\) with respect to the Zariski topology.
This is due to the fact that \({\text {dim}}(V(n-1,n))={\text {dim}}(V(n,n))\).
In the general definition of KL property, \(\partial p({\mathbf {x}})\) is actually the limiting subdifferential of p at \(\mathbf {x}\). However, for functions considered in this paper, the two notions of subdifferentials coincide. For this reason, we do not distinguish them in the paper.
Since U is orthonormal, we must have
$$\begin{aligned} \Vert AUU^{\scriptscriptstyle {\mathsf {T}}}\Vert _F^2=\Vert AU\Vert _F^2=\langle AUU^{\scriptscriptstyle {\mathsf {T}}},A\rangle \le \Vert A\Vert _F^2. \end{aligned}$$
References
Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, USA (2008)
Anandkumar, A., Ge, R., Hsu, D., Kakade, S.M., Telgarsky, M.: Tensor decompositions for learning latent variable models. J. Mach. Learn. Res. 15, 2773–2832 (2014)
Anderson, E., Bai, Z.Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide, 3rd edn. SIAM, Philadelphia, PA (1999)
Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35, 438–457 (2010)
Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program. 137(1–2), 91–129 (2013)
Beck, A.: First-Order Methods in Optimization. MOS-SIAM Series on Optimization, SIAM (2017)
Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont, USA (1999)
Bochnak, J., Coste, M., Roy, M.-F.: Real Algebraic Geometry. Ergebnisse der Mathematik und ihrer Grenzgebiete, vol. 36. Springer, Berlin (1998)
Bolte, J., Daniilidis, A., Lewis, A.S., Shiota, M.: Clarke subgradients of stratifiable manifolds. SIAM J. Optim. 18(2), 556–572 (2007)
Bolte, J., Daniilidis, A., Ley, O., Mazet, L.: Characterizations of Łojasiewicz inequalities and applications: subgradient flows, talweg, convexity. Trans. Am. Math. Soc. 362, 3319–3363 (2010)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)
Boralevi, A., Draisma, J., Horobet, E., Robeva, E.: Orthogonal and unitary tensor decomposition from an algebraic perspective. Isr. J. Math. 222, 223–260 (2017)
Chen, J., Saad, Y.: On the tensor SVD and the optimal low rank orthogonal approximation of tensors. SIAM J. Matrix Anal. Appl. 30, 1709–1734 (2009)
Comon, P.: Independent component analysis, a new concept? Signal Process. 36, 287–314 (1994)
Comon, P.: Tensors: a brief introduction. IEEE Signal Proc. Mag. 31, 44–53 (2014)
Comon, P., Jutten, C.: Handbook of Blind Source Separation. Academic Press, Oxford (2010)
D’Acunto, D., Kurdyka, K.: Explicit bounds for the Łojasiewicz exponent in the gradient inequality for polynomials. Ann. Polon. Math. 87, 51–61 (2005)
De Lathauwer, L., De Moor, B., Vandewalle, J.: On the best rank-1 and rank-\((r_1, r_2,\ldots, r_N)\) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl. 21, 1324–1342 (2000)
De Silva, V., Lim, L.-H.: Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM J. Matrix Anal. Appl. 29, 1084–1127 (2008)
do Carmo, M.P.: Riemannian Geometry. Birkhäuser, Boston (1992)
Eckart, C., Young, G.: The approximation of one matrix by another of lower rank. Psychometrika 1, 211–218 (1936)
Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM. J. Matrix Anal. Appl. 20(2), 303–353 (1998)
Franc, A.: Etude Algébrique des Multitableaux: Apports de l’Algébre Tensorielle. Thèse de Doctorat, Spécialité Statistiques, Univ. de Montpellier II, Montpellier, France (1992)
Golub, G.H., Van Loan, C.F.: Matrix Computations, 4th edn. Johns Hopkins University Press, Baltimore, MD (2013)
Grasedyck, L., Kressner, D., Tobler, C.: A literature survey of low-rank tensor approximation techniques. GAMM-Mitt. 36, 53–78 (2013)
Guan, Y., Chu, D.: Numerical computation for orthogonal low-rank approximation of tensors. SIAM J. Matrix Anal. Appl. 40, 1047–1065 (2019)
Hackbusch, W.: Tensor Spaces and Numerical Tensor Calculus. Springer, Berlin (2012)
Hartshorne, R.: Algebraic Geometry. Graduate Texts in Mathematics 52. Springer, New York (1977)
Higham, N.J.: Computing the polar decomposition-with applications. SIAM J. Sci. Statist. Comput. 7, 1160–1174 (1986)
Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, New York (1985)
Hu, S., Li, G.: Convergence rate analysis for the higher order power method in best rank one approximations of tensors. Numer. Math. 140, 993–1031 (2018)
Ishteva, M., Absil, P.-A., Van Dooren, P.: Jacobi algorithm for the best low multilinear rank approximation of symmetric tensors. SIAM J. Matrix Anal. Appl. 34, 651–672 (2013)
Knyazev, A.V., Argentati, M.E.: Majorization for changes in angles between subspaces, Ritz values, and graph Laplacian spectra. SIAM J. Matrix Anal. Appl. 29, 15–32 (2006)
Kolda, T.G.: Orthogonal tensor decompositions. SIAM J. Matrix Anal. Appl. 23, 243–255 (2001)
Kolda, T.G.: A counterexample to the possibility of an extension of the Eckart-Young low rank approximation theorem for the orthogonal rank tensor decomposition. SIAM J. Matrix Anal. Appl. 24, 762–767 (2002)
Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51, 455–500 (2009)
Kruskal, J.B.: Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebr. Appl. 18, 95–138 (1977)
Kurdyka, K.: On gradients of functions definable in o-minimal structures. Annales de l’Institut Fourier 48(3), 769–783 (1998)
Landsberg, J.M.: Tensors: Geometry and Applications. AMS, Providence, RI (2012)
Lee, J.D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M.I., Recht, B.: First-order methods almost always avoid strict saddle points. Math. Program. 176, 311–337 (2019)
Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. Proc. Mach. Learn. Res. 49, 1–12 (2016)
Li, G., Pong, T.K.: Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. 18, 1199–1232 (2018)
Li, J., Usevich, K., Comon, P.: Globally convergent Jacobi-type algorithms for simultaneous orthogonal symmetric tensor diagonalization. SIAM J. Matrix Anal. Appl. 39, 1–22 (2018)
Li, J., Usevich, K., Comon, P.: On approximate diagonalization of third order symmetric tensors by orthogonal transformations. Linear Algebr. Appl. 576, 324–351 (2019)
Li, J., Usevich, K., Comon, P.: Jacobi-type algorithm for low rank orthogonal approximation of symmetric tensors and its convergence analysis. arXiv: 1911.00659
Lim, L.-H.: Tensors and hypermatrices. Chapter 15 in Handbook of Linear Algebra. In: Hogben, L. (eds.) (2013)
Liu, H., Wu, W., So, A.M.-C.: Quadratic optimization with orthogonality constraints: explicit Łojasiewicz exponent and linear convergence of line-search methods. In: Proceedings of the 33rd International Conference on Machine Learning, New York, USA (2016)
Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels, Les Équations aux Dérivées Partielles. Éditions du centre National de la Recherche Scientifique, Paris, pp. 87–89 (1963)
Martin, C.D.M., Van Loan, C.F.: A Jacobi-type method for computing orthogonal tensor decompositions. SIAM J. Matrix Anal. Appl. 30, 1219–1232 (2008)
McCullagh, P.: Tensor Methods in Statistics. Chapman and Hall, London (1987)
Milnor, J.: Morse Theory. Annals Math. Studies, 51. Princeton Univ. Press, Princeton, NJ (1963)
Mohlenkamp, M.J.: Musings on multilinear fitting. Linear Algebr. Appl. 438, 834–852 (2013)
Mu, C., Hsu, D., Goldfarb, D.: Successive rank-one approximations for nearly orthogonally decomposable symmetric tensors. SIAM J. Matrix Anal. Appl. 36, 1638–1659 (2015)
Nesterov, Y.: Introductory Lectures on Convex Optimization-A Basic Course. Kluwer Academic Publishers, London (2004)
Qi, L., Luo, Z.: Tensor Analysis: Spectral Theory and Special Tensors. SIAM (2017)
Rockafellar, R.T., Wets, R.: Variational Analysis. Grundlehren der Mathematischen Wissenschaften, vol. 317. Springer, Berlin (1998)
Shafarevich, I.R.: Basic Algebraic Geometry. Springer-Verlag, Berlin (1977)
Sørensen, M., De Lathauwer, L., Comon, P., Icart, S., Deneire, L.: Canonical polyadic decomposition with a columnwise orthonormal factor matrix. SIAM J. Matrix Anal. Appl. 33, 1190–1213 (2012)
Sun, J., Chen, C.: Generalized polar decomposition. Math. Numer. Sinica 11, 262–273 (1989)
Uschmajew, A.: Local convergence of the alternating least squares algorithm for canonical tensor approximation. SIAM J. Matrix Anal. Appl. 33, 639–652 (2012)
Uschmajew, A.: A new convergence proof for the high-order power method and generalizations. Pacific J. Optim. 11, 309–321 (2015)
Usevich, K., Li, J., Comon, P.: Approximate matrix and tensor diagonalization by unitary transformations: convergence of Jacobi-type algorithms. SIAM J. Optim. 30, 2998–3028 (2020)
Wang, L., Chu, M.T.: On the global convergence of the alternating least squares method for rank-one tensor approximation. SIAM J. Matrix Anal. Appl. 35(3), 1058–1072 (2014)
Wang, L., Chu, M.T., Yu, B.: Orthogonal low rank tensor approximation: alternating least squares method and its global convergence. SIAM J. Matrix Anal. Appl. 36, 1–19 (2015)
Yang, Y.: The epsilon-alternating least squares for orthogonal low-rank tensor approximation and its global convergence. SIAM J. Matrix Anal. Appl. 41, 1797–1825 (2020)
Zhang, T., Golub, G.H.: Rank-one approximation to high order tensors. SIAM J. Matrix Anal. Appl. 23, 534–550 (2001)
Acknowledgements
We would like to acknowledge the anonymous associate editor and two referees for their thoughtful suggestions and helpful comments on our manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We declare that no part of this paper has been published or submitted elsewhere.
This work is partially supported by National Science Foundation of China (Grant No. 11771328). The first author is also partially supported by National Science Foundation of China (Grant No. 12171128), Young Elite Scientists Sponsorship Program by Tianjin and the Natural Science Foundation of Zhejiang Province, China (Grant No. LD19A010002 and Grant No. LY22A010022). The second author is also partially supported by National Science Foundation of China (Grant No. 11801548 and Grant No. 11688101), National Key R&D Program of China (Grant No. 2018YFA0306702 and Grant No. 2020YFA0712300) and CAS Project for Young Scientists in Basic Research (Grant No. YSBR-008).
Appendices
A Preliminaries
This section consists of preliminaries on differential geometry and optimization theory, which are needed for the detailed proofs in this paper. More precisely, we discuss Stiefel manifolds and their basic properties in Appendix A.1; Morse functions and the Kurdyka-Łojasiewicz property are introduced in Appendices A.2 and A.3, respectively.
A.1 Stiefel manifolds
Let \(m\le n\) be two positive integers and let \(V(m,n)\subseteq {\mathbb {R}}^{n \times m}\) be the set of all \(n \times m\) orthonormal matrices, i.e.,
$$\begin{aligned} V(m,n) {:}{=}\{A\in {\mathbb {R}}^{n\times m}: A^{\scriptscriptstyle {\mathsf {T}}}A=I\}, \end{aligned}$$
where I is the identity matrix of matching size. Indeed, V(m, n) admits a smooth manifold structure and is called the Stiefel manifold of orthonormal m-frames in \({\mathbb {R}}^n\). In particular, if \(m = n\) then V(n, n) simply reduces to the orthogonal group \({{\,\mathrm{O}\,}}(n)\).
In general, if M is a Riemannian submanifold of \({\mathbb {R}}^k\), then the normal space \({\text {N}}_M({\mathbf {x}})\) of M at a point \({\mathbf {x}}\in M\) is defined to be the orthogonal complement of its tangent space \({\text {T}}_M({\mathbf {x}})\) in \({\mathbb {R}}^k\), i.e., \({\text {N}}_M ({\mathbf {x}}) {:}{=}{\text {T}}_M({\mathbf {x}})^{\perp }\). By definition, V(m, n) is naturally a Riemannian submanifold of \({\mathbb {R}}^{n \times m}\) and hence its normal space is well-defined. Indeed, it follows from [56, Chapter 6.C] and [1, 22] that
$$\begin{aligned} {\text {N}}_{V(m,n)}(A)=\{AS: S\in {\text {S}}^{m\times m}\}\quad \text{ for } \text{ all } A\in V(m,n), \end{aligned}$$
where \({\text {S}}^{m\times m}\subseteq {\mathbb {R}}^{m\times m}\) is the subspace of \(m \times m\) symmetric matrices.
Given a matrix \(B\in {\mathbb {R}}^{n \times m}\), the projection of B onto the normal space of V(m, n) at a point \(A\in V(m,n)\) is
$$\begin{aligned} A\,\frac{A^{\scriptscriptstyle {\mathsf {T}}}B+B^{\scriptscriptstyle {\mathsf {T}}}A}{2}. \end{aligned}$$
Likewise, given a matrix \(B\in {\mathbb {R}}^{n \times m}\), the projection of B onto the tangent space of V(m, n) at A is given by
$$\begin{aligned} A\,{\text {skew}}(A^{\scriptscriptstyle {\mathsf {T}}}B)+(I-AA^{\scriptscriptstyle {\mathsf {T}}})B, \end{aligned}$$
where \({\text {skew}}(C) {:}{=}\frac{C-C^{\scriptscriptstyle {\mathsf {T}}}}{2} \) for a square matrix \(C\in {\mathbb {R}}^{m \times m}\). A more explicit formula is given as
$$\begin{aligned} B-A\,\frac{A^{\scriptscriptstyle {\mathsf {T}}}B+B^{\scriptscriptstyle {\mathsf {T}}}A}{2}. \end{aligned}$$
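To make the two projection formulas above concrete, here is a minimal numerical sketch in Python/NumPy (our own illustration, not part of the original text; the helper names proj_normal and proj_tangent are ours). It draws a random point A on V(m, n) via a QR factorization and checks that the two projections of an arbitrary matrix B are mutually orthogonal and sum to B.

```python
import numpy as np

def proj_normal(A, B):
    """Projection of B onto the normal space of V(m, n) at A: A * sym(A^T B)."""
    return A @ ((A.T @ B + B.T @ A) / 2)

def proj_tangent(A, B):
    """Projection of B onto the tangent space of V(m, n) at A:
    A * skew(A^T B) + (I - A A^T) B."""
    n = A.shape[0]
    return A @ ((A.T @ B - B.T @ A) / 2) + (np.eye(n) - A @ A.T) @ B

rng = np.random.default_rng(0)
n, m = 7, 3
A, _ = np.linalg.qr(rng.standard_normal((n, m)))   # A in V(m, n): A^T A = I_m
B = rng.standard_normal((n, m))

PN, PT = proj_normal(A, B), proj_tangent(A, B)
print(np.allclose(PN + PT, B))            # True: the two projections sum to B
print(np.isclose(np.sum(PN * PT), 0.0))   # True: orthogonal in the Frobenius inner product
```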
Given a function \(f : {\mathbb {R}}^k\rightarrow {\mathbb {R}}\cup \{\infty \}\), the subdifferential of f at \({\mathbf {x}}\in {\mathbb {R}}^k\) is defined as (cf. [56])
If \({\mathbf {0}}\in \partial f({\mathbf {x}})\), then \({\mathbf {x}}\) is a critical point of f.
The indicator function \(\delta _{V(m,n)}:{\mathbb {R}}^{n\times m} \rightarrow {\mathbb {R}} \cup \{\infty \}\) of V(m, n) is defined by
$$\begin{aligned} \delta _{V(m,n)}(A) {:}{=}{\left\{ \begin{array}{ll} 0, &{} A\in V(m,n),\\ +\infty , &{} \text{ otherwise }. \end{array}\right. } \end{aligned}$$
An important fact about the normal space \({\text {N}}_{V(m,n)}(A)\) and the subdifferential of the indicator function \(\delta _{V(m,n)}\) of V(m, n) at A is (cf. [56])
$$\begin{aligned} \partial \delta _{V(m,n)}(A)={\text {N}}_{V(m,n)}(A)\quad \text{ for } \text{ all } A\in V(m,n). \end{aligned}$$
A.2 Morse functions
In the following, we introduce the notion of Morse functions and recall some of their basic properties. On a smooth manifold M, a smooth function \(f : M\rightarrow {\mathbb {R}}\) is called a Morse function if each critical point of f on M is nondegenerate, i.e., the Hessian matrix of f at each critical point is non-singular. The following result is well-known, see for example [51, Theorem 6.6].
Lemma 11
(Projection is Generically Morse) Let M be a submanifold of \({\mathbb {R}}^n\). For a generic \({\mathbf {a}} = (a_1,\dots ,a_n)^{\scriptscriptstyle {\mathsf {T}}}\in {\mathbb {R}}^n\), the Euclidean distance function
$$\begin{aligned} {\mathbf {x}}\mapsto \Vert {\mathbf {x}}-{\mathbf {a}}\Vert ^2=\sum _{i=1}^n(x_i-a_i)^2 \end{aligned}$$
is a Morse function on M.
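For a concrete instance of Lemma 11 (our own example, not from the original text), take \(M=S^{n-1}\subseteq {\mathbb {R}}^n\). For any \({\mathbf {a}}\ne {\mathbf {0}}\), on the sphere we have \(\Vert {\mathbf {x}}-{\mathbf {a}}\Vert ^2=1+\Vert {\mathbf {a}}\Vert ^2-2\langle {\mathbf {a}},{\mathbf {x}}\rangle \), whose only critical points are \({\mathbf {x}}=\pm {\mathbf {a}}/\Vert {\mathbf {a}}\Vert \); the Riemannian Hessian there equals \(\pm 2\Vert {\mathbf {a}}\Vert \) times the identity on the tangent space and is therefore nonsingular, so the function is Morse. The exceptional choice \({\mathbf {a}}={\mathbf {0}}\), for which the function is constant on \(S^{n-1}\) and every point is a degenerate critical point, is exactly what genericity excludes.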
We also need the following property (cf. [51, Corollary 2.3]) of nondegenerate critical points in the sequel.
Lemma 12
Let M be a manifold and let \(f:M \rightarrow {\mathbb {R}}\) be a smooth function. Nondegenerate critical points of f are isolated.
To conclude this subsection, we briefly discuss how critical points behave under a local diffeomorphism. For this purpose, we recall that a smooth manifold \(M_1\) is called locally diffeomorphic to a smooth manifold \(M_2\) if there is a smooth map \(\varphi : M_1\rightarrow M_2\) such that for each point \({\mathbf {x}}\in M_1\) there exists a neighborhood \(U\subseteq M_1\) of \({\mathbf {x}}\) and a neighborhood \(V\subseteq M_2\) of \(\varphi ({\mathbf {x}})\) such that the restriction \(\varphi |_U : U\rightarrow V\) is a diffeomorphism [20]. In this case, the corresponding \(\varphi \) is called a local diffeomorphism from \(M_1\) to \(M_2\). It is clear from the definition that whenever \(M_1\) is locally diffeomorphic to \(M_2\) then the two manifolds must have the same dimension. Moreover, we have the following result, whose proof can be found in [31, Proposition 5.2].
Proposition 11
Let a smooth manifold \(M_1\) be locally diffeomorphic to another smooth manifold \(M_2\) and let \(\varphi : M_1\rightarrow M_2\) be the corresponding local diffeomorphism. Let \(f : M_2\rightarrow {\mathbb {R}}\) be a smooth function. Then \({\mathbf {x}}\in M_1\) is a (nondegenerate) critical point of \(f\circ \varphi \) on \(M_1\) if and only if \(\varphi ({\mathbf {x}})\) is a (nondegenerate) critical point of f on \(M_2\).
When f is a smooth function on \({\mathbb {R}}^n\) and M is a submanifold of \({\mathbb {R}}^n\), we denote by \(\nabla f\) the gradient of f as a vector field on \({\mathbb {R}}^n\), while we denote by \({\text {grad}}(f)\) the Riemannian gradient of f as a vector field on M.
A.3 Kurdyka-Łojasiewicz property
In this subsection, we review some basic facts about the Kurdyka-Łojasiewicz property, which was first established for smooth definable functions in [38] by Kurdyka and later generalized to the non-smooth definable case in [9] by Bolte, Daniilidis, Lewis and Shiota. Interested readers are also referred to [4, 5, 10, 42] for more discussions and applications of the Kurdyka-Łojasiewicz property.
Let p be an extended real-valued function and let \(\partial p({\mathbf {x}})\) be the set of subgradients of p at \({\mathbf {x}}\) (cf. [56]). We define \({\text {dom}}(\partial p) {:}{=}\{{\mathbf {x}}:\partial p({\mathbf {x}} )\ne \emptyset \}\) and take \({\mathbf {x}}^*\in {\text {dom}}(\partial p)\). If there exist some \(\eta \in (0,+\infty ]\), a neighborhood U of \({\mathbf {x}}^*\), and a continuous concave function \(\varphi :[0,\eta )\rightarrow {\mathbb {R}}_+\), such that
1. \(\varphi (0)=0\) and \(\varphi \) is continuously differentiable on \((0,\eta )\),
2. for all \(s\in (0,\eta )\), \(\varphi ^{\prime }(s)>0\), and
3. for all \({\mathbf {x}}\in U\cap {\{ {\mathbf {y}}:p({\mathbf {x}}^*)<p({\mathbf {y}})<p({\mathbf {x}}^*)+\eta \}}\), the Kurdyka-Łojasiewicz inequality holds
$$\begin{aligned} \varphi ^{\prime }(p({\mathbf {x}}) - p({\mathbf {x}}^*)) {\text {dist}}({\mathbf {0}},\partial p({\mathbf {x}}))\ge 1, \end{aligned}$$
then we say that p has the Kurdyka-Łojasiewicz (abbreviated as KL) property at \({\mathbf {x}}^*\). Here \({\text {dist}}({\mathbf {0}},\partial p({\mathbf {x}} ))\) denotes the distance from \({\mathbf {0}}\) to the set \(\partial p({\mathbf {x}} )\) (see Footnote 4). If p is proper, lower semicontinuous, and has the KL property at each point of \({\text {dom}}(\partial p)\), then p is said to be a KL function. Examples of KL functions include real subanalytic functions and semi-algebraic functions [11]. In this paper, semi-algebraic functions are involved. We refer to [8] and references therein for more details on such functions. In particular, polynomial functions are semi-algebraic functions and hence KL functions. Another important fact is that the indicator function of a semi-algebraic set is also a semi-algebraic function [8, 11]. Also, a finite sum of semi-algebraic functions is again semi-algebraic. We assemble these facts to derive the following lemma which is crucial to the analysis of the global convergence of our algorithm.
Lemma 13
A finite sum of polynomial functions and indicator functions of semi-algebraic sets is a KL function.
While the KL property is used for the global convergence analysis, the Łojasiewicz inequality discussed in the rest of this subsection is used for the convergence rate analysis. The classical Łojasiewicz inequality for analytic functions is stated as follows (cf. [48]):
(Classical Łojasiewicz’s Gradient Inequality) If f is an analytic function with \(f({\mathbf {0}})=0\) and \(\nabla f({\mathbf {0}})={\mathbf {0}}\), then there exist positive constants \(\mu , \kappa ,\) and \(\epsilon \) such that
$$\begin{aligned} \Vert \nabla f({\mathbf {x}})\Vert \ge \mu |f({\mathbf {x}})|^{\kappa }\; \text{ for } \text{ all } \;\Vert {\mathbf {x}}\Vert \le \epsilon . \end{aligned}$$(70)
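For instance (our own example, not from the original text), for \(f({\mathbf {x}})=\Vert {\mathbf {x}}\Vert ^{2d}\) with a positive integer d we have \(\Vert \nabla f({\mathbf {x}})\Vert =2d\Vert {\mathbf {x}}\Vert ^{2d-1}=2d\,|f({\mathbf {x}})|^{1-\frac{1}{2d}}\), so (70) holds with \(\mu =2d\) and \(\kappa =1-\frac{1}{2d}\) for every \(\epsilon >0\); in particular \(\kappa =1/2\) when \(d=1\), while \(\kappa \) approaches 1 as f becomes flatter at the critical point.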
As pointed out in [4, 10], it is often difficult to determine the corresponding exponent \(\kappa \) in Łojasiewicz’s gradient inequality, and it is unknown for a general function. However, an estimate of the exponent \(\kappa \) in the gradient inequality is obtained by D’Acunto and Kurdyka in [17, Theorem 4.2] when f is a polynomial function. We record this fundamental result in the next lemma, which plays a key role in our sublinear convergence rate analysis.
Lemma 14
(Łojasiewicz’s Gradient Inequality for Polynomials) Let f be a real polynomial of degree d in n variables. Suppose that \(f({\mathbf {0}})=0\) and \(\nabla f({\mathbf {0}})={\mathbf {0}}\). There exist constants \(\mu , \epsilon > 0\) such that for all \(\Vert {\mathbf {x}}\Vert \le \epsilon \), we have
Below is a more precise version of the classical Łojasiewicz gradient inequality (70), in which we can take the exponent \(\kappa \) to be 1/2 near a nondegenerate critical point. For the sake of subsequent analysis, it is stated for a smooth manifold [20].
Proposition 12
(Łojasiewicz’s Gradient Inequality) Let M be a smooth manifold and let \(f: M \rightarrow {\mathbb {R}}\) be a smooth function for which \({\mathbf {z}}^*\) is a nondegenerate critical point. Then there exist a neighborhood U in M of \({\mathbf {z}}^*\) and some \(\mu >0\) such that for all \({\mathbf {z}}\in {U}\)
$$\begin{aligned} \Vert {\text {grad}}(f)({\mathbf {z}})\Vert \ge \mu |f({\mathbf {z}})-f({\mathbf {z}}^*)|^{1/2}. \end{aligned}$$
B Properties of orthonormal matrices
B.1 Polar decomposition
In this section, we establish an error bound analysis for the polar decomposition. For a positive semidefinite matrix \(H\in {\text {S}}^{n}_+\), there exists a unique positive semidefinite matrix \(P\in {\text {S}}^{n}_+\) such that \(P^2=H\). In the literature, this matrix P is called the square root of the matrix H and denoted as \(P=\sqrt{H}\). If \(H=U\varSigma U^{\scriptscriptstyle {\mathsf {T}}}\) is the eigenvalue decomposition of H, then we have \(\sqrt{H}=U\sqrt{\varSigma }U^{\scriptscriptstyle {\mathsf {T}}}\), where \(\sqrt{\varSigma }\) is the diagonal matrix whose diagonal elements are square roots of those of \(\varSigma \). The next result is classical and can be found in [24, 29].
Lemma 15
(Polar Decomposition) Let \(A\in {\mathbb {R}}^{m\times n}\) with \(m\ge n\). Then there exist an orthonormal matrix \(U\in V(n,m)\) and a unique symmetric positive semidefinite matrix \(H\in {\text {S}}^{n}_+\) such that \(A=UH\) and
Moreover, if A is of full rank, then the matrix U is uniquely determined and H is positive definite.
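The factors in Lemma 15 can be computed from a thin SVD: if \(A=P\varSigma Q^{\scriptscriptstyle {\mathsf {T}}}\), then \(U=PQ^{\scriptscriptstyle {\mathsf {T}}}\) and \(H=Q\varSigma Q^{\scriptscriptstyle {\mathsf {T}}}\). The following minimal NumPy sketch (our own illustration, not part of the original text; the helper name polar is ours) also verifies that H coincides with the square root of \(A^{\scriptscriptstyle {\mathsf {T}}}A\) computed via an eigenvalue decomposition, as described above.

```python
import numpy as np

def polar(A):
    """Polar decomposition A = U H via a thin SVD (A is m x n with m >= n)."""
    P, s, Qt = np.linalg.svd(A, full_matrices=False)   # A = P diag(s) Qt
    U = P @ Qt                                         # orthonormal polar factor, U in V(n, m)
    H = Qt.T @ np.diag(s) @ Qt                         # symmetric positive semidefinite factor
    return U, H

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
U, H = polar(A)
print(np.allclose(U @ H, A))             # True: A = U H
print(np.allclose(U.T @ U, np.eye(3)))   # True: U has orthonormal columns
w, V = np.linalg.eigh(A.T @ A)           # eigenvalue decomposition of A^T A
print(np.allclose(V @ np.diag(np.sqrt(w)) @ V.T, H))   # True: H = sqrt(A^T A)
```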
The matrix decomposition \(A=UH\) as in Lemma 15 is called the polar decomposition of the matrix A [24]. For convenience, the matrix U is referred to as a polar orthonormal factor matrix and the matrix H as the polar positive semidefinite factor matrix. The optimization reformulation (71) comes from the approximation problem
for two given matrices B and C of proper sizes. In the following, we give a global error bound for this problem. To this end, the next lemma is useful.
Lemma 16
(Error Reformulation) Let p, m, n be positive integers with \(m\ge n\) and let \(B\in {\mathbb {R}}^{m\times p}\), \(C\in {\mathbb {R}}^{n\times p}\) be two given matrices. We set \(A {:}{=}BC^{\scriptscriptstyle {\mathsf {T}}}\in {\mathbb {R}}^{m\times n}\) and suppose that \(A=WH\) is a polar decomposition of A. We have
for any orthonormal matrix \(Q\in V(n,m)\).
Proof
We have
where both the first and the fourth equalities follow from the fact that both Q and W are in V(n, m), and the last one is derived from the fact that H is symmetric and positive semidefinite by Lemma 15. \(\square \)
Given an \(n\times n\) symmetric positive semidefinite matrix H, we can define a symmetric bilinear form on \({\mathbb {R}}^{m\times n}\) by
$$\begin{aligned} \langle P,Q\rangle _H {:}{=}{\text {Tr}}(PHQ^{\scriptscriptstyle {\mathsf {T}}}) \end{aligned}$$
for all \(m\times n\) matrices P and Q. It can also induce a seminorm
$$\begin{aligned} \Vert P\Vert _H {:}{=}\sqrt{\langle P,P\rangle _H}=\Vert P\sqrt{H}\Vert _F. \end{aligned}$$
In particular, if H is positive definite, then \(\Vert \cdot \Vert _H\) is a norm on \({\mathbb {R}}^{m\times n}\). Thus, the error estimation in (72) can be viewed as a distance estimation between W and Q with respect to the distance induced by this norm. Moreover, if H is the identity matrix, then \(\Vert \cdot \Vert _H\) is simply the Frobenius norm which induces the Euclidean distance on \({\mathbb {R}}^{m\times n}\). By Lemma 16 it is easy to see that the optimizer in (71) is unique whenever A is of full rank.
The following result establishes the error estimation with respect to the Euclidean distance. Given a matrix \(A\in {\mathbb {R}}^{m\times n}\), let \(\sigma _{\min }(A)\) be the smallest singular value of A. If A is of full rank, then \(\sigma _{\min }(A)>0\).
Theorem 9
(Global Error Bound in Frobenius Norm) Let p, m, n be positive integers with \(m\ge n\) and let \(B\in {\mathbb {R}}^{m\times p}\) and \(C\in {\mathbb {R}}^{n\times p}\) be two given matrices. We set \(A:=BC^{\scriptscriptstyle {\mathsf {T}}}\in {\mathbb {R}}^{m\times n}\) and suppose that A is of full rank with the polar decomposition \(A=WH\). We have that
for any orthonormal matrix \(Q\in V(n,m)\).
Proof
We know that in this case
Therefore we may conclude that
According to Lemma 16, we can derive the desired inequality. \(\square \)
Theorem 9 is a refinement of Sun and Chen’s result (cf. [59, Theorem 4.1]), in which the right hand side of (75) has an extra factor \(\frac{1}{4}\).
B.2 Principal angles between subspaces
Given two linear subspaces \({\mathbb {U}}\) and \({\mathbb {V}}\) of dimension r in \({\mathbb {R}}^n\), the principal angles \(\{\theta _i:i=1,\dots ,r\}\) between \({\mathbb {U}}\) and \({\mathbb {V}}\) and the associated principal vectors \(\{({\mathbf {u}}_i,{\mathbf {v}}_i):i=1,\dots ,r\}\) are defined recursively by
The following result is standard; its proof can be found in [24, Section 6.4.3].
Lemma 17
For any orthonormal matrices \(U,V\in V(r,n)\) whose column spaces are \({\mathbb {U}}\) and \({\mathbb {V}}\) respectively, we have
$$\begin{aligned} \cos \theta _i=\sigma _i(U^{\scriptscriptstyle {\mathsf {T}}}V),\quad i=1,\dots ,r, \end{aligned}$$
where \(\theta _i\)’s are the principal angles between \({\mathbb {U}}\) and \({\mathbb {V}}\), and \(\sigma _i(U^{\scriptscriptstyle {\mathsf {T}}}V)\) is the i-th largest singular value of the matrix \(U^{\scriptscriptstyle {\mathsf {T}}}V\).
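Lemma 17 also gives a practical way to compute principal angles. Below is a minimal NumPy sketch (our own illustration, not part of the original text; the helper name principal_angles is ours) that recovers the angles between the column spaces of two random orthonormal matrices from the singular values of \(U^{\scriptscriptstyle {\mathsf {T}}}V\).

```python
import numpy as np

def principal_angles(U, V):
    """Principal angles (in ascending order) between the column spaces of U and V."""
    s = np.linalg.svd(U.T @ V, compute_uv=False)   # cos(theta_i) = sigma_i(U^T V)
    return np.arccos(np.clip(s, 0.0, 1.0))         # clip guards against round-off

rng = np.random.default_rng(2)
n, r = 8, 3
U, _ = np.linalg.qr(rng.standard_normal((n, r)))   # U in V(r, n)
V, _ = np.linalg.qr(rng.standard_normal((n, r)))   # V in V(r, n)
print(principal_angles(U, V))                      # r angles in [0, pi/2]
```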
Lemma 18
For any orthonormal matrices \(U,V\in V(r,n)\), we have
Proof
We recall from [24, pp. 331] that
for any \(n\times r\) matrices A, B. In particular, if \(U,V\in {{\,\mathrm{V}\,}}(r,n)\) then
This implies that
and the desired inequality follows immediately. \(\square \)
Lemma 19
For any orthonormal matrices \(U,V\in V(r,n)\), we have
Proof
We have
where the first equality follows from Lemma 17. \(\square \)
Let \({\mathbb {U}}^\perp \) be the orthogonal complement subspace of a given linear subspace \({\mathbb {U}}\) in \({\mathbb {R}}^n\). A useful fact about principal angles between two linear subspaces \({\mathbb {U}}\) and \({\mathbb {V}}\) and those between \({\mathbb {U}}^\perp \) and \({\mathbb {V}}^\perp \) is stated as follows. The proof can be found in [33, Theorem 2.7].
Lemma 20
Let \({\mathbb {U}}\) and \({\mathbb {V}}\) be two linear subspaces of the same dimension and let \(\frac{\pi }{2}\ge \theta _s \ge \dots \ge \theta _1 > 0\) be the nonzero principal angles between \({\mathbb {U}}\) and \({\mathbb {V}}\). Then the nonzero principal angles between \({\mathbb {U}}^\perp \) and \({\mathbb {V}}^\perp \) are \(\frac{\pi }{2}\ge \theta _s \ge \dots \ge \theta _1 > 0\).
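The following short NumPy check (our own illustration, not part of the original text) verifies Lemma 20 numerically: the nonzero principal angles between two random r-dimensional subspaces of \({\mathbb {R}}^n\) coincide with those between their orthogonal complements, while the complements generically pick up \(n-2r\) additional zero angles.

```python
import numpy as np

def angles(X, Y):
    """Principal angles (ascending) between the column spaces of X and Y."""
    s = np.linalg.svd(X.T @ Y, compute_uv=False)
    return np.arccos(np.clip(s, 0.0, 1.0))

rng = np.random.default_rng(3)
n, r = 8, 3
QU, _ = np.linalg.qr(rng.standard_normal((n, n)))  # first r columns span U, the rest span U^perp
QV, _ = np.linalg.qr(rng.standard_normal((n, n)))
U, U_perp = QU[:, :r], QU[:, r:]
V, V_perp = QV[:, :r], QV[:, r:]

a = angles(U, V)                 # r angles, generically all nonzero
b = angles(U_perp, V_perp)       # n - r angles, generically n - 2r of them are zero
tol = 1e-6
print(np.allclose(np.sort(a[a > tol]), np.sort(b[b > tol])))   # True: nonzero angles coincide
```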
The following result is for the general case, which might be of independent interest.
Lemma 21
Let \(m\ge n\) be positive integers and let \(V:=[V_1\ V_2]\in {{\,\mathrm{O}\,}}(m)\) with \(V_1\in V(n,m)\) and \(U\in V(n,m)\) be two given orthonormal matrices. Then, there exists an orthonormal matrix \(W\in V(m-n,m)\) such that \(P:=[U\ W]\in {{\,\mathrm{O}\,}}(m)\) and
Proof
By a simple computation, it is straightforward to verify that (80) is equivalent to
To that end, we let \(U_2\in V(m-n,m)\) be an orthonormal matrix such that \([U\ U_2]\in {{\,\mathrm{O}\,}}(m)\). Then, we have that the linear subspace \({\mathbb {U}}_2\) spanned by column vectors of \(U_2\) is the orthogonal complement of \({\mathbb {U}}_1\), which is spanned by column vectors of U. Likewise, let \({\mathbb {V}}_1\) and \({\mathbb {V}}_2\) be linear subspaces spanned by column vectors of \(V_1\) and \(V_2\) respectively.
Let \(\frac{\pi }{2}\ge \theta _s\ge \dots \ge \theta _1 > 0\) be the nonzero principal angles between \({\mathbb {U}}_2\) and \({\mathbb {V}}_2\) for some nonnegative integer \(s\le m-n\). We have by Lemmas 17, and 18 that
Let \(Q\in {{\,\mathrm{O}\,}}(m-n)\) be a polar orthogonal factor matrix of the matrix \(U_2^{\scriptscriptstyle {\mathsf {T}}}V_2\). It follows from the property of polar decomposition that
On the other hand, nonzero principal angles between \({\mathbb {U}}_1\) and \({\mathbb {V}}_1\) are \(\frac{\pi }{2}\ge \theta _s\ge \dots \ge \theta _1>0\) by Lemma 20. Therefore, by Lemmas 17, and 18, we have that
In conclusion, if we set \(W {:}{=}U_2Q\), then we have the following:
where the second equality follows from (82) and (83) and the inequality follows from (84). \(\square \)
C Proofs of technical lemmas in Section 4
C.1 Proof of Lemma 6
Proof
For each \(i\in \{0,\dots ,k-1\}\), we have (cf. (23), (36) and (41))
We first analyze the second summand in (85) by considering the following two cases:
1. If \(\sigma ^{(i+1)}_{r,[p]}\ge \epsilon \), then there is no proximal step in Algorithm 1 and we have that \(\sigma _{\min }(S^{(i+1)}_{[p]})=\sigma ^{(i+1)}_{r,[p]}\ge \epsilon \), where \(S^{(i+1)}_{[p]}\) is the polar positive semidefinite factor matrix of \(V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]}\) obtained in (39). From (35), (36) and (37) we notice that
$$\begin{aligned} ((U^{(i+1)}_{[p]})^{\scriptscriptstyle {\mathsf {T}}}V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]})_{jj}= & {} (({\mathbf {u}}^{(i+1)}_{j,[p]})^{\scriptscriptstyle {\mathsf {T}}}{\mathbf {v}}^{(i+1)}_{j,[p]})\lambda ^{i}_{j,[p]} =(({\mathbf {u}}^{(i+1)}_{j,[p]})^{\scriptscriptstyle {\mathsf {T}}}{\mathcal {A}}\tau _{i+1}{\mathbf {x}}^{(i+1)}_{j,[p]}) \lambda ^{i}_{j,[p]}\\= & {} {\mathcal {A}}\tau ({\mathbf {x}}^{(i+2)}_{j,[p]})\lambda ^{i}_{j,[p]} =\lambda ^{i+1}_{j,[p]}\lambda ^{i}_{j,[p]}, \end{aligned}$$and similarly \(((U^{(i+1)}_{[p-1]})^{\scriptscriptstyle {\mathsf {T}}}V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]})_{jj} = \lambda ^{i}_{j,[p]}\lambda ^{i}_{j,[p]}\). Hence by Lemma 16, we obtain
$$\begin{aligned} \sum _{j=1}^r\lambda ^{i}_{j,[p]}(\lambda ^{i+1}_{j,[p]}-\lambda ^{i}_{j,[p]})&={\text {Tr}}((U^{(i+1)}_{[p]})^{\scriptscriptstyle {\mathsf {T}}}V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]}) -{\text {Tr}}((U^{(i+1)}_{[p-1]})^{\scriptscriptstyle {\mathsf {T}}}V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]})\nonumber \\&=\frac{1}{2}\big \Vert (U^{(i+1)}_{[p]}-U^{(i+1)}_{[p-1]})\sqrt{S^{(i+1)}_{[p]}}\big \Vert _F^2\nonumber \\&\ge \frac{\epsilon }{2}\Vert U^{(i+1)}_{[p]}-U^{(i+1)}_{[p-1]}\Vert _F^2\nonumber \\&\ge 0. \end{aligned}$$(86) -
2.
If \(\sigma ^{(i+1)}_{r,[p]}<\epsilon \), we consider the following matrix optimization problem
$$\begin{aligned} \begin{array}{rl} \max &{}\langle V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]},U\rangle -\frac{\epsilon }{2}\Vert U-U^{(i+1)}_{[p-1]}\Vert _F^2\\ \text {s.t.}&{} U\in V(r,n_{i+1}). \end{array} \end{aligned}$$(87)Since \(U,U^{(i+1)}_{[p-1]} \in {V(r,n_{i+1})}\), we must have
$$\begin{aligned} \frac{\epsilon }{2}\Vert U- U^{(i+1)}_{[p-1]}\Vert _F^2=\epsilon r-\epsilon \langle U^{(i+1)}_{[p-1]},U\rangle . \end{aligned}$$Thus, by Lemma 15, a global maximizer of (87) is given by a polar orthonormal factor matrix of the matrix \(V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]}+\epsilon U^{(i+1)}_{[p-1]}\). By Substep 2 of Algorithm 1, \(U^{(i+1)}_{[p]}\) is a polar orthonormal factor matrix of the matrix \(V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]}+\epsilon U^{(i+1)}_{[p-1]}\), and hence a global maximizer of (87). Thus, by the optimality of \(U^{(i+1)}_{[p]}\) for (87), we have
$$\begin{aligned} \langle V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]},U^{(i+1)}_{[p]}\rangle -\frac{\epsilon }{2}\Vert U^{(i+1)}_{[p]}-U^{(i+1)}_{[p-1]}\Vert _F^2\ge \langle V^{(i+1)}_{[p]}\varLambda ^{(i+1)}_{[p]},U^{(i+1)}_{[p-1]}\rangle . \end{aligned}$$Therefore, the inequality (86) in case (1) also holds in this case.
Consequently, we have
which together with the Cauchy-Schwarz inequality implies that
Since \(f({\mathbb {U}}_{[0]})>0\), we conclude that \(f({\mathbb {U}}_{0,[1]})=\sum _{j=1}^r(\lambda ^{0}_{j,[1]})^2>0\) and hence \(\sum _{j=1}^r\lambda ^{1}_{j,[1]}\lambda ^{0}_{j,[1]}>0\) by (88). Thus, we obtain
where the first inequality follows from (89) and the second from (88). Combining (90) with (85) and (86), we may obtain (44) for \(i=0\) and \(p=1\).
On the one hand, from (89) we obtain
Since \(\sum _{j=1}^r (\lambda _{j,[p]}^i)^2 > 0\) if there is no truncation, we must have
i.e., the objective function f is monotonically increasing during the APD iteration as long as there is no truncation. On the other hand, there are at most r truncations and hence the total loss of f by the truncation is at most \(r\kappa ^2<f({\mathbb {U}}_{[0]})\). Therefore, f is always positive and according to (88), we may conclude that \(\sum _{j=1}^r\lambda ^{i+1}_{j,[p]}\lambda ^{i}_{j,[p]}>0\) along iterations. By induction on p, we obtain
which together with (85) and (86), implies (44) for arbitrary nonnegative integer p. \(\square \)
C.2 Proof of Lemma 7
Proof
The set of subgradients of h can be partitioned as follows:
Following the notation of Algorithm 1, we set
where \({\mathbf {u}}^{(i)}_{j,[p+1]}\) is the j-th column of the matrix \(U^{(i)}_{[p+1]}\) for all \(i\in \{1,\dots ,k\}\),
for all \(i\in \{1,\dots ,k\}\) and
By (39) and (87), we have
where \(\alpha \in \{0,\epsilon \}\) depending on whether or not there is a proximal correction. According to (69) and (94), we have
which implies that \(V^{(i)}_{[p+1]}\varLambda ^{(i)}_{[p+1]}+\alpha \big (U^{(i)}_{[p]}-U^{(i)}_{[p+1]}\big )\in \partial \delta _{V(r,n_i)}(U^{(i)}_{[p+1]})\). If we take
then we have
On the other hand,
where the third inequality follows from the fact that
and a similar formula for \(\varLambda -\varLambda ^{(i)}_{[p+1]}\), the fourth follows from the fact that
for any vector \({\mathbf {x}}:=({\mathbf {x}}_1,\dots ,{\mathbf {x}}_k)\) with \(\Vert {\mathbf {x}}_i\Vert =1\) for all \(i=1,\dots ,k\) and the last one follows from \(\alpha \le \epsilon \). This, together with (91), implies (48). \(\square \)
C.3 Proof of Lemma 9
Proof
We let
where \(\alpha \in \{0,\epsilon \}\) depending on whether there is a proximal correction or not (cf. the proof of Lemma 7). It follows from Lemma 7 that
for some constant \(\gamma _0>0\). By Algorithm 1, we have
where \(S^{(i)}_{[p+1]}\) is a symmetric positive semidefinite matrix. Since \(U^{(i)}_{[p+1]}\in V(r,n_i)\) is an orthonormal matrix, we have
where the second equality follows from the symmetry of the matrix \(S^{(i)}_{[p+1]}\).
Consequently, we have
where \(\gamma _1 = \gamma _0+\epsilon \). Next, we derive an estimate for the second summand on the right-hand side of (98). To do this, we notice that
where \(\gamma _2>0\) is some constant, the first equality follows from (96), the second from (97), the second inequality (see Footnote 5) follows from the fact that \(U^{(i)}_{[p+1]}\in V(r,n_i)\) and the last inequality from the relation
which is obtained by Lemma 19. The desired inequality can be derived easily from (98) and (99). \(\square \)
About this article
Cite this article
Hu, S., Ye, K. Linear convergence of an alternating polar decomposition method for low rank orthogonal tensor approximations. Math. Program. 199, 1305–1364 (2023). https://doi.org/10.1007/s10107-022-01867-8
Keywords
- Orthogonally decomposable tensors
- Low rank orthogonal tensor approximation
- R-linear convergence
- Sublinear convergence
- Global convergence