Abstract
The problem of optimizing a quadratic form over an orthogonality constraint (QP-OC for short) is one of the most fundamental matrix optimization problems and arises in many applications. In this paper, we characterize the growth behavior of the objective function around the critical points of the QP-OC problem and demonstrate how such characterization can be used to obtain strong convergence rate results for iterative methods that exploit the manifold structure of the orthogonality constraint (i.e., the Stiefel manifold) to find a critical point of the problem. Specifically, our primary contribution is to show that the Łojasiewicz exponent at any critical point of the QP-OC problem is 1/2. Such a result is significant, as it expands the currently very limited repertoire of optimization problems for which the Łojasiewicz exponent is explicitly known. Moreover, it allows us to show, in a unified manner and for the first time, that a large family of retraction-based line-search methods will converge linearly to a critical point of the QP-OC problem. Then, as our secondary contribution, we propose a stochastic variance-reduced gradient (SVRG) method called Stiefel-SVRG for solving the QP-OC problem and present a novel Łojasiewicz inequality-based linear convergence analysis of the method. An important feature of Stiefel-SVRG is that it allows for general retractions and does not require the computation of any vector transport on the Stiefel manifold. As such, it is computationally more advantageous than other recently proposed SVRG-type algorithms for manifold optimization.
Notes
Such an assumption is omitted in the original text of [3] but is needed for the result in [3, Section 4.8.2] to hold. The omission is corrected in the online errata at https://sites.uclouvain.be/absil/amsbook/errata.html.
That is, there exist constants \(r_0>0\), \(r_1\in (0,1)\) and index \(K\ge 0\) such that \(\Vert X^k-X^*\Vert _F \le r_0r_1^k\) for all \(k\ge K\).
Stiefel-SVRG was first presented by the second author at the 13th Chinese Workshop on Machine Learning and Applications held in Nanjing, China in 2015 [45]. As such, it predates the SVRG methods for manifold optimization developed in [37, 58]. More importantly, Stiefel-SVRG does not require the computation of any vector transport, which makes it computationally more advantageous than the SVRG methods proposed in [37, 58].
References
Abrudan, T.E., Eriksson, J., Koivunen, V.: Steepest descent algorithms for optimization under unitary matrix constraint. IEEE Trans. Signal Process. 56(3), 1134–1147 (2008)
Absil, P.-A., Mahony, R., Andrews, B.: Convergence of the iterates of descent methods for analytic cost functions. SIAM J. Optim. 16(2), 531–547 (2005)
Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2008)
Absil, P.-A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22(1), 135–158 (2012)
Agarwal, A., Anandkumar, A., Jain, P., Netrapalli, P.: Learning sparsely used overcomplete dictionaries via alternating minimization. SIAM J. Optim. 26(4), 2775–2799 (2016)
Bolla, M., Michaletzky, G., Tusnády, G., Ziermann, M.: Extrema of sums of heterogeneous quadratic forms. Linear Algebra Appl. 269(1–3), 331–365 (1998)
Bolte, J., Daniilidis, A., Ley, O., Mazet, L.: Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity. Trans. Am. Math. Soc. 362(6), 3319–3363 (2010)
Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. Ser. A 165(2), 471–507 (2017)
Bonnabel, S.: Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Autom. Control 58(9), 2217–2229 (2013)
Candès, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inf. Theory 61(4), 1985–2007 (2015)
Chang, X.-W., Paige, C.C., Stewart, G.W.: Perturbation analyses for the QR factorization. SIAM J. Matrix Anal. Appl. 18(3), 775–791 (1997)
Dieci, L., Eirola, T.: On smooth decompositions of matrices. SIAM J. Matrix Anal. Appl. 20(3), 800–819 (1999)
Feehan, P.M.N.: Global existence and convergence of solutions to gradient systems and applications to Yang–Mills gradient flow. Monograph. arxiv.org/abs/1409.1525 (2014)
Forti, M., Nistri, P., Quincampoix, M.: Convergence of neural networks for programming problems via a nonsmooth Łojasiewicz inequality. IEEE Trans. Neural Netw. 17(6), 1471–1486 (2006)
Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore, MD (1996)
Hardt, M.: Understanding alternating minimization for matrix completion. In: Proceedings of the 55th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2014), pp. 651–660 (2014)
Hou, K., Zhou, Z., So, A. M.-C., Luo, Z.-Q.: On the linear convergence of the proximal gradient method for trace norm regularization. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26: Proceedings of the 2013 Conference, pp. 710–718 (2013)
Jain, P., Oh, S.: Provable tensor factorization with missing data. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27: Proceedings of the 2014 Conference, pp. 1431–1439 (2014)
Jiang, B., Dai, Y.-H.: A framework of constraint preserving update schemes for optimization on Stiefel manifold. Math. Program. Ser. A 153(2), 535–575 (2015)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26: Proceedings of the 2013 Conference, pp. 315–323 (2013)
Kaneko, T., Fiori, S., Tanaka, T.: Empirical arithmetic averaging over the compact Stiefel manifold. IEEE Trans. Signal Process. 61(4), 883–894 (2013)
Kokiopoulou, E., Chen, J., Saad, Y.: Trace optimization and eigenproblems in dimension reduction methods. Numer. Linear Algebra Appl. 18(3), 565–602 (2011)
Li, G., Mordukhovich, B.S., Phạm, T.S.: New fractional error bounds for polynomial systems with applications to Hölderian stability in optimization and spectral theory of tensors. Math. Program. Ser. A 153(2), 333–362 (2015)
Li, G., Pong, T.K.: Calculus of the exponent of Kurdyka–Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. (2017)
Liu, H., Wu, W., So, A.M.-C.: Quadratic optimization with orthogonality constraints: explicit Łojasiewicz exponent and linear convergence of line-search methods. In: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp. 1158–1167 (2016)
Liu, H., Yue, M.-C., So, A.M.-C.: On the estimation performance and convergence rate of the generalized power method for phase synchronization. SIAM J. Optim. 27(4), 2426–2446 (2017)
Luo, Z.-Q.: New error bounds and their applications to convergence analysis of iterative algorithms. Math. Program. Ser. B 88(2), 341–355 (2000)
Luo, Z.-Q., Pang, J.-S.: Error bounds for analytic systems and their applications. Math. Program. 67(1), 1–28 (1994)
Luo, Z.-Q., Sturm, J.F.: Error bounds for quadratic systems. In: Frenk, H., Roos, K., Terlaky, T., Zhang, S. (eds.) High Performance Optimization, Volume 33 of Applied Optimization, pp. 383–404. Springer, Dordrecht (2000)
Luo, Z.-Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46(1), 157–178 (1993)
Manton, J.H.: Optimization algorithms exploiting unitary constraints. IEEE Trans. Signal Process. 50(3), 635–650 (2002)
Merlet, B., Nguyen, T.N.: Convergence to equilibrium for discretizations of gradient-like flows on Riemannian manifolds. Differ. Integral Equ. 26(5–6), 571–602 (2013)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Boston (2004)
Netrapalli, P., Jain, P., Sanghavi, S.: Phase retrieval using alternating minimization. IEEE Trans. Signal Process. 63(18), 4814–4826 (2015)
Saad, Y.: Numerical Methods for Large Eigenvalue Problems, revised edn. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, Philadelphia (2011)
Sato, H., Iwai, T.: A Riemannian optimization approach to the matrix singular value decomposition. SIAM J. Optim. 23(1), 188–212 (2013)
Sato, H., Kasai, H., Mishra, B.: Riemannian stochastic variance reduced gradient. Manuscript, arxiv.org/abs/1702.05594 (2017)
Schneider, R., Uschmajew, A.: Convergence results for projected line-search methods on varieties of low-rank matrices via Łojasiewicz inequality. SIAM J. Optim. 25(1), 622–646 (2015)
Schönemann, P.H.: A generalized solution of the orthogonal Procrustes problem. Psychometrika 31(1), 1–10 (1966)
Schönemann, P.H.: On two-sided orthogonal Procrustes problems. Psychometrika 33(1), 19–33 (1968)
Shamir, O.: A stochastic PCA and SVD algorithm with an exponential convergence rate. In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 144–152 (2015)
Shamir, O.: Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp. 248–256 (2016)
Smith, S.T.: Optimization techniques on Riemannian manifolds. In: Bloch, A. (ed.) Hamiltonian and Gradient Flows, Algorithms and Control. Fields Institute Communications, pp. 113–136. American Mathematical Society, Providence (1994)
So, A.M.-C.: Moment inequalities for sums of random matrices and their applications in optimization. Math. Program. Ser. A 130(1), 125–151 (2011)
So, A.M.-C.: Pinning down the Łojasiewicz exponent: towards understanding the convergence behavior of first-order methods for structured non-convex optimization problems. Slides. http://lamda.nju.edu.cn/conf/mla15/files/suwz.pdf (2015)
So, A.M.-C., Zhou, Z.: Non-asymptotic convergence analysis of inexact gradient methods for machine learning without strong convexity. Optim. Methods Softw. 32(4), 963–992 (2017)
Sun, J.: On perturbation bounds for the QR factorization. Linear Algebra Appl. 215, 95–111 (1995)
Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. Found. Comput. Math. (2017)
Sun, J., Qu, Q., Wright, J.: Complete dictionary recovery over the sphere I: overview and the geometric picture. IEEE Trans. Inf. Theory 63(2), 853–884 (2017)
Sun, J., Qu, Q., Wright, J.: Complete dictionary recovery over the sphere II: recovery by Riemannian trust-region method. IEEE Trans. Inf. Theory 63(2), 885–914 (2017)
Sun, R., Luo, Z.-Q.: Guaranteed matrix completion via non-convex factorization. IEEE Trans. Inf. Theory 62(11), 6535–6579 (2016)
Sun, W.W., Lu, J., Liu, H., Cheng, G.: Provable sparse tensor decomposition. J. R. Stat. Soc. B 79(3), 899–916 (2017)
Udrişte, C.: Convex Functions and Optimization Methods on Riemannian Manifolds, Volume 297 of Mathematics and Its Applications. Springer, Dordrecht (1994)
Uschmajew, A.: A new convergence proof for the higher-order power method and generalizations. Pac. J. Optim. 11(2), 309–321 (2015)
Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. Ser. A 142(1–2), 397–434 (2013)
Yang, Y.: Globally convergent optimization algorithms on Riemannian manifolds: uniform framework for unconstrained and constrained optimization. J. Optim. Theory Appl. 132(2), 245–265 (2007)
Yger, F., Berar, M., Gasso, G., Rakotomamonjy, A.: Adaptive canonical correlation analysis based on matrix manifolds. In: Proceedings of the 29th International Conference on Machine Learning (ICML 2012), pp. 1071–1078 (2012)
Zhang, H., Reddi, S. J., Sra, S.: Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29: Proceedings of the 2016 Conference, pp. 4592–4600 (2016)
Zhang, H., Sra, S.: First-order methods for geodesically convex optimization. In: Feldman, V., Rakhlin, A., Shamir, O. (eds.) Proceedings of the 29th Annual Conference on Learning Theory (COLT 2016), Volume 49 of Proceedings of Machine Learning Research, pp. 1617–1638 (2016)
Zheng, Q., Lafferty, J.: A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28: Proceedings of the 2015 Conference, pp. 109–117 (2015)
Zhong, Y., Boumal, N.: Near-optimal bounds for phase synchronization. SIAM J. Optim. 28(2), 989–1016 (2018)
Zhou, Z., So, A.M.-C.: A unified approach to error bounds for structured convex optimization problems. Math. Program. Ser. A 165(2), 689–728 (2017)
Zhou, Z., Zhang, Q., So, A.M.-C.: \(\ell _{1,p}\)-norm regularization: error bounds and convergence rate analysis of first-order methods. In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 1501–1510 (2015)
Acknowledgements
We thank the associate editor for coordinating the review of our manuscript and the anonymous reviewer for his/her detailed comments.
Additional information
A preliminary version of this work has appeared in the Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), 2016 [25]. This research is supported in part by the Hong Kong Research Grants Council (RGC) General Research Fund (GRF) Projects CUHK 14205314, CUHK 14206814, and CUHK 14208117.
Appendix
Proof of Proposition 4
Observe that given any \(X\in \mathcal {X}_{h,\Pi }\), we can write
Thus, if \(X \in \mathcal {X}_{h,\Pi } \cap \mathcal {X}_{h',\Pi '}\), then
for any \(P_i \in \mathcal {O}^{s_i-s_{i-1}}\) (\(i=1,\ldots ,n_A\)) and \(Q_j \in \mathcal {O}^{t_j-t_{j-1}}\) (\(j=1,\ldots ,n_B\)). This implies that \(\mathcal {X}_{h,\Pi } = \mathcal {X}_{h',\Pi '}\).
Now, suppose that \(\mathcal {X}_{h,\Pi } \cap \mathcal {X}_{h',\Pi '}=\emptyset \). Let \(X\in \mathcal {X}_{h,\Pi }\) and \(X'\in \mathcal {X}_{h',\Pi '}\) be arbitrary. Then, there exist \(P_i \in \mathcal {O}^{s_i-s_{i-1}}\) (\(i=1,\ldots ,n_A\)) and \(Q_j \in \mathcal {O}^{t_j-t_{j-1}}\) (\(j=1,\ldots ,n_B\)) such that
Consider the following block decomposition of \(E(h)\Pi \) (and similarly for \(E(h')\Pi '\)):
where \(E_{i,j}(h,\Pi )\in \mathbb R^{(s_i-s_{i-1})\times (t_j-t_{j-1})}\) for \(i=1,\ldots ,n_A\) and \(j=1,\ldots ,n_B\). Let \(|E_{i,j}(h,\Pi )|\) be the number of ones in \(E_{i,j}(h,\Pi )\). We then have two cases:
Case 1. There exist \(i\in \{1,\ldots ,n_A\}\) and \(j\in \{1,\ldots ,n_B\}\) such that \(|E_{i,j}(h,\Pi )| \not = |E_{i,j}(h',\Pi ')|\).
It can be seen from (11) that for any \(u\in \mathcal {H}\), every column of E(u) has exactly one 1. Hence, for any \(u\in \mathcal {H}\) and \(\Phi \in \mathcal {P}^n\), every column of \(E(u)\Phi \) also has exactly one 1. In particular, we have
which implies that \(|E_{i',j}(h,\Pi )| \not = |E_{i',j}(h',\Pi ')|\) for some \(i'\in \{1,\ldots ,n_A\}{\setminus }\{i\}\). Now, we compute
Both terms in (49) are instances of the two-sided orthogonal Procrustes problem and admit the following characterization [40]:
Here, \(K=\min \{s_i-s_{i-1},t_j-t_{j-1}\}\), \(K'=\min \{s_{i'}-s_{i'-1},t_j-t_{j-1}\}\), and \(\sigma _k(Y)\) is the kth largest singular value of Y. Observe that for any \(\alpha \in \{1,\ldots ,n_A\}\), \(\beta \in \{1,\ldots ,n_B\}\), \(u\in \mathcal {H}\), and \(\Phi \in \mathcal {P}^n\), every non-zero row and every non-zero column of \(E_{\alpha ,\beta }(u,\Phi )\) has exactly one 1. It follows that the singular values of \(E_{\alpha ,\beta }(u,\Phi )\) are either 0 or 1, and there are \(|E_{\alpha ,\beta }(u,\Phi )|\) of the latter. Since \(|E_{i,j}(h,\Pi )| \not = |E_{i,j}(h',\Pi ')|\) and \(|E_{i',j}(h,\Pi )| \not = |E_{i',j}(h',\Pi ')|\), we conclude from (49) that \(\Vert X-X'\Vert _F^2 \ge 2\).
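For the reader's convenience, we recall the classical two-sided orthogonal Procrustes identity that underlies the characterization invoked from [40]; this is a restatement in the form we believe is being used (the paper's own display gives the version specialized to the blocks \(E_{i,j}\)):

```latex
\min_{U \in \mathcal{O}^{a},\, V \in \mathcal{O}^{b}}
  \bigl\| U^{T} Y V - Y' \bigr\|_F^2
  \;=\; \sum_{k=1}^{\min\{a,b\}} \bigl( \sigma_k(Y) - \sigma_k(Y') \bigr)^2 ,
  \qquad Y,\, Y' \in \mathbb{R}^{a \times b},
```

where \(\sigma_k(\cdot)\) denotes the kth largest singular value. Since the singular values of the blocks \(E_{i,j}(\cdot,\cdot)\) are all 0 or 1, differing counts of ones in a pair of blocks force a contribution of at least 1 from each of the two terms in (49).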
Case 2. \(|E_{i,j}(h,\Pi )| = |E_{i,j}(h',\Pi ')|\) for \(i=1,\ldots ,n_A\) and \(j=1,\ldots ,n_B\).
We show that \(X=X'\) in this case, which would then contradict the assumption that \(\mathcal {X}_{h,\Pi } \cap \mathcal {X}_{h',\Pi '}=\emptyset \). To begin, let \(i\in \{1,\ldots ,n_A\}\) be arbitrary and consider the ith block row of \(E(h)\Pi \) and \(E(h')\Pi '\); i.e.,
By (11), every non-zero row of \(\mathrm{BlkRow}_i(E(h)\Pi )\) and \(\mathrm{BlkRow}_i \left( E(h')\Pi ' \right) \) has exactly one 1. Moreover, we have \(|E_{i,j}(h,\Pi )| = |E_{i,j}(h',\Pi ')|\) for \(j=1,\ldots ,n_B\) by assumption. Hence, we can find permutation matrices \(\Phi _{i,1},\Phi _{i,2},\ldots ,\Phi _{i,n_B}\in \mathcal {P}^{s_i-s_{i-1}}\) such that for \(j=1,\ldots ,n_B\),
-
(i)
the indices of the rows of \(\Phi _{i,j} \left( \Phi _{i,j-1}\Phi _{i,j-2}\cdots \Phi _{i,1}E_{i,j}(h,\Pi ) \right) \) that contain a 1 are the same as those of \(E_{i,j}(h',\Pi ')\) that contain a 1 (i.e., the kth row of \(\Phi _{i,j} \left( \Phi _{i,j-1}\Phi _{i,j-2}\cdots \Phi _{i,1}E_{i,j}(h,\Pi ) \right) \) contains a 1 if and only if the kth row of \(E_{i,j}(h',\Pi ')\) contains a 1, where \(k\in \{1,\ldots ,s_i-s_{i-1}\}\));
-
(ii)
the indices of the rows of \(\left( \Phi _{i,j-1}\Phi _{i,j-2}\cdots \Phi _{i,1}\right) \left[ E_{i,1}(h,\Pi ) \, \cdots \, E_{i,j-1}(h,\Pi ) \right] \) that contain a 1 are fixed by \(\Phi _{i,j}\) (i.e., if the kth row of \(\left( \Phi _{i,j-1}\Phi _{i,j-2}\cdots \Phi _{i,1}\right) \left[ E_{i,1}(h,\Pi ) \, \cdots \, E_{i,j-1}(h,\Pi ) \right] \) contains a 1, then \(\Phi _{i,j}e_k=e_k\), where \(e_k\) is the kth standard basis vector of \(\mathbb R^{s_i-s_{i-1}}\) and \(k\in \{1,\ldots ,s_i-s_{i-1}\}\)).
Upon letting \(\Phi _i = \Phi _{i,n_B}\Phi _{i,n_B-1}\cdots \Phi _{i,1} \in \mathcal {P}^{s_i-s_{i-1}}\) and using properties (i) and (ii) above, we see that the indices of the rows of \(\Phi _iE_{i,j}(h,\Pi )\) that contain a 1 are the same as those of \(E_{i,j}(h',\Pi ')\) that contain a 1 for \(j=1,\ldots ,n_B\).
Next, let \(j\in \{1,\ldots ,n_B\}\) be arbitrary and consider the jth block column of \(\mathrm{BlkDiag}(\Phi _1,\ldots ,\Phi _{n_A})\cdot E(h) \cdot \Pi \) and \(E(h')\Pi '\); i.e.,
By (11), each column of \(\mathrm{BlkCol}_j \left( \mathrm{BlkDiag}(\Phi _1,\ldots ,\Phi _{n_A}) \cdot E(h) \cdot \Pi \right) \) and \(\mathrm{BlkCol}_j \left( E(h')\Pi ' \right) \) has exactly one 1. Since \(|E_{i,j}(h,\Pi )| = |E_{i,j}(h',\Pi ')|\) for \(i=1,\ldots ,n_A\) by assumption, we have \(|\Phi _iE_{i,j}(h,\Pi )| = |E_{i,j}(h',\Pi ')|\). Moreover, by the definition of \(\Phi _1,\ldots ,\Phi _{n_A}\), the indices of the rows of \(\Phi _iE_{i,j}(h,\Pi )\) that contain a 1 are the same as those of \(E_{i,j}(h',\Pi ')\) that contain a 1. Thus, there exists a permutation matrix \(\Psi _j\in \mathcal {P}^{t_j-t_{j-1}}\) such that
In particular, we obtain
Since a permutation matrix is also an orthogonal matrix, we conclude from (48) that \(\Vert X-X'\Vert _F^2=0\), or equivalently, \(X=X'\), as desired.
Proof of Proposition 5
Using (17) and (18), it can be verified that
Since, up to a permutation of the rows, \(\bar{E}_j\) takes the form (19), in order to obtain the desired bound on \(\mathrm{dist}^2(X,\mathcal {X}_{h,\Pi })\) it remains to prove the following:
Lemma 1
Let \(S = \begin{bmatrix} S_1\\S_2 \end{bmatrix} \in \mathrm{St}(p,q)\) be given, with \(S_1 \in \mathbb R^{q\times q}\) and \(S_2 \in \mathbb R^{(p-q)\times q}\). Consider the following problem:
Suppose that \(v^*<1\). Then, we have \(\Vert S_2\Vert _F^2 \le v^* \le 2\Vert S_2\Vert _F^2\).
Proof
Since
it suffices to consider the problem
Problem (50) is an instance of the orthogonal Procrustes problem, whose optimal solution is given by \(X^*=UV^T\), where \(S_1=U\Sigma V^T\) is the singular value decomposition of \(S_1\) [39]. It follows that
Now, since \(S\in \mathrm{St}(p,q)\), we have \(S^TS = S_1^TS_1 + S_2^TS_2 = I_q\), or equivalently,
This implies that \(\mathbf 0\preceq \Sigma \preceq I_q\) and
It follows that
This, together with the fact that \(\Vert S_2\Vert _F^2 \le v^* < 1\), yields the desired result. \(\square \)
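The closed-form Procrustes solution invoked in this proof is easy to check numerically. The following sketch (NumPy assumed; the matrix \(S_1\) is synthetic) verifies that the SVD-based solution \(X^*=UV^T\) of [39] is at least as close to \(S_1\) in Frobenius norm as any other orthogonal candidate:

```python
import numpy as np

rng = np.random.default_rng(0)

def procrustes_opt(S1):
    """Closest orthogonal matrix to S1 in Frobenius norm: X* = U V^T,
    where S1 = U Sigma V^T is an SVD (Schonemann's solution [39])."""
    U, _, Vt = np.linalg.svd(S1)
    return U @ Vt

def rand_orth(q):
    """Random orthogonal matrix via QR of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((q, q)))
    return Q * np.sign(np.diag(R))  # column sign fix

q = 4
S1 = rng.standard_normal((q, q)) * 0.3
Xstar = procrustes_opt(S1)
best = np.linalg.norm(S1 - Xstar)
# X* should beat (or tie) every other orthogonal candidate we try
worst_gap = min(np.linalg.norm(S1 - rand_orth(q)) - best for _ in range(200))
print(worst_gap >= -1e-10)
```

The sign fix in `rand_orth` only serves to vary the candidates; optimality of \(UV^T\) is what the comparison exercises.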
Proof of Proposition 6
Recall that
Upon observing that \(AP^*=P^*A\), \(BQ^*=Q^*B\) and using (9), (18), we compute
Now, observe that the columns of \(\bar{X}\) are orthonormal and span an n-dimensional subspace \(\mathcal {L}\). In particular, for \(j=1,\ldots ,n_B\), each column of \(A\bar{X}_j\) can be decomposed as \(u+v\), where u is a linear combination of the columns of \(\bar{X}\) and \(v\in \mathcal {L}^\perp \), the orthogonal complement of \(\mathcal {L}\). In view of the structure of \(\bar{X}\) in (18), this leads to
where \(T_j \in \mathbb R^{m\times (t_j-t_{j-1})}\) is formed by projecting the columns of \(A\bar{X}_j\) onto \(\mathcal {L}^\perp \). Hence,
where \(\lambda _B=\min \{\lambda _{B,g},\lambda _{B,s}\}\), \(\lambda _{B,g} = \min _{j\in \{1,\ldots ,n_B-1\}} (b_{t_j}-b_{t_{j+1}}) > 0\), and \(\lambda _{B,s} = \min _{j\in \{1,\ldots ,n_B\}} |b_{t_j}|>0\). By combining the above with (51), the proof is completed.
Proof of Proposition 7
Consider a fixed \(j\in \{1,\ldots ,n_B\}\). Let \(\Delta _k\) be the kth column of \(A\bar{X}_j-\bar{X}_j\bar{X}_j^TA\bar{X}_j\), where \(k=1,\ldots ,t_j-t_{j-1}\). Since
our goal is to establish a lower bound on \(\Vert \Delta _k\Vert _2^2\) for \(k=1,\ldots ,t_j-t_{j-1}\). Towards that end, let \(\bar{x}_k\) be the kth column of \(\bar{X}_j\) and \((\bar{x}_k)_\alpha \) be the \(\alpha \)th entry of \(\bar{x}_k\), where \(k=1,\ldots ,t_j-t_{j-1}\) and \(\alpha =1,\ldots ,m\). Then, we can write
Suppose that \(\mathrm{dist}(X,\mathcal {X}_{h,\Pi }) = \left\| \bar{X} - E(h) \Pi \right\| _F = \tau \) for some \(\tau \in (0,1)\). Using the representations of \(\bar{X}\) and \(E(h)\Pi \) in (18), we have
where \(\iota (k)\) is the coordinate of the kth column of \(\bar{E}_j(h)\) that equals 1. Now, by (52),
Let \(\mathrm{proj}_{\mathcal {I}_j}\) be the projector onto the coordinates in \(\mathcal {I}_j = \left\{ k \in \{1,\ldots ,m\} : \left[ \bar{E}_j(h) \right] _k = \mathbf 0\right\} \) (recall that \(\left[ \bar{E}_j(h) \right] _k\) is the kth row of \(\bar{E}_j(h)\)). Clearly, we have
where
Let \(\lambda _{A,m}=\max _{i\in \{1,\ldots ,n_A\}} |a_{s_i}|\) be the largest (in magnitude) eigenvalue of A. Using (53) and the fact that \(\iota (k)\not =\iota (\ell )\) whenever \(k\not =\ell \), we bound
and
This implies that \(|\nu _\ell | \le \lambda _{A,m}(m\tau ^2+2\tau )\) for \(\ell =1,\ldots ,t_j-t_{j-1}\). Moreover, since \(\bar{x}_1,\ldots ,\bar{x}_{t_j-t_{j-1}}\) are the columns of \(\bar{X}_j\), by Proposition 5, the definition of \(\mathcal {I}_j\), and the assumption that \(\mathrm{dist}(X,\mathcal {X}_{h,\Pi }) = \tau \), we have
It follows from (54) that
Next, we bound the first term on the right-hand side of the above inequality. Considering the structure of A in (8), let \(i'\in \{0,1,\ldots ,n_A-1\}\) be such that \(s_{i'}+1 \le \iota (k) \le s_{i'+1}\) and recall that \(\lambda _{A,g} = \min _{i\in \{1,\ldots ,n_A-1\}} (a_{s_i}-a_{s_{i+1}}) > 0\). Then, we have
To bound the term \(\left\| \mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k) \right\| _2^2\), we proceed as follows. Let \(\bar{Y} = XQ^*\Pi ^T \in \mathrm{St}(m,n)\). Then, we have \(\bar{X} = (P^*)^TXQ^* = (P^*)^T\bar{Y}\Pi \) and
We are now interested in locating the entries of \(\mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k)\) in the matrix \((P^*)^T\bar{Y}\). Towards that end, recall that \(P^*=\mathrm{BlkDiag}\left( P_1^*,\ldots ,P_{n_A}^* \right) \) and consider the decomposition
where \(P_i^* \in \mathcal {O}^{s_i-s_{i-1}}\) and \(\bar{Y}_{i,i} \in \mathbb R^{(s_i-s_{i-1})\times h_i}\), for \(i=1,\ldots ,n_A\). Since \(\iota (k)\) is the coordinate of the kth column of \(\bar{E}_j(h)\) that equals 1 and \(s_{i'}+1 \le \iota (k) \le s_{i'+1}\), we see from (10) and (11) that the kth column of \(\bar{E}_j(h)\) belongs to
As \(\bar{x}_k\) is the kth column of \(\bar{X}_j\) and \(\left\| \bar{X} - E(h)\Pi \right\| _F^2 = \sum _{j=1}^{n_B} \Vert \bar{X}_j - \bar{E}_j(h)\Vert _F^2\), it follows that all the entries of \(\mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k)\) lie in \((P_{i'+1}^*)^T\bar{Y}_{i'+1,i'+1}\). Furthermore, by (58) and the definition of \(\mathcal {I}_j\), the entries of \(\mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k)\) do not intersect the diagonal of the top \(h_{i'+1}\times h_{i'+1}\) block of \((P_{i'+1}^*)^T\bar{Y}_{i'+1,i'+1}\). Consequently, we have
To obtain an upper bound on the right-hand side of (59), we need the following lemma:
Lemma 2
Consider the decomposition of \((P^*)^T\bar{Y}\) in (57). For \(i=1,\ldots ,n_A\), let
Suppose that \(v_i^*<1\). Then, we have
Let us defer the proof of Lemma 2 to the end of this section. Now, observe that by (11) and (17),
Since \(\mathrm{dist}(X,\mathcal {X}_{h,\Pi }) = \tau \) for some \(\tau \in (0,1)\), we have \(\sum _{1\le i\not =j\le n_A}\Vert \bar{Y}_{i,j}\Vert _F^2 \le \tau ^2\) from (61). Hence, by Lemma 2 and (59), we have
and
This, together with (55), (56) and the fact that the implications
hold for any \(a,b,c\in \mathbb R\), yields
It follows that
(recall that \(\left[ \bar{X}_j\right] _k\) is the kth row of \(\bar{X}_j\)). Upon summing both sides of the above inequality over \(j=1,\ldots ,n_B\) and using Proposition 5 and the assumption that \(\mathrm{dist}(X,\mathcal {X}_{h,\Pi })=\tau \), we obtain
whenever \(\tau \in (0,1)\) satisfies
To complete the proof, it remains to prove Lemma 2.
Proof of Lemma 2
Consider a fixed \(i\in \{1,\ldots ,n_A\}\). Note that Problem (60) is again an instance of the orthogonal Procrustes problem. Hence, by the result in [39], an optimal solution to Problem (60) is given by
where \(\bar{Y}_{i,i}=H_i \begin{bmatrix} \Sigma _i \\ \mathbf 0\end{bmatrix} W_i^T\) is a singular value decomposition of \(\bar{Y}_{i,i}\). It follows from (60) that
Now, since \(\bar{Y}\in \mathrm{St}(m,n)\), we have
or equivalently,
By following the arguments in the proof of Lemma 1, we conclude that
as desired. \(\square \)
Second-order boundedness of some retractions on \(\mathrm{St}(m,n)\)
1.1 Second-order boundedness of \(R_\mathsf{polar}\)
Let \(X\in \mathrm{St}(m,n)\) and \(\xi \in T(X)\) be arbitrary. By definition, we have
Let \(\xi ^T\xi =U\Sigma U^T\) be a spectral decomposition of \(\xi ^T\xi \) with \(\Sigma =\mathrm{Diag}(\lambda _1,\ldots ,\lambda _n)\) and \(\lambda _1,\ldots ,\lambda _n\ge 0\). Then, a simple calculation yields
Since \(\Vert X+\xi \Vert \le \Vert X\Vert +\Vert \xi \Vert \le 1+\Vert \xi \Vert _F\), we conclude that whenever \(\Vert \xi \Vert _F \le 1\),
i.e., \(R_\mathsf{polar}\) satisfies Property (P) with \(\phi =M=1\).
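As a sanity check, the polar retraction and the bound just derived can be verified numerically on a synthetic instance (NumPy assumed; the point X and tangent vector xi below are randomly generated):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 3

# A random point on St(m, n) and a random tangent vector at it:
# xi is tangent iff X^T xi + xi^T X = 0, enforced here by projection.
X, _ = np.linalg.qr(rng.standard_normal((m, n)))
Z = rng.standard_normal((m, n))
xi = Z - X @ (X.T @ Z + Z.T @ X) / 2
xi *= 0.5 / np.linalg.norm(xi)            # scale so ||xi||_F = 0.5 <= 1

def R_polar(X, xi):
    """Polar retraction: R(X, xi) = (X + xi)(I + xi^T xi)^(-1/2),
    with the inverse square root computed via an eigendecomposition."""
    lam, U = np.linalg.eigh(xi.T @ xi)
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(1.0 + lam)) @ U.T
    return (X + xi) @ inv_sqrt

Y = R_polar(X, xi)
print(np.allclose(Y.T @ Y, np.eye(n)))                          # feasibility
# Property (P) with phi = M = 1: second-order agreement with X + xi
print(np.linalg.norm(Y - (X + xi)) <= np.linalg.norm(xi) ** 2)
```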
1.2 Second-order boundedness of \(R_\mathsf{QR}\)
Let \(X\in \mathrm{St}(m,n)\) and \(\xi \in T(X)\) be arbitrary. Suppose that \(\Vert \xi \Vert _F \le 1/2\). Then, for any \(t\in [-1,1]\), the matrix \(X(t)=X+t\xi \) has full column rank and hence admits a unique thin QR-decomposition \(X(t)=Q(t)R(t)\), where \(Q(t)\in \mathrm{St}(m,n)\) and \(R(t)\in \mathbb R^{n\times n}\) are both differentiable and R(t) is upper triangular with positive diagonal entries; see, e.g., [12]. Since the unique thin QR-decomposition of X is given by \(X=XI_n\), we have \(R(0)=I_n\). This, together with the fact that \(\Vert Q(t)\Vert \le 1\), implies
To bound \(\Vert R'(t)\Vert _F\), we adopt the so-called matrix equation approach in [11, 47]. Using the identity \(R(t)^TR(t)=X(t)^TX(t)\) and the fact that \(\xi \in T(X)\) implies \(X^T\xi +\xi ^TX=\mathbf 0\), we have
Differentiating both sides of (63) with respect to t yields
In particular, since R(t) is invertible, we have
Now, observe that \(R'(t)R(t)^{-1}\) is upper triangular. Thus, the above identity implies that
where for any \(C\in \mathbb R^{n\times n}\),
Let \(\lambda _1,\ldots ,\lambda _n \ge 0\) be the eigenvalues of \(\xi ^T\xi \). Using (63) and the fact that \(2 \cdot \Vert \mathrm{up}(C)\Vert _F^2 \le \Vert C\Vert _F^2\) for any \(C\in \mathcal {S}^n\), we bound
On the other hand, we have \(\Vert R(t)\Vert \le \sqrt{1+t^2\cdot \Vert \xi \Vert ^2} \le \sqrt{5}/2\) by (63) and the assumption that \(\Vert \xi \Vert _F \le 1/2\) and \(t\in [-1,1]\). It follows that
Upon substituting this into (62) and integrating, we obtain
i.e., \(R_\mathsf{QR}\) satisfies Property (P) with \(\phi =1/2\) and \(M=\sqrt{10}/4\).
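The QR retraction and the constants \(\phi=1/2\), \(M=\sqrt{10}/4\) obtained above can likewise be checked on a synthetic instance (NumPy assumed; X and the tangent vector xi are randomly generated):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 8, 3
X, _ = np.linalg.qr(rng.standard_normal((m, n)))
Z = rng.standard_normal((m, n))
xi = Z - X @ (X.T @ Z + Z.T @ X) / 2      # tangent vector at X
xi *= 0.3 / np.linalg.norm(xi)            # ||xi||_F = 0.3 <= 1/2

def R_qr(X, xi):
    """QR retraction: the Q factor of the thin QR decomposition of X + xi,
    with column signs fixed so that R has positive diagonal (uniqueness)."""
    Q, R = np.linalg.qr(X + xi)
    return Q * np.sign(np.diag(R))

Y = R_qr(X, xi)
print(np.allclose(Y.T @ Y, np.eye(n)))     # stays on St(m, n)
# Property (P) with phi = 1/2 and M = sqrt(10)/4
print(np.linalg.norm(Y - (X + xi))
      <= np.sqrt(10) / 4 * np.linalg.norm(xi) ** 2)
```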
1.3 Second-order boundedness of \(R_\mathsf{cayley}\)
Let \(X\in \mathrm{St}(m,n)\) and \(\xi \in T(X)\) be arbitrary. Suppose that \(\Vert \xi \Vert _F \le 1/2\). Then, we have \(\Vert W(\xi )\Vert _F \le 2\cdot \Vert \xi \Vert _F \le 1\). Hence, we may write
In particular, we have
Now, observe that
where the last equality follows from the fact that \(\xi \in T(X)\). Hence, we obtain
i.e., \(R_\mathsf{cayley}\) satisfies Property (P) with \(\phi =1/2\) and \(M=4\).
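The same kind of numerical check applies to the Cayley retraction. The sketch below (NumPy assumed) uses the Wen–Yin form \(W(\xi) = P\xi X^T - X\xi^T P\) with \(P = I - XX^T/2\); the paper defines \(W(\xi)\) earlier in the text, and we only assume it agrees with this common choice. Since this W is skew-symmetric, the Cayley factor \((I-W/2)^{-1}(I+W/2)\) is orthogonal, so the retracted point stays on \(\mathrm{St}(m,n)\):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 8, 3
X, _ = np.linalg.qr(rng.standard_normal((m, n)))
Z = rng.standard_normal((m, n))
xi = Z - X @ (X.T @ Z + Z.T @ X) / 2      # tangent vector at X
xi *= 0.3 / np.linalg.norm(xi)            # ||xi||_F = 0.3 <= 1/2

# Assumed Wen-Yin form of W(xi); skew-symmetric by construction.
P = np.eye(m) - X @ X.T / 2
W = P @ xi @ X.T - X @ xi.T @ P
Y = np.linalg.solve(np.eye(m) - W / 2, (np.eye(m) + W / 2) @ X)

print(np.allclose(W, -W.T))               # W is skew, so the Cayley
print(np.allclose(Y.T @ Y, np.eye(n)))    # factor is orthogonal
print(np.linalg.norm(W) <= 2 * np.linalg.norm(xi) + 1e-12)
# Property (P) with phi = 1/2 and M = 4
print(np.linalg.norm(Y - (X + xi)) <= 4 * np.linalg.norm(xi) ** 2)
```

For this choice one also has \(WX=\xi\) whenever \(\xi\) is tangent at X, which is why the retraction agrees with \(X+\xi\) to first order.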
Proof of Proposition 10
We first establish the inequality (44). Define \(\epsilon ^{k+1} = R(X^k,-\alpha \xi ^k)-(X^k-\alpha \xi ^k) = X^{k+1} - (X^k-\alpha \xi ^k)\) for \(k=0,1,\ldots ,\Gamma -1\). Then,
Now, let us bound the terms in (64) in turn. Using the fact that \(\xi ^k\) is the orthogonal projection of \(G^k\) onto \(T(X^k)\) and \(\nabla F_i(X)=2A_iXB\), \(\nabla F(X)=2AXB\), we have
By our choice of the step size \(\alpha \), we have \(\Vert \alpha \xi ^k\Vert _F \le \phi \le 1\). It follows from Property (P) and some simple calculation that
Moreover, it is clear that
Upon substituting (65)–(68) into (64) and simplifying, we obtain
with \(c_0=(M^2+4M+1)\cdot \Vert A\Vert \cdot \Vert B\Vert \), as desired.
Next, we establish the inequality (45). Since \(\xi ^0=\mathrm{grad}\,F(X^0)=\mathrm{proj}_{T(X^0)}(\nabla F(X^0))\), where \(\mathrm{proj}_{T(X)}\) is the projector onto T(X), by the idempotence of \(\mathrm{proj}_{T(X)}\) and the fact that \(\nabla F(X)=2AXB\), we have
Upon substituting this into (44) and noting that \(c_0 \le 1/(8\alpha )\), we obtain
Since \(X^0=\tilde{X}^s\) and \(F(\tilde{X}^{s+1}) \le F(X^1)\) by lines 2 and 9 of Algorithm 2, respectively, the above inequality is equivalent to (45).
The inequality (45) shows that the sequence \(\{F(\tilde{X}^s)\}_{s\ge 0}\) is monotonically decreasing, which, together with the fact that F is bounded below on \(\mathrm{St}(m,n)\), implies that \(F(\tilde{X}^s) \searrow F^*\) for some \(F^*\in \mathbb R\). By the continuity of F, we conclude that every limit point \(X^*\) of the sequence \(\{\tilde{X}^s\}_{s\ge 0}\) satisfies \(F(X^*)=F^*\) and \(\mathrm{grad}\,F(X^*)=\mathbf 0\). This completes the proof of Proposition 10.
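The epoch structure analyzed in this proof can be sketched as follows. This is an illustrative NumPy implementation of one epoch of Stiefel-SVRG (Algorithm 2) on synthetic data (the \(A_i\) symmetric, B diagonal, both hypothetical), using \(F(X)=\mathrm{tr}(X^TAXB)\) with \(\nabla F(X)=2AXB\) as in the text, the QR retraction, and no vector transport:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, N = 6, 2, 5

# Synthetic components: A_i symmetric, A their average, B diagonal.
As = []
for _ in range(N):
    G = rng.standard_normal((m, m))
    As.append(G + G.T)
A = sum(As) / N
B = np.diag(np.arange(n, 0, -1.0))

F = lambda X: np.trace(X.T @ A @ X @ B)
grad_full = lambda X: 2 * A @ X @ B          # nabla F(X) = 2AXB
grad_i = lambda i, X: 2 * As[i] @ X @ B      # nabla F_i(X) = 2A_iXB

def proj_tangent(X, G):
    """Orthogonal projection of G onto the tangent space T(X)."""
    return G - X @ (X.T @ G + G.T @ X) / 2

def R_qr(X, xi):
    """QR retraction (any retraction with Property (P) would do)."""
    Q, R = np.linalg.qr(X + xi)
    return Q * np.sign(np.diag(R))

def svrg_epoch(X0, alpha=1e-3, Gamma=50):
    """One epoch: anchor at X0, take Gamma variance-reduced steps."""
    mu = grad_full(X0)                       # full gradient at the anchor
    X = X0
    for _ in range(Gamma):
        i = rng.integers(N)
        Gk = grad_i(i, X) - grad_i(i, X0) + mu   # G^k: variance-reduced
        xi = proj_tangent(X, Gk)                 # xi^k: Riemannian estimate
        X = R_qr(X, -alpha * xi)                 # retraction step; note that
    return X                                     # no vector transport is used

X0 = np.linalg.qr(rng.standard_normal((m, n)))[0]
X = X0
for s in range(20):
    X = svrg_epoch(X)
print(np.allclose(X.T @ X, np.eye(n)))       # iterates stay feasible
```

The update of the anchor \(\tilde{X}^{s+1}\) at the end of each epoch (line 9 of Algorithm 2) is abstracted away here; any choice satisfying \(F(\tilde{X}^{s+1}) \le F(X^1)\) fits the analysis.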
Proof of Corollary 2
Let \(\mathscr {F}_k\) be the \(\sigma \)-algebra generated by \(X^0,\ldots ,X^k\) for \(k=0,1,\ldots ,\Gamma -1\). Since \({\mathbb E}[ G^k \mid \mathscr {F}_k ] = \nabla F(X^k)\), we have \({\mathbb E}[\xi ^k \mid \mathscr {F}_k ] = \mathrm{grad}\, F(X^k) = \mathrm{proj}_{T(X^k)}(\nabla F(X^k))\). Again, using the idempotence of \(\mathrm{proj}_{T(X)}\) and the fact that \(\nabla F(X)=2AXB\), we obtain
On the other hand, the non-expansiveness of \(\mathrm{proj}_{T(X)}\) yields
and hence
By the definition of \(G^k\) and the fact that \(\nabla F_i\) (resp. \(\nabla F\)) is Lipschitz continuous with parameter \(L_{F_i} \le 2\cdot \Vert A_i\Vert \cdot \Vert B\Vert \) for \(i=1,\ldots ,N\) (resp. \(L_F\le 2\cdot \Vert A\Vert \cdot \Vert B\Vert \)), we have
with \(c'=2\left( \max _{i\in \{1,\ldots ,N\}}\Vert A_i\Vert +\Vert A\Vert \right) \Vert B\Vert \). To bound \(\Vert X^k-X^0\Vert _F\), observe that
where (73) is due to the fact that \(\Vert \alpha \xi ^k\Vert _F \le \phi \le 1\) and (74) follows from (70). This yields
where \(c_1=c'(M+1)\). In particular, we have
which implies that
It follows from (71), (72), and (76) that
This, together with (44) and (69), yields the desired result.
Proof of Proposition 11
By Proposition 10, the global error bound for Problem (QP-OC) (Corollary 1), and the fact that \(\mathrm{grad}\,F(X)=D_{1/4}(X)\), we have
for all \(s\ge 0\). Since \(F(\tilde{X}^s) \searrow F^*\), the above inequality implies the existence of \(s_0\ge 0\) such that \(\mathrm{dist}(\tilde{X}^s,\mathcal {X}) \le \delta /3\) for all \(s\ge s_0\), where \(\delta \in (0,\sqrt{2}/2)\) is the constant given in Theorem 1. Now, consider a fixed \(s\ge s_0\) and let \(\hat{X}^s,\hat{X}^{s+1}\in \mathcal {X}\) be such that \(\mathrm{dist}(\tilde{X}^s,\mathcal {X}) = \Vert \tilde{X}^s-\hat{X}^s\Vert _F\) and \(\mathrm{dist}(\tilde{X}^{s+1},\mathcal {X}) = \Vert \tilde{X}^{s+1}-\hat{X}^{s+1}\Vert _F\). Suppose that \(\hat{X}^s\in \mathcal {X}_{h,\Pi }\) and \(\hat{X}^{s+1}\in \mathcal {X}_{h',\Pi '}\) with \(\mathcal {X}_{h,\Pi }\cap \mathcal {X}_{h',\Pi '}=\emptyset \). Then, we have \(\Vert \hat{X}^s-\hat{X}^{s+1}\Vert _F \ge \sqrt{2} \ge 2\delta \) by Proposition 4. On the other hand, using (75), the fact that \(\Vert \mathrm{grad}\,F(X)\Vert _F \le \Vert \nabla F(X)\Vert _F \le 2\cdot \Vert A\Vert _F\cdot \Vert B\Vert \) for all \(X\in \mathrm{St}(m,n)\), and our choice of the step size \(\alpha \), the sequence \(X^0=\tilde{X}^s,X^1,\ldots ,X^\Gamma \) generated by Algorithm 2 in epoch s satisfies
for \(k=0,1,\ldots ,\Gamma -1\). This implies that
which is a contradiction. Hence, we have \(\mathcal {X}_{h,\Pi }\cap \mathcal {X}_{h',\Pi '}\not =\emptyset \), which by Proposition 4 yields \(\mathcal {X}_{h,\Pi }=\mathcal {X}_{h',\Pi '}\). Consequently, we have \(\mathrm{dist}(\tilde{X}^s,\mathcal {X}) = \mathrm{dist}(\tilde{X}^s,\mathcal {X}_{h,\Pi }) \le \delta /3\) for all sufficiently large \(s\ge 0\). This, together with Proposition 10 and the fact that the function F is constant on \(\mathcal {X}_{h,\Pi }\), implies that every limit point of the sequence \(\{\tilde{X}^s\}_{s\ge 0}\) belongs to \(\mathcal {X}_{h,\Pi }\) and \(F(X)=F^*\) for all \(X\in \mathcal {X}_{h,\Pi }\). This completes the proof.
Cite this article
Liu, H., So, A.M.-C. & Wu, W. Quadratic optimization with orthogonality constraint: explicit Łojasiewicz exponent and linear convergence of retraction-based line-search and stochastic variance-reduced gradient methods. Math. Program. 178, 215–262 (2019). https://doi.org/10.1007/s10107-018-1285-1
Keywords
- Quadratic optimization with orthogonality constraints
- Łojasiewicz inequality
- Line-search methods
- Stochastic variance-reduced gradient method
- Linear convergence