
Quadratic optimization with orthogonality constraint: explicit Łojasiewicz exponent and linear convergence of retraction-based line-search and stochastic variance-reduced gradient methods

  • Full Length Paper
  • Series A

Mathematical Programming

Abstract

The problem of optimizing a quadratic form over an orthogonality constraint (QP-OC for short) is one of the most fundamental matrix optimization problems and arises in many applications. In this paper, we characterize the growth behavior of the objective function around the critical points of the QP-OC problem and demonstrate how such characterization can be used to obtain strong convergence rate results for iterative methods that exploit the manifold structure of the orthogonality constraint (i.e., the Stiefel manifold) to find a critical point of the problem. Specifically, our primary contribution is to show that the Łojasiewicz exponent at any critical point of the QP-OC problem is 1 / 2. Such a result is significant, as it expands the currently very limited repertoire of optimization problems for which the Łojasiewicz exponent is explicitly known. Moreover, it allows us to show, in a unified manner and for the first time, that a large family of retraction-based line-search methods will converge linearly to a critical point of the QP-OC problem. Then, as our secondary contribution, we propose a stochastic variance-reduced gradient (SVRG) method called Stiefel-SVRG for solving the QP-OC problem and present a novel Łojasiewicz inequality-based linear convergence analysis of the method. An important feature of Stiefel-SVRG is that it allows for general retractions and does not require the computation of any vector transport on the Stiefel manifold. As such, it is computationally more advantageous than other recently-proposed SVRG-type algorithms for manifold optimization.


Notes

  1. Such an assumption is omitted in the original text of [3] but is needed for the result in [3, Section 4.8.2] to hold. The omission is corrected in the online errata at https://sites.uclouvain.be/absil/amsbook/errata.html.

  2. That is, there exist constants \(r_0>0\), \(r_1\in (0,1)\) and index \(K\ge 0\) such that \(\Vert X^k-X^*\Vert _F \le r_0r_1^k\) for all \(k\ge K\).

  3. Stiefel-SVRG was first presented by the second author at the 13th Chinese Workshop on Machine Learning and Applications held in Nanjing, China in 2015 [45]. As such, it predates the SVRG methods for manifold optimization developed in [37, 58]. More importantly, Stiefel-SVRG does not require the computation of any vector transport, which makes it computationally more advantageous than the SVRG methods proposed in [37, 58].

References

  1. Abrudan, T.E., Eriksson, J., Koivunen, V.: Steepest descent algorithms for optimization under unitary matrix constraint. IEEE Trans. Signal Process. 56(3), 1134–1147 (2008)

  2. Absil, P.-A., Mahony, R., Andrews, B.: Convergence of the iterates of descent methods for analytic cost functions. SIAM J. Optim. 16(2), 531–547 (2005)

  3. Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2008)

  4. Absil, P.-A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22(1), 135–158 (2012)

  5. Agarwal, A., Anandkumar, A., Jain, P., Netrapalli, P.: Learning sparsely used overcomplete dictionaries via alternating minimization. SIAM J. Optim. 26(4), 2775–2799 (2016)

  6. Bolla, M., Michaletzky, G., Tusnády, G., Ziermann, M.: Extrema of sums of heterogeneous quadratic forms. Linear Algebra Appl. 269(1–3), 331–365 (1998)

  7. Bolte, J., Daniilidis, A., Ley, O., Mazet, L.: Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity. Trans. Am. Math. Soc. 362(6), 3319–3363 (2010)

  8. Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. Ser. A 165(2), 471–507 (2017)

  9. Bonnabel, S.: Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Autom. Control 58(9), 2217–2229 (2013)

  10. Candès, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inf. Theory 61(4), 1985–2007 (2015)

  11. Chang, X.-W., Paige, C.C., Stewart, G.W.: Perturbation analyses for the QR factorization. SIAM J. Matrix Anal. Appl. 18(3), 775–791 (1997)

  12. Dieci, L., Eirola, T.: On smooth decompositions of matrices. SIAM J. Matrix Anal. Appl. 20(3), 800–819 (1999)

  13. Feehan, P.M.N.: Global existence and convergence of solutions to gradient systems and applications to Yang–Mills gradient flow. Monograph. arxiv.org/abs/1409.1525 (2014)

  14. Forti, M., Nistri, P., Quincampoix, M.: Convergence of neural networks for programming problems via a nonsmooth Łojasiewicz inequality. IEEE Trans. Neural Netw. 17(6), 1471–1486 (2006)

  15. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore, MD (1996)

  16. Hardt, M.: Understanding alternating minimization for matrix completion. In: Proceedings of the 55th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2014), pp. 651–660 (2014)

  17. Hou, K., Zhou, Z., So, A. M.-C., Luo, Z.-Q.: On the linear convergence of the proximal gradient method for trace norm regularization. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26: Proceedings of the 2013 Conference, pp. 710–718 (2013)

  18. Jain, P., Oh, S.: Provable tensor factorization with missing data. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27: Proceedings of the 2014 Conference, pp. 1431–1439 (2014)

  19. Jiang, B., Dai, Y.-H.: A framework of constraint preserving update schemes for optimization on Stiefel manifold. Math. Program. Ser. A 153(2), 535–575 (2015)

  20. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26: Proceedings of the 2013 Conference, pp. 315–323 (2013)

  21. Kaneko, T., Fiori, S., Tanaka, T.: Empirical arithmetic averaging over the compact Stiefel manifold. IEEE Trans. Signal Process. 61(4), 883–894 (2013)

  22. Kokiopoulou, E., Chen, J., Saad, Y.: Trace optimization and eigenproblems in dimension reduction methods. Numer. Linear Algebra Appl. 18(3), 565–602 (2011)

  23. Li, G., Mordukhovich, B.S., Phạm, T.S.: New fractional error bounds for polynomial systems with applications to Hölderian stability in optimization and spectral theory of tensors. Math. Program. Ser. A 153(2), 333–362 (2015)

  24. Li, G., Pong, T.K.: Calculus of the exponent of Kurdyka–Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. (2017)

  25. Liu, H., Wu, W., So, A.M.-C.: Quadratic optimization with orthogonality constraints: explicit Łojasiewicz exponent and linear convergence of line-search methods. In: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp. 1158–1167 (2016)

  26. Liu, H., Yue, M.-C., So, A.M.-C.: On the estimation performance and convergence rate of the generalized power method for phase synchronization. SIAM J. Optim. 27(4), 2426–2446 (2017)

  27. Luo, Z.-Q.: New error bounds and their applications to convergence analysis of iterative algorithms. Math. Program. Ser. B 88(2), 341–355 (2000)

  28. Luo, Z.-Q., Pang, J.-S.: Error bounds for analytic systems and their applications. Math. Program. 67(1), 1–28 (1994)

  29. Luo, Z.-Q., Sturm, J.F.: Error bounds for quadratic systems. In: Frenk, H., Roos, K., Terlaky, T., Zhang, S. (eds.) High Performance Optimization, Volume 33 of Applied Optimization, pp. 383–404. Springer, Dordrecht (2000)

  30. Luo, Z.-Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46(1), 157–178 (1993)

  31. Manton, J.H.: Optimization algorithms exploiting unitary constraints. IEEE Trans. Signal Process. 50(3), 635–650 (2002)

  32. Merlet, B., Nguyen, T.N.: Convergence to equilibrium for discretizations of gradient-like flows on Riemannian manifolds. Differ. Integral Equ. 26(5–6), 571–602 (2013)

  33. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Boston (2004)

  34. Netrapalli, P., Jain, P., Sanghavi, S.: Phase retrieval using alternating minimization. IEEE Trans. Signal Process. 63(18), 4814–4826 (2015)

  35. Saad, Y.: Numerical Methods for Large Eigenvalue Problems. Classics in Applied Mathematics, revised edition. Society for Industrial and Applied Mathematics, Philadelphia (2011)

  36. Sato, H., Iwai, T.: A Riemannian optimization approach to the matrix singular value decomposition. SIAM J. Optim. 23(1), 188–212 (2013)

  37. Sato, H., Kasai, H., Mishra, B.: Riemannian stochastic variance reduced gradient. Manuscript, arxiv.org/abs/1702.05594 (2017)

  38. Schneider, R., Uschmajew, A.: Convergence results for projected line-search methods on varieties of low-rank matrices via Łojasiewicz inequality. SIAM J. Optim. 25(1), 622–646 (2015)

  39. Schönemann, P.H.: A generalized solution of the orthogonal Procrustes problem. Psychometrika 31(1), 1–10 (1966)

  40. Schönemann, P.H.: On two-sided orthogonal Procrustes problems. Psychometrika 33(1), 19–33 (1968)

  41. Shamir, O.: A stochastic PCA and SVD algorithm with an exponential convergence rate. In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 144–152 (2015)

  42. Shamir, O.: Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp. 248–256 (2016)

  43. Smith, S.T.: Optimization techniques on Riemannian manifolds. In: Bloch, A. (ed.) Hamiltonian and Gradient Flows, Algorithms and Control. Fields Institute Communications, pp. 113–136. American Mathematical Society, Providence (1994)

  44. So, A.M.-C.: Moment inequalities for sums of random matrices and their applications in optimization. Math. Program. Ser. A 130(1), 125–151 (2011)

  45. So, A.M.-C.: Pinning down the Łojasiewicz exponent: towards understanding the convergence behavior of first-order methods for structured non-convex optimization problems. Slides. http://lamda.nju.edu.cn/conf/mla15/files/suwz.pdf (2015)

  46. So, A.M.-C., Zhou, Z.: Non-asymptotic convergence analysis of inexact gradient methods for machine learning without strong convexity. Optim. Methods Softw. 32(4), 963–992 (2017)

  47. Sun, J.: On perturbation bounds for the QR factorization. Linear Algebra Appl. 215, 95–111 (1995)

  48. Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. Found. Comput. Math. (2017)

  49. Sun, J., Qu, Q., Wright, J.: Complete dictionary recovery over the sphere I: overview and the geometric picture. IEEE Trans. Inf. Theory 63(2), 853–884 (2017)

  50. Sun, J., Qu, Q., Wright, J.: Complete dictionary recovery over the sphere II: recovery by Riemannian trust-region method. IEEE Trans. Inf. Theory 63(2), 885–914 (2017)

  51. Sun, R., Luo, Z.-Q.: Guaranteed matrix completion via non-convex factorization. IEEE Trans. Inf. Theory 62(11), 6535–6579 (2016)

  52. Sun, W.W., Lu, J., Liu, H., Cheng, G.: Provable sparse tensor decomposition. J. R. Stat. Soc. B 79(3), 899–916 (2017)

  53. Udrişte, C.: Convex Functions and Optimization Methods on Riemannian Manifolds, Volume 297 of Mathematics and Its Applications. Springer, Dordrecht (1994)

  54. Uschmajew, A.: A new convergence proof for the higher-order power method and generalizations. Pac. J. Optim. 11(2), 309–321 (2015)

  55. Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. Ser. A 142(1–2), 397–434 (2013)

  56. Yang, Y.: Globally convergent optimization algorithms on Riemannian manifolds: uniform framework for unconstrained and constrained optimization. J. Optim. Theory Appl. 132(2), 245–265 (2007)

  57. Yger, F., Berar, M., Gasso, G., Rakotomamonjy, A.: Adaptive canonical correlation analysis based on matrix manifolds. In: Proceedings of the 29th International Conference on Machine Learning (ICML 2012), pp. 1071–1078 (2012)

  58. Zhang, H., Reddi, S. J., Sra, S.: Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29: Proceedings of the 2016 Conference, pp. 4592–4600 (2016)

  59. Zhang, H., Sra, S.: First-order methods for geodesically convex optimization. In Feldman, V., Rakhlin, A., Shamir, O. (eds.) Proceedings of the 29th Annual Conference on Learning Theory (COLT 2016), Volume 49 of Proceedings of Machine Learning Research, pp. 1617–1638 (2016)

  60. Zheng, Q., Lafferty, J.: A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28: Proceedings of the 2015 Conference, pp. 109–117 (2015)

  61. Zhong, Y., Boumal, N.: Near-optimal bounds for phase synchronization. SIAM J. Optim. 28(2), 989–1016 (2018)

  62. Zhou, Z., So, A.M.-C.: A unified approach to error bounds for structured convex optimization problems. Math. Program. Ser. A 165(2), 689–728 (2017)

  63. Zhou, Z., Zhang, Q., So, A.M.-C.: \(\ell _{1,p}\)-norm regularization: error bounds and convergence rate analysis of first-order methods. In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 1501–1510 (2015)


Acknowledgements

We thank the associate editor for coordinating the review of our manuscript and the anonymous reviewer for his/her detailed comments.

Author information

Corresponding author

Correspondence to Anthony Man-Cho So.

Additional information

A preliminary version of this work appeared in the Proceedings of the 33rd International Conference on Machine Learning (ICML 2016) [25]. This research is supported in part by the Hong Kong Research Grants Council (RGC) General Research Fund (GRF) Projects CUHK 14205314, CUHK 14206814, and CUHK 14208117.

Appendices

Appendix

Proof of Proposition 4

Observe that given any \(X\in \mathcal {X}_{h,\Pi }\), we can write

$$\begin{aligned} \mathcal {X}_{h,\Pi }&= \left\{ \left. \mathrm{BlkDiag}(P_1,\ldots ,P_{n_A}) \cdot X \cdot \mathrm{BlkDiag}\left( Q_1^T,\ldots ,Q_{n_B}^T \right) \,\right| \, \right. \\&\qquad \left. P_i \in \mathcal {O}^{s_i-s_{i-1}} \text{ for } i=1,\ldots ,n_A; \, Q_j \in \mathcal {O}^{t_j-t_{j-1}} \text{ for } j=1,\ldots ,n_B \right\} . \end{aligned}$$

Thus, if \(X \in \mathcal {X}_{h,\Pi } \cap \mathcal {X}_{h',\Pi '}\), then

$$\begin{aligned} \mathrm{BlkDiag}(P_1,\ldots ,P_{n_A}) \cdot X \cdot \mathrm{BlkDiag}\left( Q_1^T,\ldots ,Q_{n_B}^T \right) \in \mathcal {X}_{h,\Pi } \cap \mathcal {X}_{h',\Pi '} \end{aligned}$$

for any \(P_i \in \mathcal {O}^{s_i-s_{i-1}}\) (\(i=1,\ldots ,n_A\)) and \(Q_j \in \mathcal {O}^{t_j-t_{j-1}}\) (\(j=1,\ldots ,n_B\)). This implies that \(\mathcal {X}_{h,\Pi } = \mathcal {X}_{h',\Pi '}\).

Now, suppose that \(\mathcal {X}_{h,\Pi } \cap \mathcal {X}_{h',\Pi '}=\emptyset \). Let \(X\in \mathcal {X}_{h,\Pi }\) and \(X'\in \mathcal {X}_{h',\Pi '}\) be arbitrary. Then, there exist \(P_i \in \mathcal {O}^{s_i-s_{i-1}}\) (\(i=1,\ldots ,n_A\)) and \(Q_j \in \mathcal {O}^{t_j-t_{j-1}}\) (\(j=1,\ldots ,n_B\)) such that

$$\begin{aligned}&\Vert X-X'\Vert _F^2 \nonumber \\&\quad = \left\| E(h')\Pi ' - \mathrm{BlkDiag}\left( P_1,\ldots ,P_{n_A} \right) \cdot E(h) \cdot \Pi \cdot \mathrm{BlkDiag}\left( Q_1^T,\ldots ,Q_{n_B}^T \right) \right\| _F^2. \end{aligned}$$
(48)

Consider the following block decomposition of \(E(h)\Pi \) (and similarly for \(E(h')\Pi '\)):

$$\begin{aligned} E(h)\Pi = \begin{bmatrix} E_{1,1}(h,\Pi )&\quad \cdots&\quad E_{1,n_B}(h,\Pi ) \\ \vdots&\quad \ddots&\quad \vdots \\ E_{n_A,1}(h,\Pi )&\quad \cdots&\quad E_{n_A,n_B}(h,\Pi ) \end{bmatrix} \end{aligned}$$

where \(E_{i,j}(h,\Pi )\in \mathbb R^{(s_i-s_{i-1})\times (t_j-t_{j-1})}\) for \(i=1,\ldots ,n_A\) and \(j=1,\ldots ,n_B\). Let \(|E_{i,j}(h,\Pi )|\) be the number of ones in \(E_{i,j}(h,\Pi )\). We then have two cases:

Case 1. There exist \(i\in \{1,\ldots ,n_A\}\) and \(j\in \{1,\ldots ,n_B\}\) such that \(|E_{i,j}(h,\Pi )| \not = |E_{i,j}(h',\Pi ')|\).

It can be seen from (11) that for any \(u\in \mathcal {H}\), every column of E(u) has exactly one 1. Hence, for any \(u\in \mathcal {H}\) and \(\Phi \in \mathcal {P}^n\), every column of \(E(u)\Phi \) also has exactly one 1. In particular, we have

$$\begin{aligned} \sum _{k=1}^{n_A} |E_{k,j}(h,\Pi )| = \sum _{k=1}^{n_A} |E_{k,j}(h',\Pi ')| = t_j-t_{j-1}, \end{aligned}$$

which implies that \(|E_{i',j}(h,\Pi )| \not = |E_{i',j}(h',\Pi ')|\) for some \(i'\in \{1,\ldots ,n_A\}{\setminus }\{i\}\). Now, we compute

$$\begin{aligned} \Vert X-X'\Vert _F^2\ge & {} \left\| E_{i,j}(h',\Pi ')-P_iE_{i,j}(h,\Pi )Q_j^T \right\| _F^2 \nonumber \\&+ \left\| E_{i',j}(h',\Pi ')-P_{i'}E_{i',j}(h,\Pi )Q_j^T \right\| _F^2 \nonumber \\\ge & {} \min _{\begin{array}{c} P \in \mathcal {O}^{s_i-s_{i-1}} \\ Q \in \mathcal {O}^{t_j-t_{j-1}} \end{array}} \left\| E_{i,j}(h',\Pi ')-PE_{i,j}(h,\Pi )Q^T \right\| _F^2 \nonumber \\&+\, \min _{\begin{array}{c} P \in \mathcal {O}^{s_{i'}-s_{i'-1}} \\ Q \in \mathcal {O}^{t_j-t_{j-1}} \end{array}} \left\| E_{i',j}(h',\Pi ')-PE_{i',j}(h,\Pi )Q^T \right\| _F^2. \end{aligned}$$
(49)

Both terms in (49) are instances of the two-sided orthogonal Procrustes problem and admit the following characterization [40]:

$$\begin{aligned}&\min _{\begin{array}{c} P \in \mathcal {O}^{s_i-s_{i-1}} \\ Q \in \mathcal {O}^{t_j-t_{j-1}} \end{array}} \left\| E_{i,j}(h',\Pi ')-PE_{i,j}(h,\Pi )Q^T \right\| _F^2\\&\quad = \sum _{k=1}^K \left( \sigma _k(E_{i,j}(h',\Pi ')) - \sigma _k(E_{i,j}(h,\Pi )) \right) ^2, \\&\min _{\begin{array}{c} P \in \mathcal {O}^{s_{i'}-s_{i'-1}} \\ Q \in \mathcal {O}^{t_j-t_{j-1}} \end{array}} \left\| E_{i',j}(h',\Pi ')-PE_{i',j}(h,\Pi )Q^T \right\| _F^2 \\&\quad = \sum _{k=1}^{K'} \left( \sigma _k(E_{i',j}(h',\Pi ')) - \sigma _k(E_{i',j}(h,\Pi )) \right) ^2. \end{aligned}$$

Here, \(K=\min \{s_i-s_{i-1},t_j-t_{j-1}\}\), \(K'=\min \{s_{i'}-s_{i'-1},t_j-t_{j-1}\}\), and \(\sigma _k(Y)\) is the kth largest singular value of Y. Observe that for any \(\alpha \in \{1,\ldots ,n_A\}\), \(\beta \in \{1,\ldots ,n_B\}\), \(u\in \mathcal {H}\), and \(\Phi \in \mathcal {P}^n\), every non-zero row and every non-zero column of \(E_{\alpha ,\beta }(u,\Phi )\) has exactly one 1. It follows that the singular values of \(E_{\alpha ,\beta }(u,\Phi )\) are either 0 or 1, and there are \(|E_{\alpha ,\beta }(u,\Phi )|\) of the latter. Since \(|E_{i,j}(h,\Pi )| \not = |E_{i,j}(h',\Pi ')|\) and \(|E_{i',j}(h,\Pi )| \not = |E_{i',j}(h',\Pi ')|\), we conclude from (49) that \(\Vert X-X'\Vert _F^2 \ge 2\).
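The two-sided orthogonal Procrustes characterization from [40] that underlies (49) can be checked numerically. The following sketch (in Python with NumPy; the dimensions and the random matrices standing in for the 0–1 blocks \(E_{i,j}\) are illustrative) evaluates the closed-form value \(\sum _k (\sigma _k(A)-\sigma _k(B))^2\), verifies that it is attained at \(P=U_AU_B^T\), \(Q=V_AV_B^T\) built from full singular value decompositions of A and B, and confirms that random orthogonal pairs do not improve on it.

```python
# A minimal numerical sketch of the two-sided orthogonal Procrustes
# characterization [40] used in (49); the dimensions and the random matrices
# standing in for the blocks E_{i,j} are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
p, q = 5, 3
A = rng.standard_normal((p, q))
B = rng.standard_normal((p, q))

# Closed-form optimal value: sum_k (sigma_k(A) - sigma_k(B))^2.
sA = np.linalg.svd(A, compute_uv=False)
sB = np.linalg.svd(B, compute_uv=False)
v_closed = np.sum((sA - sB) ** 2)

# Minimizers P = U_A U_B^T and Q = V_A V_B^T from full SVDs A = U_A S_A V_A^T,
# B = U_B S_B V_B^T attain this value.
UA, _, VAt = np.linalg.svd(A)
UB, _, VBt = np.linalg.svd(B)
P, Q = UA @ UB.T, VAt.T @ VBt
v_attained = np.linalg.norm(A - P @ B @ Q.T, "fro") ** 2

def random_orthogonal(k):
    M, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return M

# Random orthogonal pairs (P, Q) should never beat the closed-form value.
v_random = min(
    np.linalg.norm(A - random_orthogonal(p) @ B @ random_orthogonal(q).T, "fro") ** 2
    for _ in range(200)
)

print(v_closed, v_attained, v_random)
assert np.isclose(v_closed, v_attained) and v_random >= v_closed - 1e-10
```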

Case 2. \(|E_{i,j}(h,\Pi )| = |E_{i,j}(h',\Pi ')|\) for \(i=1,\ldots ,n_A\) and \(j=1,\ldots ,n_B\).

We show that \(X=X'\) in this case, which would then contradict the assumption that \(\mathcal {X}_{h,\Pi } \cap \mathcal {X}_{h',\Pi '}=\emptyset \). To begin, let \(i\in \{1,\ldots ,n_A\}\) be arbitrary and consider the ith block row of \(E(h)\Pi \) and \(E(h')\Pi '\); i.e.,

$$\begin{aligned} \mathrm{BlkRow}_i(E(h)\Pi )= & {} \left[ E_{i,1}(h,\Pi ) \, \cdots \, E_{i,n_B}(h,\Pi ) \right] , \\ \mathrm{BlkRow}_i \left( E(h')\Pi ' \right)= & {} \left[ E_{i,1}(h',\Pi ') \, \cdots \, E_{i,n_B}(h',\Pi ') \right] . \end{aligned}$$

By (11), every non-zero row of \(\mathrm{BlkRow}_i(E(h)\Pi )\) and \(\mathrm{BlkRow}_i \left( E(h')\Pi ' \right) \) has exactly one 1. Moreover, we have \(|E_{i,j}(h,\Pi )| = |E_{i,j}(h',\Pi ')|\) for \(j=1,\ldots ,n_B\) by assumption. Hence, we can find permutation matrices \(\Phi _{i,1},\Phi _{i,2},\ldots ,\Phi _{i,n_B}\in \mathcal {P}^{s_i-s_{i-1}}\) such that for \(j=1,\ldots ,n_B\),

  1. (i)

    the indices of the rows of \(\Phi _{i,j} \left( \Phi _{i,j-1}\Phi _{i,j-2}\cdots \Phi _{i,1}E_{i,j}(h,\Pi ) \right) \) that contain a 1 are the same as those of \(E_{i,j}(h',\Pi ')\) that contain a 1 (i.e., the kth row of \(\Phi _{i,j} \left( \Phi _{i,j-1}\Phi _{i,j-2}\cdots \Phi _{i,1}E_{i,j}(h,\Pi ) \right) \) contains a 1 if and only if the kth row of \(E_{i,j}(h',\Pi ')\) contains a 1, where \(k\in \{1,\ldots ,s_i-s_{i-1}\}\));

  2. (ii)

    the indices of the rows of \(\left( \Phi _{i,j-1}\Phi _{i,j-2}\cdots \Phi _{i,1}\right) \left[ E_{i,1}(h,\Pi ) \, \cdots \, E_{i,j-1}(h,\Pi ) \right] \) that contain a 1 are fixed by \(\Phi _{i,j}\) (i.e., if the kth row of \(\left( \Phi _{i,j-1}\Phi _{i,j-2}\cdots \Phi _{i,1}\right) \left[ E_{i,1}(h,\Pi ) \, \cdots \, E_{i,j-1}(h,\Pi ) \right] \) contains a 1, then \(\Phi _{i,j}e_k=e_k\), where \(e_k\) is the kth standard basis vector of \(\mathbb R^{s_i-s_{i-1}}\) and \(k\in \{1,\ldots ,s_i-s_{i-1}\}\)).

Upon letting \(\Phi _i = \Phi _{i,n_B}\Phi _{i,n_B-1}\cdots \Phi _{i,1} \in \mathcal {P}^{s_i-s_{i-1}}\) and using properties (i) and (ii) above, we see that the indices of the rows of \(\Phi _iE_{i,j}(h,\Pi )\) that contain a 1 are the same as those of \(E_{i,j}(h',\Pi ')\) that contain a 1 for \(j=1,\ldots ,n_B\).

Next, let \(j\in \{1,\ldots ,n_B\}\) be arbitrary and consider the jth block column of \(\mathrm{BlkDiag}(\Phi _1,\ldots ,\Phi _{n_A})\cdot E(h) \cdot \Pi \) and \(E(h')\Pi '\); i.e.,

$$\begin{aligned} \mathrm{BlkCol}_j \left( \mathrm{BlkDiag}(\Phi _1,\ldots ,\Phi _{n_A}) \cdot E(h) \cdot \Pi \right)= & {} \left[ \begin{array}{c} \Phi _1E_{1,j}(h,\Pi ) \\ \vdots \\ \Phi _{n_A}E_{n_A,j}(h,\Pi ) \end{array} \right] , \\ \mathrm{BlkCol}_j \left( E(h')\Pi ' \right)= & {} \left[ \begin{array}{c} E_{1,j}(h',\Pi ') \\ \vdots \\ E_{n_A,j}(h',\Pi ') \end{array} \right] . \end{aligned}$$

By (11), each column of \(\mathrm{BlkCol}_j \left( \mathrm{BlkDiag}(\Phi _1,\ldots ,\Phi _{n_A}) \cdot E(h) \cdot \Pi \right) \) and \(\mathrm{BlkCol}_j \left( E(h')\Pi ' \right) \) has exactly one 1. Since \(|E_{i,j}(h,\Pi )| = |E_{i,j}(h',\Pi ')|\) for \(i=1,\ldots ,n_A\) by assumption, we have \(|\Phi _iE_{i,j}(h,\Pi )| = |E_{i,j}(h',\Pi ')|\). Moreover, by the definition of \(\Phi _1,\ldots ,\Phi _{n_A}\), the indices of the rows of \(\Phi _iE_{i,j}(h,\Pi )\) that contain a 1 are the same as those of \(E_{i,j}(h',\Pi ')\) that contain a 1. Thus, there exists a permutation matrix \(\Psi _j\in \mathcal {P}^{t_j-t_{j-1}}\) such that

$$\begin{aligned} \mathrm{BlkCol}_j \left( E(h')\Pi ' \right) = \mathrm{BlkCol}_j \left( \mathrm{BlkDiag}(\Phi _1,\ldots ,\Phi _{n_A}) \cdot E(h) \cdot \Pi \right) \cdot \Psi _j. \end{aligned}$$

In particular, we obtain

$$\begin{aligned} E(h')\Pi ' = \mathrm{BlkDiag}(\Phi _1,\ldots ,\Phi _{n_A}) \cdot E(h)\cdot \Pi \cdot \mathrm{BlkDiag}(\Psi _1,\ldots ,\Psi _{n_B}). \end{aligned}$$

Since a permutation matrix is also an orthogonal matrix, we conclude from (48) that \(\Vert X-X'\Vert _F^2=0\), or equivalently, \(X=X'\), as desired.

Proof of Proposition 5

Using (17) and (18), it can be verified that

$$\begin{aligned} \mathrm{dist}^2(X,\mathcal {X}_{h,\Pi })= & {} \left\| \bar{X} - E(h) \Pi \right\| _F^2 \\= & {} \min \left\{ \left. \left\| \bar{X} - E(h) \cdot \Pi \cdot \mathrm{BlkDiag}\left( Q_1^T,\ldots ,Q_{n_B}^T \right) \right\| _F^2 \,\right| \, \right. \\&\qquad \quad \left. Q_j \in \mathcal {O}^{t_j-t_{j-1}} \text{ for } j=1,\ldots ,n_B \right\} \\= & {} \sum _{j=1}^{n_B} \min \left\{ \left. \left\| \bar{X}_j - \bar{E}_j(h)Q_j^T \right\| _F^2 \,\right| \, Q_j \in \mathcal {O}^{t_j-t_{j-1}} \right\} . \end{aligned}$$

Since, up to a permutation of the rows, \(\bar{E}_j(h)\) takes the form (19), in order to obtain the desired bound on \(\mathrm{dist}^2(X,\mathcal {X}_{h,\Pi })\) it remains to prove the following:

Lemma 1

Let \(S = \begin{bmatrix} S_1\\S_2 \end{bmatrix} \in \mathrm{St}(p,q)\) be given, with \(S_1 \in \mathbb R^{q\times q}\) and \(S_2 \in \mathbb R^{(p-q)\times q}\). Consider the following problem:

$$\begin{aligned} v^* = \min \left\{ \left. \left\| S - \begin{bmatrix} I_q\\ \mathbf 0\end{bmatrix} X \right\| _F^2 \,\right| \, X \in \mathcal {O}^q \right\} . \end{aligned}$$

Suppose that \(v^*<1\). Then, we have \(\Vert S_2\Vert _F^2 \le v^* \le 2\Vert S_2\Vert _F^2\).

Proof

Since

$$\begin{aligned} \left\| S - \begin{bmatrix} I_q\\\mathbf 0\end{bmatrix} X \right\| _F^2 = \Vert S_1-X\Vert _F^2 + \Vert S_2\Vert _F^2, \end{aligned}$$

it suffices to consider the problem

$$\begin{aligned} \min \left\{ \Vert S_1 - X\Vert _F^2 \mid X \in \mathcal {O}^q \right\} . \end{aligned}$$
(50)

Problem (50) is an instance of the orthogonal Procrustes problem, whose optimal solution is given by \(X^*=UV^T\), where \(S_1=U\Sigma V^T\) is the singular value decomposition of \(S_1\) [39]. It follows that

$$\begin{aligned} v^* = \Vert \Sigma -I_q\Vert _F^2 + \Vert S_2\Vert _F^2. \end{aligned}$$

Now, since \(S\in \mathrm{St}(p,q)\), we have \(S^TS = S_1^TS_1 + S_2^TS_2 = I_q\), or equivalently,

$$\begin{aligned} \Sigma ^2 + V^TS_2^TS_2V = I_q. \end{aligned}$$

This implies that \(\mathbf 0\preceq \Sigma \preceq I_q\) and

$$\begin{aligned} I_q - \Sigma = (I_q+\Sigma )^{-1} \left( V^TS_2^TS_2V \right) . \end{aligned}$$

It follows that

$$\begin{aligned} \frac{1}{4}\Vert S_2\Vert _F^4 + \Vert S_2\Vert _F^2 \le v^* \le \Vert S_2\Vert _F^4 + \Vert S_2\Vert _F^2. \end{aligned}$$

This, together with the fact that \(\Vert S_2\Vert _F^2 \le v^* < 1\), yields the desired result. \(\square \)
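Lemma 1 can be probed numerically as follows. In the sketch below (illustrative dimensions and perturbation size), a point \(S\in \mathrm{St}(p,q)\) is generated close to \(\begin{bmatrix} I_q\\ \mathbf 0\end{bmatrix}\) so that \(v^*<1\), the value \(v^*\) is computed via the orthogonal Procrustes solution \(X^*=UV^T\) with \(S_1=U\Sigma V^T\) [39], and the bounds \(\Vert S_2\Vert _F^2 \le v^* \le 2\Vert S_2\Vert _F^2\) are checked.

```python
# A minimal numerical sketch of Lemma 1: the point S in St(p,q) is generated
# close to [I_q; 0] (so that v* < 1); dimensions and perturbation size are
# illustrative.
import numpy as np

rng = np.random.default_rng(1)
p, q, eps = 8, 3, 0.1
E = np.vstack([np.eye(q), np.zeros((p - q, q))])            # the matrix [I_q; 0]
Qf, Rf = np.linalg.qr(E + eps * rng.standard_normal((p, q)))
S = Qf * np.sign(np.diag(Rf))                               # S in St(p,q), near [I_q; 0]
S1, S2 = S[:q, :], S[q:, :]

# v* via the orthogonal Procrustes solution X* = U V^T, where S1 = U Sigma V^T [39].
U, _, Vt = np.linalg.svd(S1)
v_star = np.linalg.norm(S - E @ (U @ Vt), "fro") ** 2
s2 = np.linalg.norm(S2, "fro") ** 2

print(s2, v_star, 2 * s2)
assert v_star < 1 and s2 - 1e-12 <= v_star <= 2 * s2 + 1e-12
```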

Proof of Proposition 6

Recall that

$$\begin{aligned} P^*= & {} \mathrm{BlkDiag}\left( P_1^*,\ldots ,P_{n_A}^* \right) \in \mathcal {O}^m, \\ Q^*= & {} \mathrm{BlkDiag}\left( Q_1^*,\ldots ,Q_{n_B}^* \right) \in \mathcal {O}^n, \\ \bar{X}= & {} (P^*)^TXQ^*. \end{aligned}$$

Upon observing that \(AP^*=P^*A\), \(BQ^*=Q^*B\) and using (9), (18), we compute

$$\begin{aligned}&\left\| AXB - XBX^TAX \right\| _F^2\nonumber \\&\quad = \left\| AP^*\bar{X}(Q^*)^TB - P^*\bar{X}(Q^*)^TBQ^*\bar{X}^T(P^*)^TAP^*\bar{X}(Q^*)^T \right\| _F^2 \nonumber \\&\quad = \left\| P^*\left( A\bar{X}B-\bar{X}B\bar{X}^TA\bar{X} \right) (Q^*)^T \right\| _F^2 \nonumber \\&\quad = \left\| A\bar{X}B-\bar{X}B\bar{X}^TA\bar{X} \right\| _F^2 \nonumber \\&\quad = \sum _{j=1}^{n_B} \left\| b_{t_j}A\bar{X}_j - \sum _{k=1}^{n_B} b_{t_k}\bar{X}_k\left( \bar{X}_k^TA\bar{X}_j \right) \right\| _F^2. \end{aligned}$$
(51)

Now, observe that the columns of \(\bar{X}\) are orthonormal and span an n-dimensional subspace \(\mathcal {L}\). In particular, for \(j=1,\ldots ,n_B\), each column of \(A\bar{X}_j\) can be decomposed as \(u+v\), where u is a linear combination of the columns of \(\bar{X}\) and \(v\in \mathcal {L}^\perp \), the orthogonal complement of \(\mathcal {L}\). In view of the structure of \(\bar{X}\) in (18), this leads to

$$\begin{aligned} A\bar{X}_j = \sum _{k=1}^{n_B} \bar{X}_k\left( \bar{X}_k^TA\bar{X}_j \right) + T_j, \end{aligned}$$

where \(T_j \in \mathbb R^{m\times (t_j-t_{j-1})}\) is formed by projecting the columns of \(A\bar{X}_j\) onto \(\mathcal {L}^\perp \). Hence,

$$\begin{aligned} \left\| b_{t_j}A\bar{X}_j - \sum _{k=1}^{n_B} b_{t_k}\bar{X}_k\left( \bar{X}_k^TA\bar{X}_j \right) \right\| _F^2= & {} \sum _{k\not =j} (b_{t_j}-b_{t_k})^2 \left\| \bar{X}_k \left( \bar{X}_k^TA\bar{X}_j \right) \right\| _F^2\\&+\, b_{t_j}^2 \cdot \Vert T_j\Vert _F^2 \\\ge & {} \lambda _B^2 \left( \sum _{k\not =j} \left\| \bar{X}_k \left( \bar{X}_k^TA\bar{X}_j \right) \right\| _F^2 + \Vert T_j\Vert _F^2 \right) \\= & {} \lambda _B^2 \cdot \left\| A\bar{X}_j - \bar{X}_j\bar{X}_j^TA\bar{X}_j \right\| _F^2, \end{aligned}$$

where \(\lambda _B=\min \{\lambda _{B,g},\lambda _{B,s}\}\), \(\lambda _{B,g} = \min _{j\in \{1,\ldots ,n_B-1\}} (b_{t_j}-b_{t_{j+1}}) > 0\), and \(\lambda _{B,s} = \min _{j\in \{1,\ldots ,n_B\}} |b_{t_j}|>0\). Combining the above with (51) completes the proof.

Proof of Proposition 7

Consider a fixed \(j\in \{1,\ldots ,n_B\}\). Let \(\Delta _k\) be the kth column of \(A\bar{X}_j-\bar{X}_j\bar{X}_j^TA\bar{X}_j\), where \(k=1,\ldots ,t_j-t_{j-1}\). Since

$$\begin{aligned} \left\| A\bar{X}_j-\bar{X}_j\bar{X}_j^TA\bar{X}_j \right\| _F^2 = \sum _{k=1}^{t_j-t_{j-1}} \Vert \Delta _k\Vert _2^2, \end{aligned}$$

our goal is to establish a lower bound on \(\Vert \Delta _k\Vert _2^2\) for \(k=1,\ldots ,t_j-t_{j-1}\). Towards that end, let \(\bar{x}_k\) be the kth column of \(\bar{X}_j\) and \((\bar{x}_k)_\alpha \) be the \(\alpha \)th entry of \(\bar{x}_k\), where \(k=1,\ldots ,t_j-t_{j-1}\) and \(\alpha =1,\ldots ,m\). Then, we can write

$$\begin{aligned} \Delta _k = A\bar{x}_k - \sum _{\ell =1}^{t_j-t_{j-1}} \bar{x}_\ell \left( \bar{x}_\ell ^TA\bar{x}_k \right) . \end{aligned}$$
(52)

Suppose that \(\mathrm{dist}(X,\mathcal {X}_{h,\Pi }) = \left\| \bar{X} - E(h) \Pi \right\| _F = \tau \) for some \(\tau \in (0,1)\). Using the representations of \(\bar{X}\) and \(E(h)\Pi \) in (18), we have

$$\begin{aligned} (\bar{x}_k)_\alpha \in \left\{ \begin{array}{l@{\quad }l} [1-\tau ,1+\tau ] &{} \text{ if } \alpha =\iota (k), \\ {[}-\tau ,\tau ] &{} \text{ otherwise }, \end{array} \right. \end{aligned}$$
(53)

where \(\iota (k)\) is the coordinate of the kth column of \(\bar{E}_j(h)\) that equals 1. Now, by (52),

$$\begin{aligned} \Delta _k= & {} A\bar{x}_k - \bar{x}_k \left( \bar{x}_k^TA\bar{x}_k \right) - \sum _{\ell \not =k} \bar{x}_\ell \left( \bar{x}_\ell ^TA\bar{x}_k \right) \\= & {} \left( A - a_{\iota (k)}I_m \right) \bar{x}_k + \left( a_{\iota (k)}-\bar{x}_k^TA\bar{x}_k \right) \bar{x}_k - \sum _{\ell \not =k} \bar{x}_\ell \left( \bar{x}_\ell ^TA\bar{x}_k \right) . \end{aligned}$$

Let \(\mathrm{proj}_{\mathcal {I}_j}\) be the projector onto the coordinates in \(\mathcal {I}_j = \left\{ k \in \{1,\ldots ,m\} \mid \left[ \bar{E}_j(h) \right] _k = \mathbf 0\right\} \) (recall that \(\left[ \bar{E}_j(h) \right] _k\) is the kth row of \(\bar{E}_j(h)\)). Clearly, we have

$$\begin{aligned} \Vert \Delta _k\Vert _2&\ge \Vert \mathrm{proj}_{\mathcal {I}_j}(\Delta _k)\Vert _2 \nonumber \\&\ge \left\| \mathrm{proj}_{\mathcal {I}_j}\left( \left( A - a_{\iota (k)}I_m \right) \bar{x}_k \right) \right\| _2 - \sum _{\ell =1}^{t_j-t_{j-1}} |\nu _\ell | \cdot \Vert \mathrm{proj}_{\mathcal {I}_j}(\bar{x}_\ell )\Vert _2, \end{aligned}$$
(54)

where

$$\begin{aligned} \nu _\ell = \left\{ \begin{array}{l@{\quad }l} a_{\iota (k)} - \bar{x}_k^TA\bar{x}_k &{} \text{ if } \ell =k, \\ \bar{x}_\ell ^TA\bar{x}_k &{} \text{ otherwise }. \end{array} \right. \end{aligned}$$

Let \(\lambda _{A,m}=\max _{i\in \{1,\ldots ,n_A\}} |a_{s_i}|\) be the largest (in magnitude) eigenvalue of A. Using (53) and the fact that \(\iota (k)\not =\iota (\ell )\) whenever \(k\not =\ell \), we bound

$$\begin{aligned} \left| a_{\iota (k)} - \bar{x}_k^TA\bar{x}_k \right| \le \left| a_{\iota (k)} \left( 1-(\bar{x}_k)_{\iota (k)}^2 \right) \right| + \left| \sum _{\alpha \not =\iota (k)} a_\alpha (\bar{x}_k)_\alpha ^2 \right| \le \lambda _{A,m} (m\tau ^2+2\tau ) \end{aligned}$$

and

$$\begin{aligned} \left| \bar{x}_\ell ^TA\bar{x}_k \right| \le \lambda _{A,m} \sum _{\alpha =1}^m |(\bar{x}_\ell )_\alpha | \cdot |(\bar{x}_k)_\alpha | \le \lambda _{A,m}(m\tau ^2+2\tau ) \quad \text{ for } \ell \not =k. \end{aligned}$$

This implies that \(|\nu _\ell | \le \lambda _{A,m}(m\tau ^2+2\tau )\) for \(\ell =1,\ldots ,t_j-t_{j-1}\). Moreover, since \(\bar{x}_1,\ldots ,\bar{x}_{t_j-t_{j-1}}\) are the columns of \(\bar{X}_j\), by Proposition 5, the definition of \(\mathcal {I}_j\), and the assumption that \(\mathrm{dist}(X,\mathcal {X}_{h,\Pi }) = \tau \), we have

$$\begin{aligned} \sum _{\ell =1}^{t_j-t_{j-1}} \left\| \mathrm{proj}_{\mathcal {I}_j}(\bar{x}_\ell ) \right\| _2^2 = \sum _{k\in \mathcal {I}_j} \left\| \left[ \bar{X}_j \right] _k \right\| _2^2 \le \tau ^2. \end{aligned}$$

It follows from (54) that

$$\begin{aligned} \Vert \Delta _k\Vert _2 \ge \left\| \mathrm{proj}_{\mathcal {I}_j}\left( \left( A - a_{\iota (k)}I_m \right) \bar{x}_k \right) \right\| _2 - \lambda _{A,m}\sqrt{t_j-t_{j-1}}(m\tau ^2+2\tau )\tau . \end{aligned}$$
(55)

Next, we bound the first term on the right-hand side of the above inequality. Considering the structure of A in (8), let \(i'\in \{0,1,\ldots ,n_A-1\}\) be such that \(s_{i'}+1 \le \iota (k) \le s_{i'+1}\) and recall that \(\lambda _{A,g} = \min _{i\in \{1,\ldots ,n_A-1\}} (a_{s_i}-a_{s_{i+1}}) > 0\). Then, we have

$$\begin{aligned}&\left\| \mathrm{proj}_{\mathcal {I}_j} \left( \left( A - a_{\iota (k)}I_m \right) \bar{x}_k\right) \right\| _2^2 \nonumber \\&= \sum _{i\not =i'} \sum _{\alpha \in \mathcal {I}_j \cap \{s_i+1,\ldots ,s_{i+1}\}} \left( \left( a_{s_i+1} - a_{\iota (k)} \right) (\bar{x}_k)_\alpha \right) ^2 \nonumber \\&\ge \lambda _{A,g}^2 \sum _{i\not =i'} \sum _{\alpha \in \mathcal {I}_j \cap \{s_i+1,\ldots ,s_{i+1}\}} (\bar{x}_k)_\alpha ^2 \nonumber \\&= \lambda _{A,g}^2 \left( \left\| \mathrm{proj}_{\mathcal {I}_j} \left( \bar{x}_k\right) \right\| _2^2 - \left\| \mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k) \right\| _2^2 \right) . \end{aligned}$$
(56)

To bound the term \(\left\| \mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k) \right\| _2^2\), we proceed as follows. Let \(\bar{Y} = XQ^*\Pi ^T \in \mathrm{St}(m,n)\). Then, we have \(\bar{X} = (P^*)^TXQ^* = (P^*)^T\bar{Y}\Pi \) and

$$\begin{aligned} \mathrm{dist}(X,\mathcal {X}_{h,\Pi }) = \left\| \bar{X} - E(h)\Pi \right\| _F = \left\| (P^*)^T\bar{Y} - E(h) \right\| _F. \end{aligned}$$

We are now interested in locating the entries of \(\mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k)\) in the matrix \((P^*)^T\bar{Y}\). Towards that end, recall that \(P^*=\mathrm{BlkDiag}\left( P_1^*,\ldots ,P_{n_A}^* \right) \) and consider the decomposition

$$\begin{aligned} (P^*)^T\bar{Y} = \begin{bmatrix} (P_1^*)^T\bar{Y}_{1,1}&\quad \cdots&\quad (P_1^*)^T\bar{Y}_{1,n_A} \\ \vdots&\quad \ddots&\quad \vdots \\ (P_{n_A}^*)^T\bar{Y}_{n_A,1}&\quad \cdots&\quad (P_{n_A}^*)^T\bar{Y}_{n_A,n_A} \end{bmatrix}, \end{aligned}$$
(57)

where \(P_i^* \in \mathcal {O}^{s_i-s_{i-1}}\) and \(\bar{Y}_{i,i} \in \mathbb R^{(s_i-s_{i-1})\times h_i}\), for \(i=1,\ldots ,n_A\). Since \(\iota (k)\) is the coordinate of the kth column of \(\bar{E}_j(h)\) that equals 1 and \(s_{i'}+1 \le \iota (k) \le s_{i'+1}\), we see from (10) and (11) that the kth column of \(\bar{E}_j(h)\) belongs to

$$\begin{aligned} E_{i'+1}(h) = \left[ \begin{array}{c} \mathbf 0_{s_{i'}\times h_{i'+1}} \\ \hline I_{h_{i'+1}} \\ \mathbf 0_{(s_{i'+1}-s_{i'}-h_{i'+1})\times h_{i'+1}} \\ \hline \mathbf 0_{(m-s_{i'+1})\times h_{i'+1}} \end{array} \right] . \end{aligned}$$
(58)

As \(\bar{x}_k\) is the kth column of \(\bar{X}_j\) and \(\left\| \bar{X} - E(h)\Pi \right\| _F^2 = \sum _{j=1}^{n_B} \Vert \bar{X}_j - \bar{E}_j(h)\Vert _F^2\), it follows that all the entries of \(\mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k)\) lie in \((P_{i'+1}^*)^T\bar{Y}_{i'+1,i'+1}\). Furthermore, by (58) and the definition of \(\mathcal {I}_j\), the entries of \(\mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k)\) do not intersect the diagonal of the top \(h_{i'+1}\times h_{i'+1}\) block of \((P_{i'+1}^*)^T\bar{Y}_{i'+1,i'+1}\). Consequently, we have

$$\begin{aligned} \left\| \mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k) \right\| _2^2 \le \left\| (P_{i'+1}^*)^T\bar{Y}_{i'+1,i'+1} - \begin{bmatrix} I_{h_{i'+1}} \\ \mathbf 0\end{bmatrix} \right\| _F^2. \end{aligned}$$
(59)

To obtain an upper bound on the right-hand side of (59), we need the following lemma:

Lemma 2

Consider the decomposition of \((P^*)^T\bar{Y}\) in (57). For \(i=1,\ldots ,n_A\), let

$$\begin{aligned} v_i^* = \min \left\{ \left. \left\| P_i^T\bar{Y}_{i,i} - \begin{bmatrix} I_{h_i} \\ \mathbf 0\end{bmatrix} \right\| _F^2 \,\right| \, P_i \in \mathcal {O}^{s_i-s_{i-1}} \right\} . \end{aligned}$$
(60)

Suppose that \(v_i^*<1\). Then, we have

$$\begin{aligned} \frac{1}{4}\left\| \sum _{j\not =i} \bar{Y}_{j,i}^T\bar{Y}_{j,i} \right\| _F^2 \le v_i^* \le \left\| \sum _{j\not =i} \bar{Y}_{j,i}^T\bar{Y}_{j,i} \right\| _F^2. \end{aligned}$$

Let us defer the proof of Lemma 2 to the end of this section. Now, observe that by (11) and (17),

$$\begin{aligned} \mathrm{dist}^2(X,\mathcal {X}_{h,\Pi }) =&\min \left\{ \left\| \mathrm{BlkDiag}\left( P_1^T,\ldots ,P_{n_A}^T \right) \cdot \bar{Y}- E(h) \right\| _F^2\,\Bigg |\right. \nonumber \\&\quad \quad \quad \left. \, P_i \in \mathcal {O}^{s_i-s_{i-1}} \text{ for } i=1,\ldots ,n_A \right\} \nonumber \\ =&\left. \sum _{i=1}^{n_A} \min \left\{ \left\| P_i^T\bar{Y}_{i,i} - \begin{bmatrix} I_{h_i} \\ \mathbf 0\end{bmatrix} \right\| _F^2 \,\right| \, P_i\in \mathcal {O}^{s_i-s_{i-1}} \right\} \nonumber \\&+ \sum _{1\le i\not =j \le n_A} \Vert \bar{Y}_{i,j}\Vert _F^2. \end{aligned}$$
(61)

Since \(\mathrm{dist}(X,\mathcal {X}_{h,\Pi }) = \tau \) for some \(\tau \in (0,1)\), we have \(\sum _{1\le i\not =j\le n_A}\Vert \bar{Y}_{i,j}\Vert _F^2 \le \tau ^2\) from (61). Hence, by Lemma 2 and (59), we have

$$\begin{aligned} v_i^* \le \left( \sum _{j\not =i} \Vert \bar{Y}_{j,i}\Vert _F^2 \right) ^2 \le \tau ^4 \qquad \text{ for } i=1,\ldots ,n_A \end{aligned}$$

and

$$\begin{aligned} \left\| \mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k) \right\| _2^2 \le v_{i'+1}^* \le \tau ^4. \end{aligned}$$

This, together with (55), (56) and the fact that the implications

$$\begin{aligned} c \ge a - b \quad \Longrightarrow \quad a^2 \le 2(b^2+c^2) \quad \Longrightarrow \quad c^2 \ge \frac{a^2}{2}-b^2 \end{aligned}$$

hold for any \(a,b,c\in \mathbb R\), yields

$$\begin{aligned} \Vert \Delta _k\Vert _2^2 \ge \frac{\lambda _{A,g}^2}{2} \left( \left\| \mathrm{proj}_{\mathcal {I}_j} \left( \bar{x}_k\right) \right\| _2^2 - \tau ^4 \right) - \lambda _{A,m}^2(t_j-t_{j-1})(m\tau ^2+2\tau )^2\tau ^2. \end{aligned}$$

It follows that

$$\begin{aligned}&\left\| A\bar{X}_j-\bar{X}_j\bar{X}_j^TA\bar{X}_j \right\| _F^2 \,\,\,=\,\,\, \sum _{k=1}^{t_j-t_{j-1}} \Vert \Delta _k\Vert _2^2 \\&\quad \ge \frac{\lambda _{A,g}^2}{2}\sum _{k=1}^{t_j-t_{j-1}} \left\| \mathrm{proj}_{\mathcal {I}_j} \left( \bar{x}_k\right) \right\| _2^2 \\&\qquad - (t_j-t_{j-1}) \left( \frac{\lambda _{A,g}^2\tau ^4}{2} + \lambda _{A,m}^2(t_j-t_{j-1})(m\tau ^2+2\tau )^2\tau ^2 \right) \\&\quad = \frac{\lambda _{A,g}^2}{2} \sum _{k\in \mathcal {I}_j} \left\| \left[ \bar{X}_j\right] _k \right\| _2^2 - (t_j-t_{j-1})\left( \frac{\lambda _{A,g}^2\tau ^4}{2} + \lambda _{A,m}^2(t_j-t_{j-1})(m\tau ^2+2\tau )^2\tau ^2 \right) \end{aligned}$$

(recall that \(\left[ \bar{X}_j\right] _k\) is the kth row of \(\bar{X}_j\)). Upon summing both sides of the above inequality over \(j=1,\ldots ,n_B\) and using Proposition 5 and the assumption that \(\mathrm{dist}(X,\mathcal {X}_{h,\Pi })=\tau \), we obtain

$$\begin{aligned} \sum _{j=1}^{n_B} \left\| A\bar{X}_j-\bar{X}_j\bar{X}_j^TA\bar{X}_j \right\| _F^2\ge & {} \frac{\lambda _{A,g}^2}{4} \cdot \mathrm{dist}^2(X,\mathcal {X}_{h,\Pi }) \\&- \frac{n\lambda _{A,g}^2\tau ^4}{2} - n^2\lambda _{A,m}^2(m\tau ^2+2\tau )^2\tau ^2 \\\ge & {} \frac{\lambda _{A,g}^2}{8} \sum _{j=1}^{n_B}\sum _{k\in \mathcal {I}_j} \left\| \left[ \bar{X}_j\right] _k \right\| _2^2 \end{aligned}$$

whenever \(\tau \in (0,1)\) satisfies

$$\begin{aligned} \left( \frac{n\lambda _{A,g}^2}{2} + n^2\lambda _{A,m}^2(m+2)^2 \right) \tau ^2 \le \frac{\lambda _{A,g}^2}{8}. \end{aligned}$$

To complete the proof, it remains to prove Lemma 2.

Proof of Lemma 2

Consider a fixed \(i\in \{1,\ldots ,n_A\}\). Note that Problem (60) is again an instance of the orthogonal Procrustes problem. Hence, by the result in [39], an optimal solution to Problem (60) is given by

$$\begin{aligned} P_i^* = H_i \begin{bmatrix} W_i^T&\quad \mathbf 0\\ \mathbf 0&\quad I_{s_i-s_{i-1}-h_i} \end{bmatrix}, \end{aligned}$$

where \(\bar{Y}_{i,i}=H_i \begin{bmatrix} \Sigma _i \\ \mathbf 0\end{bmatrix} W_i^T\) is a singular value decomposition of \(\bar{Y}_{i,i}\). It follows from (60) that

$$\begin{aligned} v_i^* = \left\| (P_i^*)^T \bar{Y}_{i,i} - \begin{bmatrix} I_{h_i} \\ \mathbf 0\end{bmatrix} \right\| _F^2 = \Vert \Sigma _i - I_{h_i}\Vert _F^2. \end{aligned}$$

Now, since \(\bar{Y}\in \mathrm{St}(m,n)\), we have

$$\begin{aligned} \bar{Y}_{i,i}^T\bar{Y}_{i,i} + \sum _{j\not =i} \bar{Y}_{j,i}^T\bar{Y}_{j,i} = W_i\Sigma _i^2W_i^T + \sum _{j\not =i} \bar{Y}_{j,i}^T\bar{Y}_{j,i} = I_{h_i}, \end{aligned}$$

or equivalently,

$$\begin{aligned} \Sigma _i^2 + W_i^T \left( \sum _{j\not =i} \bar{Y}_{j,i}^T\bar{Y}_{j,i} \right) W_i = I_{h_i}. \end{aligned}$$

By following the arguments in the proof of Lemma 1, we conclude that

$$\begin{aligned} \frac{1}{4}\left\| \sum _{j\not =i} \bar{Y}_{j,i}^T\bar{Y}_{j,i} \right\| _F^2 \le v_i^* \le \left\| \sum _{j\not =i} \bar{Y}_{j,i}^T\bar{Y}_{j,i} \right\| _F^2, \end{aligned}$$

as desired. \(\square \)

Second-order boundedness of some retractions on \(\mathrm{St}(m,n)\)

1.1 Second-order boundedness of \(R_\mathsf{polar}\)

Let \(X\in \mathrm{St}(m,n)\) and \(\xi \in T(X)\) be arbitrary. By definition, we have

$$\begin{aligned} \Vert R_\mathsf{polar}(X,\xi ) - (X+\xi ) \Vert _F= & {} \Vert (X+\xi )(I_n + \xi ^T\xi )^{-1/2} - (X+\xi ) \Vert _F \\\le & {} \Vert X+\xi \Vert \cdot \Vert (I_n + \xi ^T\xi )^{-1/2} - I_n \Vert _F. \end{aligned}$$

Let \(\xi ^T\xi =U\Sigma U^T\) be a spectral decomposition of \(\xi ^T\xi \) with \(\Sigma =\mathrm{Diag}(\lambda _1,\ldots ,\lambda _n)\) and \(\lambda _1,\ldots ,\lambda _n\ge 0\). Then, a simple calculation yields

$$\begin{aligned} \Vert (I_n + \xi ^T\xi )^{-1/2} - I_n \Vert _F^2 = \sum _{i=1}^n ((1+\lambda _i)^{-1/2}-1)^2 \le \frac{1}{4}\sum _{i=1}^n \lambda _i^2 = \frac{1}{4} \cdot \Vert \xi ^T\xi \Vert _F^2. \end{aligned}$$

Since \(\Vert X+\xi \Vert \le \Vert X\Vert +\Vert \xi \Vert \le 1+\Vert \xi \Vert _F\), we conclude that whenever \(\Vert \xi \Vert _F \le 1\),

$$\begin{aligned} \Vert R_\mathsf{polar}(X,\xi ) - (X+\xi ) \Vert _F \le \Vert \xi \Vert _F^2; \end{aligned}$$

i.e., \(R_\mathsf{polar}\) satisfies Property (P) with \(\phi =M=1\).
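This bound is easy to confirm numerically. The sketch below (illustrative dimensions and trial count) draws \(X\in \mathrm{St}(m,n)\), projects a random matrix onto \(T(X)\) so that \(X^T\xi +\xi ^TX=\mathbf 0\), rescales it to satisfy \(\Vert \xi \Vert _F\le 1\), and checks that \(\Vert R_\mathsf{polar}(X,\xi )-(X+\xi )\Vert _F \le \Vert \xi \Vert _F^2\).

```python
# A minimal numerical sketch of Property (P) for R_polar with phi = M = 1;
# the dimensions and the number of random trials are illustrative.
import numpy as np

rng = np.random.default_rng(2)
m, n = 10, 4

def rand_stiefel(m, n):
    Q, R = np.linalg.qr(rng.standard_normal((m, n)))
    return Q * np.sign(np.diag(R))                    # a point on St(m, n)

def tangent(X, Z):
    # Orthogonal projection of Z onto T(X), so that X^T xi + xi^T X = 0.
    return Z - X @ (X.T @ Z + Z.T @ X) / 2

def retr_polar(X, xi):
    # R_polar(X, xi) = (X + xi)(I_n + xi^T xi)^{-1/2}.
    w, V = np.linalg.eigh(np.eye(xi.shape[1]) + xi.T @ xi)
    return (X + xi) @ (V @ np.diag(w ** -0.5) @ V.T)

for _ in range(100):
    X = rand_stiefel(m, n)
    xi = tangent(X, rng.standard_normal((m, n)))
    xi *= rng.uniform(0.01, 1.0) / np.linalg.norm(xi, "fro")      # ||xi||_F <= 1
    err = np.linalg.norm(retr_polar(X, xi) - (X + xi), "fro")
    assert err <= np.linalg.norm(xi, "fro") ** 2 + 1e-12
```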

1.2 Second-order boundedness of \(R_\mathsf{QR}\)

Let \(X\in \mathrm{St}(m,n)\) and \(\xi \in T(X)\) be arbitrary. Suppose that \(\Vert \xi \Vert _F \le 1/2\). Then, for any \(t\in [-1,1]\), the matrix \(X(t)=X+t\xi \) has full column rank and hence admits a unique thin QR-decomposition \(X(t)=Q(t)R(t)\), where \(Q(t)\in \mathrm{St}(m,n)\) and \(R(t)\in \mathbb R^{n\times n}\) are both differentiable and R(t) is upper triangular with positive diagonal entries; see, e.g., [12]. Since the unique thin QR-decomposition of X is given by \(X=XI_n\), we have \(R(0)=I_n\). This, together with the fact that \(\Vert Q(t)\Vert \le 1\), implies

$$\begin{aligned} \Vert R_\mathsf{QR}(X,\xi ) - (X+\xi ) \Vert _F= & {} \Vert Q(1)(I_n-R(1)) \Vert _F \le \Vert R(1)-R(0) \Vert _F \nonumber \\\le & {} \int _0^1 \Vert R'(t) \Vert _F \,dt. \end{aligned}$$
(62)

To bound \(\Vert R'(t)\Vert _F\), we adopt the so-called matrix equation approach in [11, 47]. Using the identity \(R(t)^TR(t)=X(t)^TX(t)\) and the fact that \(\xi \in T(X)\) implies \(X^T\xi +\xi ^TX=\mathbf 0\), we have

$$\begin{aligned} R(t)^TR(t) = I_n + t^2\xi ^T\xi . \end{aligned}$$
(63)

Differentiating both sides of (63) with respect to t yields

$$\begin{aligned} R'(t)^TR(t) + R(t)^TR'(t) = 2t\xi ^T\xi . \end{aligned}$$

In particular, since R(t) is invertible, we have

$$\begin{aligned} \left( R'(t)R(t)^{-1} \right) ^T + R'(t)R(t)^{-1} = 2t \left( R(t)^{-1} \right) ^T(\xi ^T\xi )R(t)^{-1}. \end{aligned}$$

Now, observe that \(R'(t)R(t)^{-1}\) is upper triangular. Thus, the above identity implies that

$$\begin{aligned} R'(t) = 2t\cdot \mathrm{up}\left[ \left( R(t)^{-1} \right) ^T(\xi ^T\xi )R(t)^{-1} \right] \cdot R(t), \end{aligned}$$

where for any \(C\in \mathbb R^{n\times n}\),

$$\begin{aligned} {[}\mathrm{up}(C)]_{ij} = \left\{ \begin{array}{l@{\quad }l} C_{ij} &{} \text{ if } i<j, \\ C_{ii}/2 &{} \text{ if } i=j, \\ 0 &{} \text{ otherwise }. \end{array} \right. \end{aligned}$$

Let \(\lambda _1,\ldots ,\lambda _n \ge 0\) be the eigenvalues of \(\xi ^T\xi \). Using (63) and the fact that \(2 \cdot \Vert \mathrm{up}(C)\Vert _F^2 \le \Vert C\Vert _F^2\) for any \(C\in \mathcal {S}^n\), we bound

$$\begin{aligned} 2 \left\| \mathrm{up}\left[ \left( R(t)^{-1} \right) ^T(\xi ^T\xi )R(t)^{-1} \right] \right\| _F^2\le & {} \left\| \left( R(t)^{-1} \right) ^T(\xi ^T\xi )R(t)^{-1} \right\| _F^2 \\= & {} \sum _{i=1}^n \left( \frac{\lambda _i}{1+t^2\lambda _i} \right) ^2 \\\le & {} \Vert \xi ^T\xi \Vert _F^2. \end{aligned}$$

On the other hand, we have \(\Vert R(t)\Vert \le \sqrt{1+t^2\cdot \Vert \xi \Vert ^2} \le \sqrt{5}/2\) by (63) and the assumption that \(\Vert \xi \Vert _F \le 1/2\) and \(t\in [-1,1]\). It follows that

$$\begin{aligned} \Vert R'(t)\Vert _F \le 2t \cdot \left\| \mathrm{up}\left[ \left( R(t)^{-1} \right) ^T(\xi ^T\xi )R(t)^{-1} \right] \right\| _F \cdot \Vert R(t) \Vert \le \frac{\sqrt{10}t}{2} \cdot \Vert \xi \Vert _F^2. \end{aligned}$$

Upon substituting this into (62) and integrating, we obtain

$$\begin{aligned} \Vert R_\mathsf{QR}(X,\xi ) - (X+\xi ) \Vert _F \le \frac{\sqrt{10}}{4} \cdot \Vert \xi \Vert _F^2; \end{aligned}$$

i.e., \(R_\mathsf{QR}\) satisfies Property (P) with \(\phi =1/2\) and \(M=\sqrt{10}/4\).
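As with the polar retraction, the constant can be checked numerically. The sketch below (illustrative dimensions and trial count) forms \(R_\mathsf{QR}(X,\xi )\) as the Q factor of the thin QR-decomposition of \(X+\xi \) with positive diagonal entries in R and verifies \(\Vert R_\mathsf{QR}(X,\xi )-(X+\xi )\Vert _F \le (\sqrt{10}/4)\cdot \Vert \xi \Vert _F^2\) for \(\Vert \xi \Vert _F\le 1/2\).

```python
# A minimal numerical sketch of Property (P) for R_QR with phi = 1/2 and
# M = sqrt(10)/4; dimensions and trial count are illustrative.
import numpy as np

rng = np.random.default_rng(3)
m, n = 10, 4

def rand_stiefel(m, n):
    Q, R = np.linalg.qr(rng.standard_normal((m, n)))
    return Q * np.sign(np.diag(R))

def tangent(X, Z):
    return Z - X @ (X.T @ Z + Z.T @ X) / 2

def retr_qr(X, xi):
    # Q factor of the thin QR decomposition of X + xi, with the sign convention
    # that R has positive diagonal entries.
    Q, R = np.linalg.qr(X + xi)
    return Q * np.sign(np.diag(R))

M = np.sqrt(10) / 4
for _ in range(100):
    X = rand_stiefel(m, n)
    xi = tangent(X, rng.standard_normal((m, n)))
    xi *= rng.uniform(0.01, 0.5) / np.linalg.norm(xi, "fro")      # ||xi||_F <= 1/2
    err = np.linalg.norm(retr_qr(X, xi) - (X + xi), "fro")
    assert err <= M * np.linalg.norm(xi, "fro") ** 2 + 1e-12
```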

1.3 Second-order boundedness of \(R_\mathsf{cayley}\)

Let \(X\in \mathrm{St}(m,n)\) and \(\xi \in T(X)\) be arbitrary. Suppose that \(\Vert \xi \Vert _F \le 1/2\). Then, we have \(\Vert W(\xi )\Vert _F \le 2\cdot \Vert \xi \Vert _F \le 1\). Hence, we may write

$$\begin{aligned} \left( I_m - \frac{1}{2}W(\xi ) \right) ^{-1} = \sum _{i=0}^\infty \left( \frac{1}{2}W(\xi ) \right) ^i. \end{aligned}$$

In particular, we have

$$\begin{aligned}&\Vert R_\mathsf{cayley}(X,\xi ) - (X+\xi )\Vert _F \\&\quad = \left\| \left( I_m + \frac{1}{2}W(\xi ) + \sum _{i=2}^\infty \left( \frac{1}{2}W(\xi ) \right) ^i \right) \left( I_m + \frac{1}{2}W(\xi ) \right) X - (X+\xi ) \right\| _F \\&\quad = \left\| (W(\xi )X-\xi ) + \frac{1}{4}W(\xi )^2X + \left( \sum _{i=2}^\infty \left( \frac{1}{2}W(\xi ) \right) ^i \right) \left( I_m + \frac{1}{2}W(\xi ) \right) X \right\| _F. \end{aligned}$$

Now, observe that

$$\begin{aligned} W(\xi )X - \xi= & {} \left( I_m-\frac{1}{2}XX^T \right) \xi - \frac{1}{2}X\xi ^TX - \xi = -\frac{1}{2} X(X^T\xi +\xi ^TX) = \mathbf 0, \end{aligned}$$

where the last equality follows from the fact that \(\xi \in T(X)\). Hence, we obtain

$$\begin{aligned}&\Vert R_\mathsf{cayley}(X,\xi ) - (X+\xi )\Vert _F \\&\le \frac{1}{4} \cdot \Vert W(\xi )\Vert _F^2 + \left[ \sum _{i=2}^\infty \left( \frac{1}{2^i} + \frac{1}{2^{i+1}} \right) \right] \cdot \Vert W(\xi )\Vert _F^2 \\&\le 4 \cdot \Vert \xi \Vert _F^2; \end{aligned}$$

i.e., \(R_\mathsf{cayley}\) satisfies Property (P) with \(\phi =1/2\) and \(M=4\).
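A corresponding numerical check is sketched below. It assumes the form \(W(\xi ) = \left( I_m-\tfrac{1}{2}XX^T \right) \xi X^T - X\xi ^T\left( I_m-\tfrac{1}{2}XX^T \right) \) (cf. [55]), which is consistent with the expression for \(W(\xi )X\) in the display above; the dimensions and trial count are illustrative.

```python
# A minimal numerical sketch of Property (P) for R_cayley with phi = 1/2 and
# M = 4.  Assumption: W(xi) = (I - X X^T/2) xi X^T - X xi^T (I - X X^T/2)
# (cf. [55]), consistent with the identity W(xi) X = xi used above;
# dimensions and trial count are illustrative.
import numpy as np

rng = np.random.default_rng(4)
m, n = 10, 4

def rand_stiefel(m, n):
    Q, R = np.linalg.qr(rng.standard_normal((m, n)))
    return Q * np.sign(np.diag(R))

def tangent(X, Z):
    return Z - X @ (X.T @ Z + Z.T @ X) / 2

def retr_cayley(X, xi):
    d = X.shape[0]
    P = np.eye(d) - 0.5 * X @ X.T
    W = P @ xi @ X.T - X @ xi.T @ P                    # skew-symmetric
    return np.linalg.solve(np.eye(d) - 0.5 * W, (np.eye(d) + 0.5 * W) @ X)

for _ in range(100):
    X = rand_stiefel(m, n)
    xi = tangent(X, rng.standard_normal((m, n)))
    xi *= rng.uniform(0.01, 0.5) / np.linalg.norm(xi, "fro")      # ||xi||_F <= 1/2
    err = np.linalg.norm(retr_cayley(X, xi) - (X + xi), "fro")
    assert err <= 4 * np.linalg.norm(xi, "fro") ** 2 + 1e-12
```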

Proof of Proposition 10

We first establish the inequality (44). Define \(\epsilon ^{k+1} = R(X^k,-\alpha \xi ^k)-(X^k-\alpha \xi ^k) = X^{k+1} - (X^k-\alpha \xi ^k)\) for \(k=0,1,\ldots ,\Gamma -1\). Then,

$$\begin{aligned} F(X^{k+1})= & {} \mathrm{tr}\left[ (X^k-\alpha \xi ^k+\epsilon ^{k+1})^TA(X^k-\alpha \xi ^k+\epsilon ^{k+1})B \right] \nonumber \\= & {} F(X^k) - \alpha \cdot \mathrm{tr}\left[ \left( (X^k)^TA\xi ^k + (\xi ^k)^TAX^k \right) B \right] \nonumber \\&+\,\,\mathrm{tr}\left[ \left( (X^k)^TA\epsilon ^{k+1} + (\epsilon ^{k+1})^TAX^k \right) B \right] \nonumber \\&-\,\,\alpha \cdot \mathrm{tr}\left[ \left( (\xi ^k)^TA\epsilon ^{k+1} + (\epsilon ^{k+1})^TA\xi ^k \right) B \right] \nonumber \\&+\,\, \alpha ^2\cdot \mathrm{tr}\left[ (\xi ^k)^TA\xi ^kB \right] + \mathrm{tr}\left[ (\epsilon ^{k+1})^TA\epsilon ^{k+1}B \right] . \end{aligned}$$
(64)

Now, let us bound the terms in (64) in turn. Using the fact that \(\xi ^k\) is the orthogonal projection of \(G^k\) onto \(T(X^k)\) and \(\nabla F_i(X)=2A_iXB\), \(\nabla F(X)=2AXB\), we have

$$\begin{aligned} \Vert \xi ^k\Vert _F \le \Vert G^k\Vert _F \le 2\left( \Vert A_{i_k}X^kB\Vert _F + \Vert A_{i_k}X^0B\Vert _F + \Vert AX^0B\Vert _F \right) \!\le \! 6\cdot \Vert A\Vert _F\cdot \Vert B\Vert . \end{aligned}$$

By our choice of the step size \(\alpha \), we have \(\Vert \alpha \xi ^k\Vert _F \le \phi \le 1\). It follows from Property (P) and some simple calculation that

$$\begin{aligned} \mathrm{tr}\left[ \left( (X^k)^TA\epsilon ^{k+1} + (\epsilon ^{k+1})^TAX^k \right) B \right]\le & {} 2\cdot \Vert A\Vert \cdot \Vert B\Vert \cdot \Vert \epsilon ^{k+1}\Vert _F \nonumber \\\le & {} 2\alpha ^2M\cdot \Vert A\Vert \cdot \Vert B\Vert \cdot \Vert \xi ^k\Vert _F^2, \end{aligned}$$
(65)
$$\begin{aligned} -\mathrm{tr}\left[ \left( (\xi ^k)^TA\epsilon ^{k+1} + (\epsilon ^{k+1})^TA\xi ^k \right) B \right]\le & {} 2\cdot \Vert A\Vert \cdot \Vert B\Vert \cdot \Vert \xi ^k\Vert _F\cdot \Vert \epsilon ^{k+1}\Vert _F \nonumber \\\le & {} 2\alpha M\cdot \Vert A\Vert \cdot \Vert B\Vert \cdot \Vert \xi ^k\Vert _F^2, \end{aligned}$$
(66)
$$\begin{aligned} \mathrm{tr}\left[ (\epsilon ^{k+1})^TA\epsilon ^{k+1}B \right]\le & {} \Vert A\Vert \cdot \Vert B\Vert \cdot \Vert \epsilon ^{k+1}\Vert _F^2 \nonumber \\\le & {} \alpha ^2M^2 \cdot \Vert A\Vert \cdot \Vert B\Vert \cdot \Vert \xi ^k\Vert _F^2. \end{aligned}$$
(67)

Moreover, it is clear that

$$\begin{aligned} \mathrm{tr}\left[ (\xi ^k)^TA\xi ^kB \right] \le \Vert A\Vert \cdot \Vert B\Vert \cdot \Vert \xi ^k\Vert _F^2. \end{aligned}$$
(68)

Upon substituting (65)–(68) into (64) and simplifying, we obtain

$$\begin{aligned} F(X^{k+1}) - F(X^k) + \alpha \cdot \mathrm{tr}\left[ \left( (X^k)^TA\xi ^k + (\xi ^k)^TAX^k \right) B \right] \le c_0\alpha ^2\cdot \Vert \xi ^k\Vert _F^2 \end{aligned}$$

with \(c_0=(M^2+4M+1)\cdot \Vert A\Vert \cdot \Vert B\Vert \), as desired.

Next, we establish the inequality (45). Since \(\xi ^0=\mathrm{grad}\,F(X^0)=\mathrm{proj}_{T(X^0)}(\nabla F(X^0))\), where \(\mathrm{proj}_{T(X)}\) is the projector onto T(X), by the idempotence of \(\mathrm{proj}_{T(X)}\) and the fact that \(\nabla F(X)=2AXB\), we have

$$\begin{aligned} \mathrm{tr}\left[ \left( (X^0)^TA\xi ^0 + (\xi ^0)^TAX^0 \right) B \right] = \Vert \mathrm{grad}\,F(X^0)\Vert _F^2. \end{aligned}$$

Upon substituting this into (44) and noting that \(c_0 \le 1/(8\alpha )\), we obtain

$$\begin{aligned} F(X^1)-F(X^0) \le -\frac{7\alpha }{8} \cdot \Vert \mathrm{grad}\,F(X^0)\Vert _F^2. \end{aligned}$$

Since \(X^0=\tilde{X}^s\) and \(F(\tilde{X}^{s+1}) \le F(X^1)\) by lines 2 and 9 of Algorithm 2, respectively, the above inequality is equivalent to (45).

The inequality (45) shows that the sequence \(\{F(\tilde{X}^s)\}_{s\ge 0}\) is monotonically decreasing, which, together with the fact that F is bounded below on \(\mathrm{St}(m,n)\), implies that \(F(\tilde{X}^s) \searrow F^*\) for some \(F^*\in \mathbb R\). By the continuity of F, we conclude that every limit point \(X^*\) of the sequence \(\{\tilde{X}^s\}_{s\ge 0}\) satisfies \(F(X^*)=F^*\) and \(\mathrm{grad}\,F(X^*)=\mathbf 0\). This completes the proof of Proposition 10.
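For concreteness, the following is a minimal sketch of one epoch of a Stiefel-SVRG-type update of the form analyzed above. The finite-sum splitting \(A=\frac{1}{N}\sum _{i=1}^N A_i\) (so that \(F_i(X)=\mathrm{tr}(X^TA_iXB)\)), the polar retraction standing in for the generic retraction R, the step size and epoch length, and the choice of the best inner iterate as the next reference point (one choice compatible with \(F(\tilde{X}^{s+1}) \le F(X^1)\)) are all illustrative assumptions rather than the exact specification of Algorithm 2.

```python
# A minimal sketch of one epoch of a Stiefel-SVRG-type update for
# F(X) = tr(X^T A X B).  Illustrative assumptions: A = (1/N) sum_i A_i with
# F_i(X) = tr(X^T A_i X B), a polar retraction in place of the generic R,
# fixed step size and epoch length, and the best inner iterate as the next
# reference point.
import numpy as np

rng = np.random.default_rng(5)
m, n, N = 20, 3, 10
A_list = [(Ai + Ai.T) / 2 for Ai in (rng.standard_normal((m, m)) for _ in range(N))]
A = sum(A_list) / N
B = np.diag(np.arange(n, 0, -1.0))                     # b_1 > ... > b_n > 0

def F(X):            return np.trace(X.T @ A @ X @ B)
def grad_full(X):    return 2 * A @ X @ B              # nabla F(X)   = 2 A X B
def grad_comp(i, X): return 2 * A_list[i] @ X @ B      # nabla F_i(X) = 2 A_i X B

def proj_tangent(X, G):                                # grad F(X) = proj_{T(X)}(nabla F(X))
    return G - X @ (X.T @ G + G.T @ X) / 2

def retr_polar(X, xi):
    w, V = np.linalg.eigh(np.eye(xi.shape[1]) + xi.T @ xi)
    return (X + xi) @ (V @ np.diag(w ** -0.5) @ V.T)

def svrg_epoch(X_ref, alpha=1e-3, Gamma=100):
    g_ref = grad_full(X_ref)                           # full gradient at the reference point
    X, best = X_ref, X_ref
    for _ in range(Gamma):
        i = rng.integers(N)
        G = grad_comp(i, X) - grad_comp(i, X_ref) + g_ref   # variance-reduced estimator G^k
        xi = proj_tangent(X, G)                             # xi^k
        X = retr_polar(X, -alpha * xi)                      # X^{k+1} = R(X^k, -alpha xi^k)
        best = X if F(X) < F(best) else best
    return best

X, _ = np.linalg.qr(rng.standard_normal((m, n)))       # initial point on St(m, n)
for s in range(30):
    X = svrg_epoch(X)
print(F(X), np.linalg.norm(proj_tangent(X, grad_full(X)), "fro"))
```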

Proof of Corollary 2

Let \(\mathscr {F}_k\) be the \(\sigma \)-algebra generated by \(X^0,\ldots ,X^k\) for \(k=0,1,\ldots ,\Gamma -1\). Since \({\mathbb E}[ G^k \mid \mathscr {F}_k ] = \nabla F(X^k)\), we have \({\mathbb E}[\xi ^k \mid \mathscr {F}_k ] = \mathrm{grad}\, F(X^k) = \mathrm{proj}_{T(X^k)}(\nabla F(X^k))\). Again, using the idempotence of \(\mathrm{proj}_{T(X)}\) and the fact that \(\nabla F(X)=2AXB\), we obtain

$$\begin{aligned} {\mathbb E}\left[ \mathrm{tr}\left[ \left( (X^k)^TA\xi ^k + (\xi ^k)^TAX^k \right) B \right] \,\Big |\, \mathscr {F}_k \right] = \Vert \mathrm{grad}\,F(X^k)\Vert _F^2. \end{aligned}$$
(69)

On the other hand, the non-expansiveness of \(\mathrm{proj}_{T(X)}\) yields

$$\begin{aligned} \Vert \xi ^k\Vert _F\le & {} \Vert \xi ^k-\mathrm{grad}\,F(X^k)\Vert _F + \Vert \mathrm{grad}\,F(X^k)\Vert _F \nonumber \\= & {} \left\| \mathrm{proj}_{T(X^k)}(G^k)-\mathrm{proj}_{T(X^k)}(\nabla F(X^k)) \right\| _F + \Vert \mathrm{grad}\,F(X^k)\Vert _F \nonumber \\\le & {} \Vert G^k-\nabla F(X^k)\Vert _F + \Vert \mathrm{grad}\,F(X^k)\Vert _F \end{aligned}$$
(70)

and hence

$$\begin{aligned} \Vert \xi ^k\Vert _F^2 \le 2\left( \Vert G^k-\nabla F(X^k)\Vert _F^2 + \Vert \mathrm{grad}\,F(X^k)\Vert _F^2 \right) . \end{aligned}$$
(71)

By the definition of \(G^k\) and the fact that \(\nabla F_i\) (resp. \(\nabla F\)) is Lipschitz continuous with parameter \(L_{F_i} \le 2\cdot \Vert A_i\Vert \cdot \Vert B\Vert \) for \(i=1,\ldots ,N\) (resp. \(L_F\le 2\cdot \Vert A\Vert \cdot \Vert B\Vert \)), we have

$$\begin{aligned} \Vert G^k-\nabla F(X^k)\Vert _F\le & {} \left\| \nabla F_{i_k}(X^k) - \nabla F_{i_k}(X^0) \right\| _F + \left\| \nabla F(X^0) - \nabla F(X^k) \right\| _F \nonumber \\\le & {} c'\cdot \Vert X^k-X^0\Vert _F \end{aligned}$$
(72)

with \(c'=2\left( \max _{i\in \{1,\ldots ,N\}}\Vert A_i\Vert +\Vert A\Vert \right) \Vert B\Vert \). To bound \(\Vert X^k-X^0\Vert _F\), observe that

$$\begin{aligned} \Vert X^{k+1}-X^k\Vert _F= & {} \Vert \alpha \xi ^k+\epsilon ^{k+1}\Vert _F \nonumber \\\le & {} \alpha \cdot \Vert \xi ^k\Vert _F + \Vert \epsilon ^{k+1}\Vert _F \nonumber \\\le & {} \alpha \cdot \Vert \xi ^k\Vert _F + \alpha ^2M\cdot \Vert \xi ^k\Vert _F^2 \nonumber \\\le & {} \alpha (M+1)\cdot \Vert \xi ^k\Vert _F \end{aligned}$$
(73)
$$\begin{aligned}\le & {} \alpha (M+1)\left( c'\cdot \Vert X^k-X^0\Vert _F + \Vert \mathrm{grad}\,F(X^k)\Vert _F \right) , \end{aligned}$$
(74)

where (73) is due to the fact that \(\Vert \alpha \xi ^k\Vert _F \le \phi \le 1\) and (74) follows from (70). This yields

$$\begin{aligned} \Vert X^{k+1}-X^0\Vert _F\le & {} \Vert X^{k+1}-X^k\Vert _F + \Vert X^k-X^0\Vert _F \\\le & {} (c_1\alpha + 1) \cdot \Vert X^k-X^0\Vert _F + \alpha (M+1) \cdot \Vert \mathrm{grad}\,F(X^k)\Vert _F, \end{aligned}$$

where \(c_1=c'(M+1)\). In particular, we have

$$\begin{aligned} \Vert X^{k+1}-X^0\Vert _F \le \alpha (M+1) \sum _{j=0}^k (c_1\alpha +1)^{k-j} \cdot \Vert \mathrm{grad}\,F(X^j)\Vert _F, \end{aligned}$$
(75)

which implies that

$$\begin{aligned} \Vert X^{k+1}-X^0\Vert _F^2 \le \alpha ^2(M+1)^2(k+1) \sum _{j=0}^k (c_1\alpha + 1)^{2(k-j)} \cdot \Vert \mathrm{grad}\,F(X^j)\Vert _F^2. \end{aligned}$$
(76)

It follows from (71), (72), and (76) that

$$\begin{aligned} {\mathbb E}\left[ \Vert \xi ^k\Vert _F^2 \right]\le & {} 2c_1^2\alpha ^2k \sum _{j=0}^{k-1} (c_1\alpha +1)^{2(k-1-j)} {\mathbb E}\left[ \Vert \mathrm{grad}\,F(X^j)\Vert _F^2 \right] \\&+ 2{\mathbb E}\left[ \Vert \mathrm{grad}\,F(X^k)\Vert _F^2 \right] . \end{aligned}$$

This, together with (44) and (69), yields the desired result.

Proof of Proposition 11

By Proposition 10, the global error bound for Problem (QP-OC) (Corollary 1), and the fact that \(\mathrm{grad}\,F(X)=D_{1/4}(X)\), we have

$$\begin{aligned} F(\tilde{X}^{s+1}) - F(\tilde{X}^s) \le -\frac{7\alpha }{8\bar{\eta }^2} \cdot \mathrm{dist}^2(\tilde{X}^s,\mathcal {X}) \end{aligned}$$

for all \(s\ge 0\). Since \(F(\tilde{X}^s) \searrow F^*\), the above inequality implies the existence of \(s_0\ge 0\) such that \(\mathrm{dist}(\tilde{X}^s,\mathcal {X}) \le \delta /3\) for all \(s\ge s_0\), where \(\delta \in (0,\sqrt{2}/2)\) is the constant given in Theorem 1. Now, consider a fixed \(s\ge s_0\) and let \(\hat{X}^s,\hat{X}^{s+1}\in \mathcal {X}\) be such that \(\mathrm{dist}(\tilde{X}^s,\mathcal {X}) = \Vert \tilde{X}^s-\hat{X}^s\Vert _F\) and \(\mathrm{dist}(\tilde{X}^{s+1},\mathcal {X}) = \Vert \tilde{X}^{s+1}-\hat{X}^{s+1}\Vert _F\). Suppose that \(\hat{X}^s\in \mathcal {X}_{h,\Pi }\) and \(\hat{X}^{s+1}\in \mathcal {X}_{h',\Pi '}\) with \(\mathcal {X}_{h,\Pi }\cap \mathcal {X}_{h',\Pi '}=\emptyset \). Then, we have \(\Vert \hat{X}^s-\hat{X}^{s+1}\Vert _F \ge \sqrt{2} \ge 2\delta \) by Proposition 4. On the other hand, using (75), the fact that \(\Vert \mathrm{grad}\,F(X)\Vert _F \le \Vert \nabla F(X)\Vert _F \le 2\cdot \Vert A\Vert _F\cdot \Vert B\Vert \) for all \(X\in \mathrm{St}(m,n)\), and our choice of the step size \(\alpha \), the sequence \(X^0=\tilde{X}^s,X^1,\ldots ,X^\Gamma \) generated by Algorithm 2 in epoch s satisfies

$$\begin{aligned} \Vert X^{k+1}-X^0\Vert _F\le & {} 2\alpha (M+1)\cdot \Vert A\Vert _F\cdot \Vert B\Vert \sum _{j=0}^k (c_1\alpha + 1)^{k-j} \\= & {} \frac{2(M+1) \left( (c_1\alpha + 1)^{k+1}-1 \right) \cdot \Vert A\Vert _F\cdot \Vert B\Vert }{c_1} \\\le & {} \frac{\delta }{3} \end{aligned}$$

for \(k=0,1,\ldots ,\Gamma -1\). This implies that

$$\begin{aligned} \Vert \hat{X}^s-\hat{X}^{s+1} \Vert _F \le \Vert \hat{X}^s - \tilde{X}^s \Vert _F + \Vert \tilde{X}^{s+1}-\tilde{X}^s \Vert _F + \Vert \hat{X}^{s+1}-\tilde{X}^{s+1} \Vert _F \le \delta , \end{aligned}$$

which is a contradiction. Hence, we have \(\mathcal {X}_{h,\Pi }\cap \mathcal {X}_{h',\Pi '}\not =\emptyset \), which by Proposition 4 yields \(\mathcal {X}_{h,\Pi }=\mathcal {X}_{h',\Pi '}\). Consequently, we have \(\mathrm{dist}(\tilde{X}^s,\mathcal {X}) = \mathrm{dist}(\tilde{X}^s,\mathcal {X}_{h,\Pi }) \le \delta /3\) for all sufficiently large \(s\ge 0\). This, together with Proposition 10 and the fact that the function F is constant on \(\mathcal {X}_{h,\Pi }\), implies that every limit point of the sequence \(\{\tilde{X}^s\}_{s\ge 0}\) belongs to \(\mathcal {X}_{h,\Pi }\) and \(F(X)=F^*\) for all \(X\in \mathcal {X}_{h,\Pi }\). This completes the proof.

Cite this article

Liu, H., So, A.M.-C. & Wu, W. Quadratic optimization with orthogonality constraint: explicit Łojasiewicz exponent and linear convergence of retraction-based line-search and stochastic variance-reduced gradient methods. Math. Program. 178, 215–262 (2019). https://doi.org/10.1007/s10107-018-1285-1
