Abstract
The problem of optimizing a quadratic form over an orthogonality constraint (QP-OC for short) is one of the most fundamental matrix optimization problems and arises in many applications. In this paper, we characterize the growth behavior of the objective function around the critical points of the QP-OC problem and demonstrate how such characterization can be used to obtain strong convergence rate results for iterative methods that exploit the manifold structure of the orthogonality constraint (i.e., the Stiefel manifold) to find a critical point of the problem. Specifically, our primary contribution is to show that the Łojasiewicz exponent at any critical point of the QP-OC problem is 1/2. Such a result is significant, as it expands the currently very limited repertoire of optimization problems for which the Łojasiewicz exponent is explicitly known. Moreover, it allows us to show, in a unified manner and for the first time, that a large family of retraction-based line-search methods will converge linearly to a critical point of the QP-OC problem. Then, as our secondary contribution, we propose a stochastic variance-reduced gradient (SVRG) method called Stiefel-SVRG for solving the QP-OC problem and present a novel Łojasiewicz inequality-based linear convergence analysis of the method. An important feature of Stiefel-SVRG is that it allows for general retractions and does not require the computation of any vector transport on the Stiefel manifold. As such, it is computationally more advantageous than other recently proposed SVRG-type algorithms for manifold optimization.
Notes
Such an assumption is omitted in the original text of [3] but is needed for the result in [3, Section 4.8.2] to hold. The omission is corrected in the online errata at https://sites.uclouvain.be/absil/amsbook/errata.html.
That is, there exist constants \(r_0>0\), \(r_1\in (0,1)\) and index \(K\ge 0\) such that \(\Vert X^k-X^*\Vert _F \le r_0r_1^k\) for all \(k\ge K\).
Stiefel-SVRG was first presented by the second author at the 13th Chinese Workshop on Machine Learning and Applications held in Nanjing, China in 2015 [45]. As such, it predates the SVRG methods for manifold optimization developed in [37, 58]. More importantly, Stiefel-SVRG does not require the computation of any vector transport, which makes it computationally more advantageous than the SVRG methods proposed in [37, 58].
References
Abrudan, T.E., Eriksson, J., Koivunen, V.: Steepest descent algorithms for optimization under unitary matrix constraint. IEEE Trans. Signal Process. 56(3), 1134–1147 (2008)
Absil, P.-A., Mahony, R., Andrews, B.: Convergence of the iterates of descent methods for analytic cost functions. SIAM J. Optim. 16(2), 531–547 (2005)
Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2008)
Absil, P.-A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22(1), 135–158 (2012)
Agarwal, A., Anandkumar, A., Jain, P., Netrapalli, P.: Learning sparsely used overcomplete dictionaries via alternating minimization. SIAM J. Optim. 26(4), 2775–2799 (2016)
Bolla, M., Michaletzky, G., Tusnády, G., Ziermann, M.: Extrema of sums of heterogeneous quadratic forms. Linear Algebra Appl. 269(1–3), 331–365 (1998)
Bolte, J., Daniilidis, A., Ley, O., Mazet, L.: Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity. Trans. Am. Math. Soc. 362(6), 3319–3363 (2010)
Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. Ser. A 165(2), 471–507 (2017)
Bonnabel, S.: Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Autom. Control 58(9), 2217–2229 (2013)
Candès, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inf. Theory 61(4), 1985–2007 (2015)
Chang, X.-W., Paige, C.C., Stewart, G.W.: Perturbation analyses for the QR factorization. SIAM J. Matrix Anal. Appl. 18(3), 775–791 (1997)
Dieci, L., Eirola, T.: On smooth decompositions of matrices. SIAM J. Matrix Anal. Appl. 20(3), 800–819 (1999)
Feehan, P.M.N.: Global existence and convergence of solutions to gradient systems and applications to Yang–Mills gradient flow. Monograph. arxiv.org/abs/1409.1525 (2014)
Forti, M., Nistri, P., Quincampoix, M.: Convergence of neural networks for programming problems via a nonsmooth Łojasiewicz inequality. IEEE Trans. Neural Netw. 17(6), 1471–1486 (2006)
Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore, MD (1996)
Hardt, M.: Understanding alternating minimization for matrix completion. In: Proceedings of the 55th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2014), pp. 651–660 (2014)
Hou, K., Zhou, Z., So, A. M.-C., Luo, Z.-Q.: On the linear convergence of the proximal gradient method for trace norm regularization. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26: Proceedings of the 2013 Conference, pp. 710–718 (2013)
Jain, P., Oh, S.: Provable tensor factorization with missing data. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27: Proceedings of the 2014 Conference, pp. 1431–1439 (2014)
Jiang, B., Dai, Y.-H.: A framework of constraint preserving update schemes for optimization on Stiefel manifold. Math. Program. Ser. A 153(2), 535–575 (2015)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26: Proceedings of the 2013 Conference, pp. 315–323 (2013)
Kaneko, T., Fiori, S., Tanaka, T.: Empirical arithmetic averaging over the compact Stiefel manifold. IEEE Trans. Signal Process. 61(4), 883–894 (2013)
Kokiopoulou, E., Chen, J., Saad, Y.: Trace optimization and eigenproblems in dimension reduction methods. Numer. Linear Algebra Appl. 18(3), 565–602 (2011)
Li, G., Mordukhovich, B.S., Phạm, T.S.: New fractional error bounds for polynomial systems with applications to Hölderian stability in optimization and spectral theory of tensors. Math. Program. Ser. A 153(2), 333–362 (2015)
Li, G., Pong, T.K.: Calculus of the exponent of Kurdyka–Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. (2017)
Liu, H., Wu, W., So, A.M.-C.: Quadratic optimization with orthogonality constraints: explicit Łojasiewicz exponent and linear convergence of line-search methods. In: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp. 1158–1167 (2016)
Liu, H., Yue, M.-C., So, A.M.-C.: On the estimation performance and convergence rate of the generalized power method for phase synchronization. SIAM J. Optim. 27(4), 2426–2446 (2017)
Luo, Z.-Q.: New error bounds and their applications to convergence analysis of iterative algorithms. Math. Program. Ser. B 88(2), 341–355 (2000)
Luo, Z.-Q., Pang, J.-S.: Error bounds for analytic systems and their applications. Math. Program. 67(1), 1–28 (1994)
Luo, Z.-Q., Sturm, J.F.: Error bounds for quadratic systems. In: Frenk, H., Roos, K., Terlaky, T., Zhang, S. (eds.) High Performance Optimization, Volume 33 of Applied Optimization, pp. 383–404. Springer, Dordrecht (2000)
Luo, Z.-Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46(1), 157–178 (1993)
Manton, J.H.: Optimization algorithms exploiting unitary constraints. IEEE Trans. Signal Process. 50(3), 635–650 (2002)
Merlet, B., Nguyen, T.N.: Convergence to equilibrium for discretizations of gradient-like flows on Riemannian manifolds. Differ. Integral Equ. 26(5–6), 571–602 (2013)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Boston (2004)
Netrapalli, P., Jain, P., Sanghavi, S.: Phase retrieval using alternating minimization. IEEE Trans. Signal Process. 63(18), 4814–4826 (2015)
Saad, Y.: Numerical Methods for Large Eigenvalue Problems, revised edn. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, Philadelphia (2011)
Sato, H., Iwai, T.: A Riemannian optimization approach to the matrix singular value decomposition. SIAM J. Optim. 23(1), 188–212 (2013)
Sato, H., Kasai, H., Mishra, B.: Riemannian stochastic variance reduced gradient. Manuscript, arxiv.org/abs/1702.05594 (2017)
Schneider, R., Uschmajew, A.: Convergence results for projected line-search methods on varieties of low-rank matrices via Łojasiewicz inequality. SIAM J. Optim. 25(1), 622–646 (2015)
Schönemann, P.H.: A generalized solution of the orthogonal Procrustes problem. Psychometrika 31(1), 1–10 (1966)
Schönemann, P.H.: On two-sided orthogonal Procrustes problems. Psychometrika 33(1), 19–33 (1968)
Shamir, O.: A stochastic PCA and SVD algorithm with an exponential convergence rate. In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 144–152 (2015)
Shamir, O.: Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp. 248–256 (2016)
Smith, S.T.: Optimization techniques on Riemannian manifolds. In: Bloch, A. (ed.) Hamiltonian and Gradient Flows, Algorithms and Control. Fields Institute Communications, pp. 113–136. American Mathematical Society, Providence (1994)
So, A.M.-C.: Moment inequalities for sums of random matrices and their applications in optimization. Math. Program. Ser. A 130(1), 125–151 (2011)
So, A.M.-C.: Pinning down the Łojasiewicz exponent: towards understanding the convergence behavior of first-order methods for structured non-convex optimization problems. Slides. http://lamda.nju.edu.cn/conf/mla15/files/suwz.pdf (2015)
So, A.M.-C., Zhou, Z.: Non-asymptotic convergence analysis of inexact gradient methods for machine learning without strong convexity. Optim. Methods Softw. 32(4), 963–992 (2017)
Sun, J.: On perturbation bounds for the QR factorization. Linear Algebra Appl. 215, 95–111 (1995)
Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. Found. Comput. Math. (2017)
Sun, J., Qu, Q., Wright, J.: Complete dictionary recovery over the sphere I: overview and the geometric picture. IEEE Trans. Inf. Theory 63(2), 853–884 (2017)
Sun, J., Qu, Q., Wright, J.: Complete dictionary recovery over the sphere II: recovery by Riemannian trust-region method. IEEE Trans. Inf. Theory 63(2), 885–914 (2017)
Sun, R., Luo, Z.-Q.: Guaranteed matrix completion via non-convex factorization. IEEE Trans. Inf. Theory 62(11), 6535–6579 (2016)
Sun, W.W., Lu, J., Liu, H., Cheng, G.: Provable sparse tensor decomposition. J. R. Stat. Soc. B 79(3), 899–916 (2017)
Udrişte, C.: Convex Functions and Optimization Methods on Riemannian Manifolds, Volume 297 of Mathematics and Its Applications. Springer, Dordrecht (1994)
Uschmajew, A.: A new convergence proof for the higher-order power method and generalizations. Pac. J. Optim. 11(2), 309–321 (2015)
Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. Ser. A 142(1–2), 397–434 (2013)
Yang, Y.: Globally convergent optimization algorithms on Riemannian manifolds: uniform framework for unconstrained and constrained optimization. J. Optim. Theory Appl. 132(2), 245–265 (2007)
Yger, F., Berar, M., Gasso, G., Rakotomamonjy, A.: Adaptive canonical correlation analysis based on matrix manifolds. In: Proceedings of the 29th International Conference on Machine Learning (ICML 2012), pp. 1071–1078 (2012)
Zhang, H., Reddi, S. J., Sra, S.: Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29: Proceedings of the 2016 Conference, pp. 4592–4600 (2016)
Zhang, H., Sra, S.: First-order methods for geodesically convex optimization. In: Feldman, V., Rakhlin, A., Shamir, O. (eds.) Proceedings of the 29th Annual Conference on Learning Theory (COLT 2016), Volume 49 of Proceedings of Machine Learning Research, pp. 1617–1638 (2016)
Zheng, Q., Lafferty, J.: A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28: Proceedings of the 2015 Conference, pp. 109–117 (2015)
Zhong, Y., Boumal, N.: Near-optimal bounds for phase synchronization. SIAM J. Optim. 28(2), 989–1016 (2018)
Zhou, Z., So, A.M.-C.: A unified approach to error bounds for structured convex optimization problems. Math. Program. Ser. A 165(2), 689–728 (2017)
Zhou, Z., Zhang, Q., So, A.M.-C.: \(\ell _{1,p}\)-norm regularization: error bounds and convergence rate analysis of first-order methods. In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 1501–1510 (2015)
Acknowledgements
We thank the associate editor for coordinating the review of our manuscript and the anonymous reviewer for his/her detailed comments.
Additional information
A preliminary version of this work has appeared in the Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), 2016 [25]. This research is supported in part by the Hong Kong Research Grants Council (RGC) General Research Fund (GRF) Projects CUHK 14205314, CUHK 14206814, and CUHK 14208117.
Appendix
Proof of Proposition 4
Observe that given any \(X\in \mathcal {X}_{h,\Pi }\), we can write
Thus, if \(X \in \mathcal {X}_{h,\Pi } \cap \mathcal {X}_{h',\Pi '}\), then
for any \(P_i \in \mathcal {O}^{s_i-s_{i-1}}\) (\(i=1,\ldots ,n_A\)) and \(Q_j \in \mathcal {O}^{t_j-t_{j-1}}\) (\(j=1,\ldots ,n_B\)). This implies that \(\mathcal {X}_{h,\Pi } = \mathcal {X}_{h',\Pi '}\).
Now, suppose that \(\mathcal {X}_{h,\Pi } \cap \mathcal {X}_{h',\Pi '}=\emptyset \). Let \(X\in \mathcal {X}_{h,\Pi }\) and \(X'\in \mathcal {X}_{h',\Pi '}\) be arbitrary. Then, there exist \(P_i \in \mathcal {O}^{s_i-s_{i-1}}\) (\(i=1,\ldots ,n_A\)) and \(Q_j \in \mathcal {O}^{t_j-t_{j-1}}\) (\(j=1,\ldots ,n_B\)) such that
Consider the following block decomposition of \(E(h)\Pi \) (and similarly for \(E(h')\Pi '\)):
where \(E_{i,j}(h,\Pi )\in \mathbb R^{(s_i-s_{i-1})\times (t_j-t_{j-1})}\) for \(i=1,\ldots ,n_A\) and \(j=1,\ldots ,n_B\). Let \(|E_{i,j}(h,\Pi )|\) be the number of ones in \(E_{i,j}(h,\Pi )\). We then have two cases:
Case 1. There exist \(i\in \{1,\ldots ,n_A\}\) and \(j\in \{1,\ldots ,n_B\}\) such that \(|E_{i,j}(h,\Pi )| \not = |E_{i,j}(h',\Pi ')|\).
It can be seen from (11) that for any \(u\in \mathcal {H}\), every column of E(u) has exactly one 1. Hence, for any \(u\in \mathcal {H}\) and \(\Phi \in \mathcal {P}^n\), every column of \(E(u)\Phi \) also has exactly one 1. In particular, we have
which implies that \(|E_{i',j}(h,\Pi )| \not = |E_{i',j}(h',\Pi ')|\) for some \(i'\in \{1,\ldots ,n_A\}{\setminus }\{i\}\). Now, we compute
Both terms in (49) are instances of the two-sided orthogonal Procrustes problem and admit the following characterization [40]:
Here, \(K=\min \{s_i-s_{i-1},t_j-t_{j-1}\}\), \(K'=\min \{s_{i'}-s_{i'-1},t_j-t_{j-1}\}\), and \(\sigma _k(Y)\) is the kth largest singular value of Y. Observe that for any \(\alpha \in \{1,\ldots ,n_A\}\), \(\beta \in \{1,\ldots ,n_B\}\), \(u\in \mathcal {H}\), and \(\Phi \in \mathcal {P}^n\), every non-zero row and every non-zero column of \(E_{\alpha ,\beta }(u,\Phi )\) has exactly one 1. It follows that the singular values of \(E_{\alpha ,\beta }(u,\Phi )\) are either 0 or 1, and there are \(|E_{\alpha ,\beta }(u,\Phi )|\) of the latter. Since \(|E_{i,j}(h,\Pi )| \not = |E_{i,j}(h',\Pi ')|\) and \(|E_{i',j}(h,\Pi )| \not = |E_{i',j}(h',\Pi ')|\), we conclude from (49) that \(\Vert X-X'\Vert _F^2 \ge 2\).
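For the reader's convenience, we recall the classical two-sided orthogonal Procrustes identity that underlies the characterization invoked from [40]; this is a restatement in the form we believe is being used (the paper's own display gives the version specialized to the blocks \(E_{i,j}\)):

```latex
\min_{U \in \mathcal{O}^{a},\, V \in \mathcal{O}^{b}}
  \bigl\| U^{T} Y V - Y' \bigr\|_F^2
  \;=\; \sum_{k=1}^{\min\{a,b\}} \bigl( \sigma_k(Y) - \sigma_k(Y') \bigr)^2 ,
  \qquad Y,\, Y' \in \mathbb{R}^{a \times b},
```

where \(\sigma_k(\cdot)\) denotes the kth largest singular value. Since the singular values of the blocks \(E_{i,j}(\cdot,\cdot)\) are all 0 or 1, differing counts of ones in a pair of blocks force a contribution of at least 1 from each of the two terms in (49).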
Case 2. \(|E_{i,j}(h,\Pi )| = |E_{i,j}(h',\Pi ')|\) for \(i=1,\ldots ,n_A\) and \(j=1,\ldots ,n_B\).
We show that \(X=X'\) in this case, which would then contradict the assumption that \(\mathcal {X}_{h,\Pi } \cap \mathcal {X}_{h',\Pi '}=\emptyset \). To begin, let \(i\in \{1,\ldots ,n_A\}\) be arbitrary and consider the ith block row of \(E(h)\Pi \) and \(E(h')\Pi '\); i.e.,
By (11), every non-zero row of \(\mathrm{BlkRow}_i(E(h)\Pi )\) and \(\mathrm{BlkRow}_i \left( E(h')\Pi ' \right) \) has exactly one 1. Moreover, we have \(|E_{i,j}(h,\Pi )| = |E_{i,j}(h',\Pi ')|\) for \(j=1,\ldots ,n_B\) by assumption. Hence, we can find permutation matrices \(\Phi _{i,1},\Phi _{i,2},\ldots ,\Phi _{i,n_B}\in \mathcal {P}^{s_i-s_{i-1}}\) such that for \(j=1,\ldots ,n_B\),
-
(i)
the indices of the rows of \(\Phi _{i,j} \left( \Phi _{i,j-1}\Phi _{i,j-2}\cdots \Phi _{i,1}E_{i,j}(h,\Pi ) \right) \) that contain a 1 are the same as those of \(E_{i,j}(h',\Pi ')\) that contain a 1 (i.e., the kth row of \(\Phi _{i,j} \left( \Phi _{i,j-1}\Phi _{i,j-2}\cdots \Phi _{i,1}E_{i,j}(h,\Pi ) \right) \) contains a 1 if and only if the kth row of \(E_{i,j}(h',\Pi ')\) contains a 1, where \(k\in \{1,\ldots ,s_i-s_{i-1}\}\));
-
(ii)
the indices of the rows of \(\left( \Phi _{i,j-1}\Phi _{i,j-2}\cdots \Phi _{i,1}\right) \left[ E_{i,1}(h,\Pi ) \, \cdots \, E_{i,j-1}(h,\Pi ) \right] \) that contain a 1 are fixed by \(\Phi _{i,j}\) (i.e., if the kth row of \(\left( \Phi _{i,j-1}\Phi _{i,j-2}\cdots \Phi _{i,1}\right) \left[ E_{i,1}(h,\Pi ) \, \cdots \, E_{i,j-1}(h,\Pi ) \right] \) contains a 1, then \(\Phi _{i,j}e_k=e_k\), where \(e_k\) is the kth standard basis vector of \(\mathbb R^{s_i-s_{i-1}}\) and \(k\in \{1,\ldots ,s_i-s_{i-1}\}\)).
Upon letting \(\Phi _i = \Phi _{i,n_B}\Phi _{i,n_B-1}\cdots \Phi _{i,1} \in \mathcal {P}^{s_i-s_{i-1}}\) and using properties (i) and (ii) above, we see that the indices of the rows of \(\Phi _iE_{i,j}(h,\Pi )\) that contain a 1 are the same as those of \(E_{i,j}(h',\Pi ')\) that contain a 1 for \(j=1,\ldots ,n_B\).
Next, let \(j\in \{1,\ldots ,n_B\}\) be arbitrary and consider the jth block column of \(\mathrm{BlkDiag}(\Phi _1,\ldots ,\Phi _{n_A})\cdot E(h) \cdot \Pi \) and \(E(h')\Pi '\); i.e.,
By (11), each column of \(\mathrm{BlkCol}_j \left( \mathrm{BlkDiag}(\Phi _1,\ldots ,\Phi _{n_A}) \cdot E(h) \cdot \Pi \right) \) and \(\mathrm{BlkCol}_j \left( E(h')\Pi ' \right) \) has exactly one 1. Since \(|E_{i,j}(h,\Pi )| = |E_{i,j}(h',\Pi ')|\) for \(i=1,\ldots ,n_A\) by assumption, we have \(|\Phi _iE_{i,j}(h,\Pi )| = |E_{i,j}(h',\Pi ')|\). Moreover, by the definition of \(\Phi _1,\ldots ,\Phi _{n_A}\), the indices of the rows of \(\Phi _iE_{i,j}(h,\Pi )\) that contain a 1 are the same as those of \(E_{i,j}(h',\Pi ')\) that contain a 1. Thus, there exists a permutation matrix \(\Psi _j\in \mathcal {P}^{t_j-t_{j-1}}\) such that
In particular, we obtain
Since a permutation matrix is also an orthogonal matrix, we conclude from (48) that \(\Vert X-X'\Vert _F^2=0\), or equivalently, \(X=X'\), as desired.
Proof of Proposition 5
Using (17) and (18), it can be verified that
Since, up to a permutation of the rows, \(\bar{E}_j\) takes the form (19), in order to obtain the desired bound on \(\mathrm{dist}^2(X,\mathcal {X}_{h,\Pi })\) it remains to prove the following:
Lemma 1
Let \(S = \begin{bmatrix} S_1\\S_2 \end{bmatrix} \in \mathrm{St}(p,q)\) be given, with \(S_1 \in \mathbb R^{q\times q}\) and \(S_2 \in \mathbb R^{(p-q)\times q}\). Consider the following problem:
Suppose that \(v^*<1\). Then, we have \(\Vert S_2\Vert _F^2 \le v^* \le 2\Vert S_2\Vert _F^2\).
Proof
Since
it suffices to consider the problem
Problem (50) is an instance of the orthogonal Procrustes problem, whose optimal solution is given by \(X^*=UV^T\), where \(S_1=U\Sigma V^T\) is the singular value decomposition of \(S_1\) [39]. It follows that
Now, since \(S\in \mathrm{St}(p,q)\), we have \(S^TS = S_1^TS_1 + S_2^TS_2 = I_q\), or equivalently,
This implies that \(\mathbf 0\preceq \Sigma \preceq I_q\) and
It follows that
This, together with the fact that \(\Vert S_2\Vert _F^2 \le v^* < 1\), yields the desired result. \(\square \)
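The closed-form Procrustes solution invoked in this proof is easy to check numerically. The following sketch (NumPy assumed; the matrix \(S_1\) is synthetic) verifies that the SVD-based solution \(X^*=UV^T\) of [39] is at least as close to \(S_1\) in Frobenius norm as any other orthogonal candidate:

```python
import numpy as np

rng = np.random.default_rng(0)

def procrustes_opt(S1):
    """Closest orthogonal matrix to S1 in Frobenius norm: X* = U V^T,
    where S1 = U Sigma V^T is an SVD (Schonemann's solution [39])."""
    U, _, Vt = np.linalg.svd(S1)
    return U @ Vt

def rand_orth(q):
    """Random orthogonal matrix via QR of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((q, q)))
    return Q * np.sign(np.diag(R))  # column sign fix

q = 4
S1 = rng.standard_normal((q, q)) * 0.3
Xstar = procrustes_opt(S1)
best = np.linalg.norm(S1 - Xstar)
# X* should beat (or tie) every other orthogonal candidate we try
worst_gap = min(np.linalg.norm(S1 - rand_orth(q)) - best for _ in range(200))
print(worst_gap >= -1e-10)
```

The sign fix in `rand_orth` only serves to vary the candidates; optimality of \(UV^T\) is what the comparison exercises.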
Proof of Proposition 6
Recall that
Upon observing that \(AP^*=P^*A\), \(BQ^*=Q^*B\) and using (9), (18), we compute
Now, observe that the columns of \(\bar{X}\) are orthonormal and span an n-dimensional subspace \(\mathcal {L}\). In particular, for \(j=1,\ldots ,n_B\), each column of \(A\bar{X}_j\) can be decomposed as \(u+v\), where u is a linear combination of the columns of \(\bar{X}\) and \(v\in \mathcal {L}^\perp \), the orthogonal complement of \(\mathcal {L}\). In view of the structure of \(\bar{X}\) in (18), this leads to
where \(T_j \in \mathbb R^{m\times (t_j-t_{j-1})}\) is formed by projecting the columns of \(A\bar{X}_j\) onto \(\mathcal {L}^\perp \). Hence,
where \(\lambda _B=\min \{\lambda _{B,g},\lambda _{B,s}\}\), \(\lambda _{B,g} = \min _{j\in \{1,\ldots ,n_B-1\}} (b_{t_j}-b_{t_{j+1}}) > 0\), and \(\lambda _{B,s} = \min _{j\in \{1,\ldots ,n_B\}} |b_{t_j}|>0\). By combining the above with (51), the proof is completed.
Proof of Proposition 7
Consider a fixed \(j\in \{1,\ldots ,n_B\}\). Let \(\Delta _k\) be the kth column of \(A\bar{X}_j-\bar{X}_j\bar{X}_j^TA\bar{X}_j\), where \(k=1,\ldots ,t_j-t_{j-1}\). Since
our goal is to establish a lower bound on \(\Vert \Delta _k\Vert _2^2\) for \(k=1,\ldots ,t_j-t_{j-1}\). Towards that end, let \(\bar{x}_k\) be the kth column of \(\bar{X}_j\) and \((\bar{x}_k)_\alpha \) be the \(\alpha \)th entry of \(\bar{x}_k\), where \(k=1,\ldots ,t_j-t_{j-1}\) and \(\alpha =1,\ldots ,m\). Then, we can write
Suppose that \(\mathrm{dist}(X,\mathcal {X}_{h,\Pi }) = \left\| \bar{X} - E(h) \Pi \right\| _F = \tau \) for some \(\tau \in (0,1)\). Using the representations of \(\bar{X}\) and \(E(h)\Pi \) in (18), we have
where \(\iota (k)\) is the coordinate of the kth column of \(\bar{E}_j(h)\) that equals 1. Now, by (52),
Let \(\mathrm{proj}_{\mathcal {I}_j}\) be the projector onto the coordinates in \(\mathcal {I}_j = \left\{ k \in \{1,\ldots ,m\} : \left[ \bar{E}_j(h) \right] _k = \mathbf 0\right\} \) (recall that \(\left[ \bar{E}_j(h) \right] _k\) is the kth row of \(\bar{E}_j(h)\)). Clearly, we have
where
Let \(\lambda _{A,m}=\max _{i\in \{1,\ldots ,n_A\}} |a_{s_i}|\) be the largest (in magnitude) eigenvalue of A. Using (53) and the fact that \(\iota (k)\not =\iota (\ell )\) whenever \(k\not =\ell \), we bound
and
This implies that \(|\nu _\ell | \le \lambda _{A,m}(m\tau ^2+2\tau )\) for \(\ell =1,\ldots ,t_j-t_{j-1}\). Moreover, since \(\bar{x}_1,\ldots ,\bar{x}_{t_j-t_{j-1}}\) are the columns of \(\bar{X}_j\), by Proposition 5, the definition of \(\mathcal {I}_j\), and the assumption that \(\mathrm{dist}(X,\mathcal {X}_{h,\Pi }) = \tau \), we have
It follows from (54) that
Next, we bound the first term on the right-hand side of the above inequality. Considering the structure of A in (8), let \(i'\in \{0,1,\ldots ,n_A-1\}\) be such that \(s_{i'}+1 \le \iota (k) \le s_{i'+1}\) and recall that \(\lambda _{A,g} = \min _{i\in \{1,\ldots ,n_A-1\}} (a_{s_i}-a_{s_{i+1}}) > 0\). Then, we have
To bound the term \(\left\| \mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k) \right\| _2^2\), we proceed as follows. Let \(\bar{Y} = XQ^*\Pi ^T \in \mathrm{St}(m,n)\). Then, we have \(\bar{X} = (P^*)^TXQ^* = (P^*)^T\bar{Y}\Pi \) and
We are now interested in locating the entries of \(\mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k)\) in the matrix \((P^*)^T\bar{Y}\). Towards that end, recall that \(P^*=\mathrm{BlkDiag}\left( P_1^*,\ldots ,P_{n_A}^* \right) \) and consider the decomposition
where \(P_i^* \in \mathcal {O}^{s_i-s_{i-1}}\) and \(\bar{Y}_{i,i} \in \mathbb R^{(s_i-s_{i-1})\times h_i}\), for \(i=1,\ldots ,n_A\). Since \(\iota (k)\) is the coordinate of the kth column of \(\bar{E}_j(h)\) that equals 1 and \(s_{i'}+1 \le \iota (k) \le s_{i'+1}\), we see from (10) and (11) that the kth column of \(\bar{E}_j(h)\) belongs to
As \(\bar{x}_k\) is the kth column of \(\bar{X}_j\) and \(\left\| \bar{X} - E(h)\Pi \right\| _F^2 = \sum _{j=1}^{n_B} \Vert \bar{X}_j - \bar{E}_j(h)\Vert _F^2\), it follows that all the entries of \(\mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k)\) lie in \((P_{i'+1}^*)^T\bar{Y}_{i'+1,i'+1}\). Furthermore, by (58) and the definition of \(\mathcal {I}_j\), the entries of \(\mathrm{proj}_{\mathcal {I}_j \cap \{s_{i'}+1,\ldots ,s_{i'+1}\}} (\bar{x}_k)\) do not intersect the diagonal of the top \(h_{i'+1}\times h_{i'+1}\) block of \((P_{i'+1}^*)^T\bar{Y}_{i'+1,i'+1}\). Consequently, we have
To obtain an upper bound on the right-hand side of (59), we need the following lemma:
Lemma 2
Consider the decomposition of \((P^*)^T\bar{Y}\) in (57). For \(i=1,\ldots ,n_A\), let
Suppose that \(v_i^*<1\). Then, we have
Let us defer the proof of Lemma 2 to the end of this section. Now, observe that by (11) and (17),
Since \(\mathrm{dist}(X,\mathcal {X}_{h,\Pi }) = \tau \) for some \(\tau \in (0,1)\), we have \(\sum _{1\le i\not =j\le n_A}\Vert \bar{Y}_{i,j}\Vert _F^2 \le \tau ^2\) from (61). Hence, by Lemma 2 and (59), we have
and
This, together with (55), (56) and the fact that the implications
hold for any \(a,b,c\in \mathbb R\), yields
It follows that
(recall that \(\left[ \bar{X}_j\right] _k\) is the kth row of \(\bar{X}_j\)). Upon summing both sides of the above inequality over \(j=1,\ldots ,n_B\) and using Proposition 5 and the assumption that \(\mathrm{dist}(X,\mathcal {X}_{h,\Pi })=\tau \), we obtain
whenever \(\tau \in (0,1)\) satisfies
To complete the proof, it remains to prove Lemma 2.
Proof of Lemma 2
Consider a fixed \(i\in \{1,\ldots ,n_A\}\). Note that Problem (60) is again an instance of the orthogonal Procrustes problem. Hence, by the result in [39], an optimal solution to Problem (60) is given by
where \(\bar{Y}_{i,i}=H_i \begin{bmatrix} \Sigma _i \\ \mathbf 0\end{bmatrix} W_i^T\) is a singular value decomposition of \(\bar{Y}_{i,i}\). It follows from (60) that
Now, since \(\bar{Y}\in \mathrm{St}(m,n)\), we have
or equivalently,
By following the arguments in the proof of Lemma 1, we conclude that
as desired. \(\square \)
Second-order boundedness of some retractions on \(\mathrm{St}(m,n)\)
1.1 Second-order boundedness of \(R_\mathsf{polar}\)
Let \(X\in \mathrm{St}(m,n)\) and \(\xi \in T(X)\) be arbitrary. By definition, we have
Let \(\xi ^T\xi =U\Sigma U^T\) be a spectral decomposition of \(\xi ^T\xi \) with \(\Sigma =\mathrm{Diag}(\lambda _1,\ldots ,\lambda _n)\) and \(\lambda _1,\ldots ,\lambda _n\ge 0\). Then, a simple calculation yields
Since \(\Vert X+\xi \Vert \le \Vert X\Vert +\Vert \xi \Vert \le 1+\Vert \xi \Vert _F\), we conclude that whenever \(\Vert \xi \Vert _F \le 1\),
i.e., \(R_\mathsf{polar}\) satisfies Property (P) with \(\phi =M=1\).
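As a sanity check, the polar retraction and the bound just derived can be verified numerically on a synthetic instance (NumPy assumed; the point X and tangent vector xi below are randomly generated):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 3

# A random point on St(m, n) and a random tangent vector at it:
# xi is tangent iff X^T xi + xi^T X = 0, enforced here by projection.
X, _ = np.linalg.qr(rng.standard_normal((m, n)))
Z = rng.standard_normal((m, n))
xi = Z - X @ (X.T @ Z + Z.T @ X) / 2
xi *= 0.5 / np.linalg.norm(xi)            # scale so ||xi||_F = 0.5 <= 1

def R_polar(X, xi):
    """Polar retraction: R(X, xi) = (X + xi)(I + xi^T xi)^(-1/2),
    with the inverse square root computed via an eigendecomposition."""
    lam, U = np.linalg.eigh(xi.T @ xi)
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(1.0 + lam)) @ U.T
    return (X + xi) @ inv_sqrt

Y = R_polar(X, xi)
print(np.allclose(Y.T @ Y, np.eye(n)))                          # feasibility
# Property (P) with phi = M = 1: second-order agreement with X + xi
print(np.linalg.norm(Y - (X + xi)) <= np.linalg.norm(xi) ** 2)
```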
1.2 Second-order boundedness of \(R_\mathsf{QR}\)
Let \(X\in \mathrm{St}(m,n)\) and \(\xi \in T(X)\) be arbitrary. Suppose that \(\Vert \xi \Vert _F \le 1/2\). Then, for any \(t\in [-1,1]\), the matrix \(X(t)=X+t\xi \) has full column rank and hence admits a unique thin QR-decomposition \(X(t)=Q(t)R(t)\), where \(Q(t)\in \mathrm{St}(m,n)\) and \(R(t)\in \mathbb R^{n\times n}\) are both differentiable and R(t) is upper triangular with positive diagonal entries; see, e.g., [12]. Since the unique thin QR-decomposition of X is given by \(X=XI_n\), we have \(R(0)=I_n\). This, together with the fact that \(\Vert Q(t)\Vert \le 1\), implies
To bound \(\Vert R'(t)\Vert _F\), we adopt the so-called matrix equation approach in [11, 47]. Using the identity \(R(t)^TR(t)=X(t)^TX(t)\) and the fact that \(\xi \in T(X)\) implies \(X^T\xi +\xi ^TX=\mathbf 0\), we have
Differentiating both sides of (63) with respect to t yields
In particular, since R(t) is invertible, we have
Now, observe that \(R'(t)R(t)^{-1}\) is upper triangular. Thus, the above identity implies that
where for any \(C\in \mathbb R^{n\times n}\),
Let \(\lambda _1,\ldots ,\lambda _n \ge 0\) be the eigenvalues of \(\xi ^T\xi \). Using (63) and the fact that \(2 \cdot \Vert \mathrm{up}(C)\Vert _F^2 \le \Vert C\Vert _F^2\) for any \(C\in \mathcal {S}^n\), we bound
On the other hand, we have \(\Vert R(t)\Vert \le \sqrt{1+t^2\cdot \Vert \xi \Vert ^2} \le \sqrt{5}/2\) by (63) and the assumption that \(\Vert \xi \Vert _F \le 1/2\) and \(t\in [-1,1]\). It follows that
Upon substituting this into (62) and integrating, we obtain
i.e., \(R_\mathsf{QR}\) satisfies Property (P) with \(\phi =1/2\) and \(M=\sqrt{10}/4\).
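The QR retraction and the constants \(\phi=1/2\), \(M=\sqrt{10}/4\) obtained above can likewise be checked on a synthetic instance (NumPy assumed; X and the tangent vector xi are randomly generated):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 8, 3
X, _ = np.linalg.qr(rng.standard_normal((m, n)))
Z = rng.standard_normal((m, n))
xi = Z - X @ (X.T @ Z + Z.T @ X) / 2      # tangent vector at X
xi *= 0.3 / np.linalg.norm(xi)            # ||xi||_F = 0.3 <= 1/2

def R_qr(X, xi):
    """QR retraction: the Q factor of the thin QR decomposition of X + xi,
    with column signs fixed so that R has positive diagonal (uniqueness)."""
    Q, R = np.linalg.qr(X + xi)
    return Q * np.sign(np.diag(R))

Y = R_qr(X, xi)
print(np.allclose(Y.T @ Y, np.eye(n)))     # stays on St(m, n)
# Property (P) with phi = 1/2 and M = sqrt(10)/4
print(np.linalg.norm(Y - (X + xi))
      <= np.sqrt(10) / 4 * np.linalg.norm(xi) ** 2)
```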
1.3 Second-order boundedness of \(R_\mathsf{cayley}\)
Let \(X\in \mathrm{St}(m,n)\) and \(\xi \in T(X)\) be arbitrary. Suppose that \(\Vert \xi \Vert _F \le 1/2\). Then, we have \(\Vert W(\xi )\Vert _F \le 2\cdot \Vert \xi \Vert _F \le 1\). Hence, we may write
In particular, we have
Now, observe that
where the last equality follows from the fact that \(\xi \in T(X)\). Hence, we obtain
i.e., \(R_\mathsf{cayley}\) satisfies Property (P) with \(\phi =1/2\) and \(M=4\).
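The same kind of numerical check applies to the Cayley retraction. The sketch below (NumPy assumed) uses the Wen–Yin form \(W(\xi) = P\xi X^T - X\xi^T P\) with \(P = I - XX^T/2\); the paper defines \(W(\xi)\) earlier in the text, and we only assume it agrees with this common choice. Since this W is skew-symmetric, the Cayley factor \((I-W/2)^{-1}(I+W/2)\) is orthogonal, so the retracted point stays on \(\mathrm{St}(m,n)\):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 8, 3
X, _ = np.linalg.qr(rng.standard_normal((m, n)))
Z = rng.standard_normal((m, n))
xi = Z - X @ (X.T @ Z + Z.T @ X) / 2      # tangent vector at X
xi *= 0.3 / np.linalg.norm(xi)            # ||xi||_F = 0.3 <= 1/2

# Assumed Wen-Yin form of W(xi); skew-symmetric by construction.
P = np.eye(m) - X @ X.T / 2
W = P @ xi @ X.T - X @ xi.T @ P
Y = np.linalg.solve(np.eye(m) - W / 2, (np.eye(m) + W / 2) @ X)

print(np.allclose(W, -W.T))               # W is skew, so the Cayley
print(np.allclose(Y.T @ Y, np.eye(n)))    # factor is orthogonal
print(np.linalg.norm(W) <= 2 * np.linalg.norm(xi) + 1e-12)
# Property (P) with phi = 1/2 and M = 4
print(np.linalg.norm(Y - (X + xi)) <= 4 * np.linalg.norm(xi) ** 2)
```

For this choice one also has \(WX=\xi\) whenever \(\xi\) is tangent at X, which is why the retraction agrees with \(X+\xi\) to first order.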
Proof of Proposition 10
We first establish the inequality (44). Define \(\epsilon ^{k+1} = R(X^k,-\alpha \xi ^k)-(X^k-\alpha \xi ^k) = X^{k+1} - (X^k-\alpha \xi ^k)\) for \(k=0,1,\ldots ,\Gamma -1\). Then,
Now, let us bound the terms in (64) in turn. Using the fact that \(\xi ^k\) is the orthogonal projection of \(G^k\) onto \(T(X^k)\) and \(\nabla F_i(X)=2A_iXB\), \(\nabla F(X)=2AXB\), we have
By our choice of the step size \(\alpha \), we have \(\Vert \alpha \xi ^k\Vert _F \le \phi \le 1\). It follows from Property (P) and some simple calculation that
Moreover, it is clear that
Upon substituting (65)–(68) into (64) and simplifying, we obtain
with \(c_0=(M^2+4M+1)\cdot \Vert A\Vert \cdot \Vert B\Vert \), as desired.
Next, we establish the inequality (45). Since \(\xi ^0=\mathrm{grad}\,F(X^0)=\mathrm{proj}_{T(X^0)}(\nabla F(X^0))\), where \(\mathrm{proj}_{T(X)}\) is the projector onto T(X), by the idempotence of \(\mathrm{proj}_{T(X)}\) and the fact that \(\nabla F(X)=2AXB\), we have
Upon substituting this into (44) and noting that \(c_0 \le 1/(8\alpha )\), we obtain
Since \(X^0=\tilde{X}^s\) and \(F(\tilde{X}^{s+1}) \le F(X^1)\) by lines 2 and 9 of Algorithm 2, respectively, the above inequality is equivalent to (45).
The inequality (45) shows that the sequence \(\{F(\tilde{X}^s)\}_{s\ge 0}\) is monotonically decreasing, which, together with the fact that F is bounded below on \(\mathrm{St}(m,n)\), implies that \(F(\tilde{X}^s) \searrow F^*\) for some \(F^*\in \mathbb R\). By the continuity of F, we conclude that every limit point \(X^*\) of the sequence \(\{\tilde{X}^s\}_{s\ge 0}\) satisfies \(F(X^*)=F^*\) and \(\mathrm{grad}\,F(X^*)=\mathbf 0\). This completes the proof of Proposition 10.
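The epoch structure analyzed in this proof can be sketched as follows. This is an illustrative NumPy implementation of one epoch of Stiefel-SVRG (Algorithm 2) on synthetic data (the \(A_i\) symmetric, B diagonal, both hypothetical), using \(F(X)=\mathrm{tr}(X^TAXB)\) with \(\nabla F(X)=2AXB\) as in the text, the QR retraction, and no vector transport:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, N = 6, 2, 5

# Synthetic components: A_i symmetric, A their average, B diagonal.
As = []
for _ in range(N):
    G = rng.standard_normal((m, m))
    As.append(G + G.T)
A = sum(As) / N
B = np.diag(np.arange(n, 0, -1.0))

F = lambda X: np.trace(X.T @ A @ X @ B)
grad_full = lambda X: 2 * A @ X @ B          # nabla F(X) = 2AXB
grad_i = lambda i, X: 2 * As[i] @ X @ B      # nabla F_i(X) = 2A_iXB

def proj_tangent(X, G):
    """Orthogonal projection of G onto the tangent space T(X)."""
    return G - X @ (X.T @ G + G.T @ X) / 2

def R_qr(X, xi):
    """QR retraction (any retraction with Property (P) would do)."""
    Q, R = np.linalg.qr(X + xi)
    return Q * np.sign(np.diag(R))

def svrg_epoch(X0, alpha=1e-3, Gamma=50):
    """One epoch: anchor at X0, take Gamma variance-reduced steps."""
    mu = grad_full(X0)                       # full gradient at the anchor
    X = X0
    for _ in range(Gamma):
        i = rng.integers(N)
        Gk = grad_i(i, X) - grad_i(i, X0) + mu   # G^k: variance-reduced
        xi = proj_tangent(X, Gk)                 # xi^k: Riemannian estimate
        X = R_qr(X, -alpha * xi)                 # retraction step; note that
    return X                                     # no vector transport is used

X0 = np.linalg.qr(rng.standard_normal((m, n)))[0]
X = X0
for s in range(20):
    X = svrg_epoch(X)
print(np.allclose(X.T @ X, np.eye(n)))       # iterates stay feasible
```

The update of the anchor \(\tilde{X}^{s+1}\) at the end of each epoch (line 9 of Algorithm 2) is abstracted away here; any choice satisfying \(F(\tilde{X}^{s+1}) \le F(X^1)\) fits the analysis.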
Proof of Corollary 2
Let \(\mathscr {F}_k\) be the \(\sigma \)-algebra generated by \(X^0,\ldots ,X^k\) for \(k=0,1,\ldots ,\Gamma -1\). Since \({\mathbb E}[ G^k \mid \mathscr {F}_k ] = \nabla F(X^k)\), we have \({\mathbb E}[\xi ^k \mid \mathscr {F}_k ] = \mathrm{grad}\, F(X^k) = \mathrm{proj}_{T(X^k)}(\nabla F(X^k))\). Again, using the idempotence of \(\mathrm{proj}_{T(X)}\) and the fact that \(\nabla F(X)=2AXB\), we obtain
On the other hand, the non-expansiveness of \(\mathrm{proj}_{T(X)}\) yields
and hence
By the definition of \(G^k\) and the fact that \(\nabla F_i\) (resp. \(\nabla F\)) is Lipschitz continuous with parameter \(L_{F_i} \le 2\cdot \Vert A_i\Vert \cdot \Vert B\Vert \) for \(i=1,\ldots ,N\) (resp. \(L_F\le 2\cdot \Vert A\Vert \cdot \Vert B\Vert \)), we have
with \(c'=2\left( \max _{i\in \{1,\ldots ,N\}}\Vert A_i\Vert +\Vert A\Vert \right) \Vert B\Vert \). To bound \(\Vert X^k-X^0\Vert _F\), observe that
where (73) is due to the fact that \(\Vert \alpha \xi ^k\Vert _F \le \phi \le 1\) and (74) follows from (70). This yields
where \(c_1=c'(M+1)\). In particular, we have
which implies that
It follows from (71), (72), and (76) that
This, together with (44) and (69), yields the desired result.
Proof of Proposition 11
By Proposition 10, the global error bound for Problem (QP-OC) (Corollary 1), and the fact that \(\mathrm{grad}\,F(X)=D_{1/4}(X)\), we have
for all \(s\ge 0\). Since \(F(\tilde{X}^s) \searrow F^*\), the above inequality implies the existence of \(s_0\ge 0\) such that \(\mathrm{dist}(\tilde{X}^s,\mathcal {X}) \le \delta /3\) for all \(s\ge s_0\), where \(\delta \in (0,\sqrt{2}/2)\) is the constant given in Theorem 1. Now, consider a fixed \(s\ge s_0\) and let \(\hat{X}^s,\hat{X}^{s+1}\in \mathcal {X}\) be such that \(\mathrm{dist}(\tilde{X}^s,\mathcal {X}) = \Vert \tilde{X}^s-\hat{X}^s\Vert _F\) and \(\mathrm{dist}(\tilde{X}^{s+1},\mathcal {X}) = \Vert \tilde{X}^{s+1}-\hat{X}^{s+1}\Vert _F\). Suppose that \(\hat{X}^s\in \mathcal {X}_{h,\Pi }\) and \(\hat{X}^{s+1}\in \mathcal {X}_{h',\Pi '}\) with \(\mathcal {X}_{h,\Pi }\cap \mathcal {X}_{h',\Pi '}=\emptyset \). Then, we have \(\Vert \hat{X}^s-\hat{X}^{s+1}\Vert _F \ge \sqrt{2} \ge 2\delta \) by Proposition 4. On the other hand, using (75), the fact that \(\Vert \mathrm{grad}\,F(X)\Vert _F \le \Vert \nabla F(X)\Vert _F \le 2\cdot \Vert A\Vert _F\cdot \Vert B\Vert \) for all \(X\in \mathrm{St}(m,n)\), and our choice of the step size \(\alpha \), the sequence \(X^0=\tilde{X}^s,X^1,\ldots ,X^\Gamma \) generated by Algorithm 2 in epoch s satisfies
for \(k=0,1,\ldots ,\Gamma -1\). This implies that
which is a contradiction. Hence, we have \(\mathcal {X}_{h,\Pi }\cap \mathcal {X}_{h',\Pi '}\not =\emptyset \), which by Proposition 4 yields \(\mathcal {X}_{h,\Pi }=\mathcal {X}_{h',\Pi '}\). Consequently, we have \(\mathrm{dist}(\tilde{X}^s,\mathcal {X}) = \mathrm{dist}(\tilde{X}^s,\mathcal {X}_{h,\Pi }) \le \delta /3\) for all sufficiently large \(s\ge 0\). This, together with Proposition 10 and the fact that the function F is constant on \(\mathcal {X}_{h,\Pi }\), implies that every limit point of the sequence \(\{\tilde{X}^s\}_{s\ge 0}\) belongs to \(\mathcal {X}_{h,\Pi }\) and \(F(X)=F^*\) for all \(X\in \mathcal {X}_{h,\Pi }\). This completes the proof.
Cite this article
Liu, H., So, A.M.-C. & Wu, W. Quadratic optimization with orthogonality constraint: explicit Łojasiewicz exponent and linear convergence of retraction-based line-search and stochastic variance-reduced gradient methods. Math. Program. 178, 215–262 (2019). https://doi.org/10.1007/s10107-018-1285-1
Keywords
- Quadratic optimization with orthogonality constraints
- Łojasiewicz inequality
- Line-search methods
- Stochastic variance-reduced gradient method
- Linear convergence