1 Introduction

In this paper, we discuss some geometric properties of matrix exponentials \(e^A=\sum \nolimits _{k=0}^{\infty }\frac{1}{k!}A^k\) and their approximations related to the following optimization problem

$$\begin{aligned} \min f(X)~~\mathrm{s.t.}~~X\in \mathrm{St}(p,n), \end{aligned}$$
(1)

where the feasible set \(\mathrm{St}(p,n)=\left\{ \left. X\in \mathbb {R}^{n\times p}\right| X^\top X=I\right\} \) is the well-known Stiefel manifold. When \(p=n\), \(\mathrm{St}(p,n)\) is just the orthogonal group \(O_n\). When \(p=1\), \(\mathrm{St}(p,n)\) reduces to the unit sphere \(S^{n-1}\). Problem (1) pervades many areas in science and engineering. For various concrete examples of applications of this problem, the reader is referred to [6, 14, 20] and references therein.

Prevalent optimization algorithms on Riemannian manifolds are constructed on the basis of retraction [1, 2], a tool stemming from the Riemannian exponential [5]. Searching along a geodesic is generally computationally expensive, but retraction provides a practical alternative while still guaranteeing convergence. In fact, a retraction can be viewed as a first-order approximation to the Riemannian exponential. Numerical linear algebra techniques furnish abundant ways of constructing retractions. For the Stiefel manifold, the most commonly used retractions are the Q-factor of the QR decomposition and the orthogonal part of the polar decomposition. A more recently recognized useful retraction is the Cayley transformation, proposed in [16] and developed in [20, 22].

Inspired by the idea that the Cayley transformation can be viewed as an approximate matrix exponential, we probe into the relation between matrix exponentials and geodesics or retractions, and proceed to construct new retractions from approximation methods for matrix exponentials. There are two motivations behind the new retractions. One is to complete the theory of optimization on the Stiefel manifold using a different tangent representation and exponential representation. The other is to make an attempt at tackling high-rank problems. By ‘high-rank’ we mean that p is very large and even close to n, i.e., \(p\lessapprox n\). To our knowledge, low-cost retractions exist only for low-rank problems in the existing literature [1, 6, 7, 14, 20, 22].

Outline.  The rest of this paper is organized as follows. Basic knowledge of geometry and optimization on the Stiefel manifold is introduced in Sect. 2. Properties of matrix exponentials and their approximations are discussed in Sects. 3 and 4, respectively. Conclusions are drawn in Sect. 5.

2 Basic geometry and optimization on the Stiefel manifold

It is easy to see that the tangent space \(T_{X}\mathrm{St}(p,n)\) of the Stiefel manifold \(\mathrm{St}(p,n)\) can be expressed as

$$\begin{aligned} T_{X}\mathrm{St}(p,n)=\left\{ \left. Z\in \mathbb {R}^{n\times p}\right| X^\top Z+Z^\top X=0\right\} . \end{aligned}$$

Indeed, this tangent space can also be represented as

$$\begin{aligned} T_{X}\mathrm{St}(p,n)=\left\{ \left. Z\in \mathbb {R}^{n\times p}\right| Z=AX,~A^\top =-A\right\} . \end{aligned}$$

On one hand, any Z of the form \(Z=AX\) with an n-by-n skew-symmetric matrix A satisfies \(X^\top Z+Z^\top X=0\). On the other hand, for any \(Z\in T_{X}\mathrm{St}(p,n)\), we can write \(Z=AX\) by defining

$$\begin{aligned} A:=P_XZX^\top -XZ^\top P_X, \end{aligned}$$
(2)

where \(P_X:=I-\frac{1}{2}XX^\top \). There are two commonly known metrics for \(T_{X}\mathrm{St}(p,n)\), the Euclidean metric \(\langle Z_1,Z_2\rangle _e: =\mathrm{tr} (Z_1^\top Z_2)\) and the canonical metric \(\langle Z_1,Z_2\rangle _c: =\mathrm{tr} (Z_1^\top P_X Z_2)\), \(\forall ~Z_1,Z_2\in T_{X}\mathrm{St}(p,n)\).
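The following minimal MATLAB sketch (ours, not from the original; the dimensions and variable names are arbitrary) illustrates the representation \(Z=AX\): it builds a random tangent vector at a random X, forms A by (2), and checks skew-symmetry, \(AX=Z\), and the two metrics.

% Minimal illustrative sketch: tangent representation Z = A*X with A from (2).
n = 8; p = 3;
X = orth(randn(n, p));                      % a point on St(p,n)
W = randn(n, p);
Z = W - X*(X'*W + W'*X)/2;                  % projection of W onto T_X St(p,n)

PX = eye(n) - X*X'/2;                       % P_X = I - (1/2) X X'
A  = PX*Z*X' - X*Z'*PX;                     % formula (2)

fprintf('||A + A''||   = %.2e\n', norm(A + A', 'fro'));   % skew-symmetry
fprintf('||A*X - Z||  = %.2e\n', norm(A*X - Z, 'fro'));   % Z = A X
fprintf('<Z,Z>_e = %.4f, <Z,Z>_c = %.4f\n', trace(Z'*Z), trace(Z'*PX*Z));

The identity \(AX=Z\) relies on \(X^\top Z+Z^\top X=0\), which the projection above enforces.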

According to equations (2.7) and (2.41) in [6], geodesics on the Stiefel manifold \(\text {St}(p,n)\) are characterized as

$$\begin{aligned} Y''(t)+Y(t)Y'(t)^\top Y'(t)=0 \end{aligned}$$
(3)

for the Euclidean metric, and

$$\begin{aligned} Y''(t)+Y'(t)Y'(t)^\top Y(t)+Y(t)\left( (Y(t)^\top Y'(t))^2+Y'(t)^\top Y'(t)\right) =0 \end{aligned}$$
(4)

for the canonical metric.

Optimization schemes on Riemannian manifolds have the following form

$$\begin{aligned} x_{k+1}= R_{x_k}(t_k \xi _k), \end{aligned}$$
(5)

where the key constituent is the retraction R which (smoothly) maps tangent vectors \(\xi \in T_{x}{\mathcal {M}}\) to points on the manifold \({\mathcal {M}}\). A formal definition of retraction is given as follows.

Definition 1

[1]  A retraction on a manifold \({\mathcal {M}}\) is a smooth mapping R from the tangent bundle \(T{\mathcal {M}}\) onto \({\mathcal {M}}\) with the following properties. Let \(R_x\) denote the restriction of R to \(\{x\}\times T_x {\mathcal {M}}\).

1. \(R_x(0)=x\), where 0 denotes the zero element of \(T_x{\mathcal {M}}\).

2. With the canonical identification \(T_{0}T_x{\mathcal {M}} \simeq T_x{\mathcal {M}}\), \(R_x\) satisfies

$$\begin{aligned} \text {D}R_x(0)=\mathrm{id}_{T_x{\mathcal {M}}}, \end{aligned}$$

where \(\text {D}R_x(0)\) denotes the differential of \(R_x\) at 0 and \(\mathrm{id}_{T_x{\mathcal {M}}}\) the identity mapping on \(T_x {\mathcal {M}}\).

Definition 1 tells us that the curve \(\gamma _{\xi }: t\mapsto R_x(t\xi )\) satisfies \(\gamma _{\xi }(0)=x\) and \(\gamma '_{\xi }(0)=\xi \), and is thus a first-order approximation to the geodesic starting at x with initial velocity \(\xi \).

Now we introduce three commonly used retractions on the Stiefel manifold. The QR decomposition retraction is given by \(R_X(Z)=\mathrm{qf}(X+Z)\), where qf denotes the Q-factor of the QR decomposition with its R-factor having strictly positive diagonal elements. The polar decomposition retraction is given by \(R_X(Z)=(X+Z)(I+Z^\top Z)^{-1/2}\). The Cayley transformation retraction is given by

$$\begin{aligned} R_X(Z)=\left( I-\frac{1}{2}A\right) ^{-1}\left( I+\frac{1}{2}A\right) X, \end{aligned}$$
(6)

where A is defined by (2). It is shown in [22] that retraction (6) can be used to derive a nonexpansive vector transport and an isometric vector transport, which are important tools in Riemannian conjugate gradient methods [17,18,19, 22] and Riemannian quasi-Newton methods [11,12,13].
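For concreteness, the following MATLAB sketch (illustrative only; the dimensions are arbitrary and the variable names are ours) evaluates the three retractions just described at a random tangent vector and verifies that each output lies on \(\mathrm{St}(p,n)\).

% Minimal illustrative sketch: QR, polar and Cayley retractions on St(p,n).
n = 8; p = 3;
X = orth(randn(n, p));
W = randn(n, p);  Z = W - X*(X'*W + W'*X)/2;   % tangent vector at X

% QR retraction: Q-factor with positive diagonal of R
[Q, R] = qr(X + Z, 0);
Rqr = Q * diag(sign(diag(R)));

% Polar retraction
Rpolar = (X + Z) / sqrtm(eye(p) + Z'*Z);

% Cayley retraction (6), with A from (2)
PX = eye(n) - X*X'/2;
A  = PX*Z*X' - X*Z'*PX;
Rcayley = (eye(n) - A/2) \ ((eye(n) + A/2) * X);

% all three outputs are feasible points on St(p,n)
disp([norm(Rqr'*Rqr - eye(p)), norm(Rpolar'*Rpolar - eye(p)), ...
      norm(Rcayley'*Rcayley - eye(p))]);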

3 Properties of matrix exponentials

Consider the matrix exponential mapping

$$\begin{aligned} e_A: T\mathrm{St}(p,n) \rightarrow \mathrm{St}(p,n): (X,Z)\mapsto e^AX, \end{aligned}$$
(7)

where A is defined by (2), and its corresponding curve \(Y(t)=e^{tA}X\). Do not confuse (7) with the Riemannian exponential because they are not always the same. The following theorem is the main result of this section.

Theorem 1

If A is an n-by-n skew-symmetric matrix, then the following statements are true.

(i) Under the Euclidean metric, \(Y(t)=e^{tA}X\) is a geodesic on \(O_n\) and a curve with constant length of tangent vectors on \(\mathrm{St}(p,n)\). (Note that a geodesic always has constant length of tangent vectors, but not vice versa.)

(ii) Under the canonical metric, \(Y(t)=e^{tA}X\) is a geodesic on \(O_n\) and \(\mathrm{St}(p,n)\) with \(p=n-1\).

In particular, if A is the matrix defined by (2), then the following statements are true.

(iii) The matrix exponential mapping (7) is always a retraction on \(\mathrm{St}(p,n)\).

(iv) Under the Euclidean metric, \(Y(t)=e^{tA}X\) is a geodesic on \(\mathrm{St}(p,n)\) if the tangent vector Z satisfies \(Z^\top X=0\).

(v) Under the Euclidean metric, the matrix exponential mapping (7) is the same as the Riemannian exponential on \(O_n\) and \(S^{n-1}\).

(vi) Under the canonical metric, the matrix exponential mapping (7) is the same as the Riemannian exponential on \(\mathrm{St}(p,n)\).

Proof

  (i)

    The skew-symmetry of A yields \((e^{tA}X)^\top e^{tA}X= X^\top e^{-tA} e^{tA} X=X^\top X=I\). Substituting

    $$\begin{aligned} Y(t)=e^{tA}X,~~Y'(t)=Ae^{tA}X,~~Y''(t)=A^2e^{tA}X, \end{aligned}$$
    (8)

    into the left hand side of (3) and using the skew-symmetry of A, we obtain

    $$\begin{aligned} Y''(t)+Y(t)Y'(t)^\top Y'(t)=e^{tA}(I-XX^\top )A^2X. \end{aligned}$$
    (9)

    If \(p=n\), then X is orthogonal and \(Y(t)=e^{tA}X\) solves the geodesic equation (3). Using the second equation in (8) and the skew-symmetry of A, we have

    $$\begin{aligned} \langle Y'(t),Y'(t)\rangle _e=\langle Ae^{tA}X,Ae^{tA}X\rangle _e\equiv \mathrm{tr}(X^\top A^\top A X), \end{aligned}$$

    which means the curve \(Y(t)=e^{tA}X\) has constant length of tangent vectors.

  (ii)

    Substituting (8) into the left hand side of (4) and using the skew-symmetry of A, we have

    $$\begin{aligned}&Y''(t)+Y'(t)Y'(t)^\top Y(t)+Y(t)\left( (Y(t)^\top Y'(t))^2+Y'(t)^\top Y'(t)\right) \nonumber \\&\quad =e^{tA}(I-XX^\top )A(I-XX^\top )AX. \end{aligned}$$
    (10)

    When \(p=n\), \(Y(t)=e^{tA}X\) solves the geodesic equation (4). If \(p=n-1\), we can extend X to an orthogonal matrix \([X,~q]\). Then \((I-XX^\top )A(I-XX^\top )=qq^\top A qq^\top =0\) and \(Y(t)=e^{tA}X\) is still a solution to (4).

  (iii)

    With A defined by (2), where X is the foot of the tangent vector Z, \(e_A\) in (7) is obviously a smooth mapping. Using \(Y(0)=e^{0\cdot A}X=X\) and \(Y'(0)=Ae^{0\cdot A}X=AX=Z\), we see from the discussion following Definition 1 that \(e_A\) is a retraction on \(\mathrm{St}(p,n)\).

  (iv)

    If the tangent vector Z satisfies \(Z^\top X=0\), it follows from (2) that \(A=ZX^\top -XZ^\top \). Then

    $$\begin{aligned} (I-XX^\top )A^2X=(I-XX^\top )AZ=-(I-XX^\top )XZ^\top Z=0, \end{aligned}$$

    which means by (9) that \(Y(t)=e^{tA}X\) is a solution to (3).

  (v)

    This conclusion follows immediately from statements (i), (iii), (iv) and the fact that the tangent space of the unit sphere reduces to \(T_x S^{n-1}=\{\left. z\in \mathbb {R}^n\right| x^\top z=0\}\).

  (vi)

    This conclusion has already been proved in [16], but to make this paper self-contained, we show this below by using the geodesic equation (4) directly. Applying (2), we have

    $$\begin{aligned} (I-XX^\top )A(I-XX^\top )=(I-XX^\top )(ZX^\top -XZ^\top )(I-XX^\top )=0, \end{aligned}$$

    which means by (10) that \(Y(t)=e^{tA}X\) is a solution to (4). Then the result follows immediately. \(\square \)

The following example demonstrates that the mapping (7) with A defined by (2) is indeed not the same as the Riemannian exponential under the Euclidean metric.

Example 1

Let

$$\begin{aligned} X=\left[ {\begin{array}{*{20}r} 1 &{}\quad 0 \\ 0 &{}\quad 1 \\ 0&{}\quad 0\\ \end{array}} \right] \in \mathrm{St}(2,3),~~ Z=\left[ {\begin{array}{*{20}r} 0 &{}\quad -1 \\ 1 &{}\quad 0 \\ 1&{}\quad 1\\ \end{array}} \right] \in T_X\mathrm{St}(2,3). \end{aligned}$$

Obviously, \(Z^\top X\ne 0\). According to (2), we have

$$\begin{aligned} A=\left[ {\begin{array}{*{20}r} 0 &{}\quad -1&{}\quad -1 \\ 1 &{}\quad 0 &{}\quad -1 \\ 1&{}\quad 1&{}\quad 0 \\ \end{array}} \right] ~~\Longrightarrow ~~ (I-XX^\top )A^2X= \left[ {\begin{array}{*{20}r} 0 &{}\quad 0 \\ 0&{}\quad 0 \\ 1&{}\quad -1\\ \end{array}} \right] . \end{aligned}$$

Thus, according to (9) and the invertibility of \(e^{tA}\), \(Y(t)=e^{tA}X\) does not satisfy the geodesic equation (3) for any t.
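The computation in Example 1 can be reproduced with a few lines of MATLAB (an illustrative check, not part of the original example):

% Quick check of Example 1: the residual in (9) is nonzero.
X = [1 0; 0 1; 0 0];
Z = [0 -1; 1 0; 1 1];
PX = eye(3) - X*X'/2;
A  = PX*Z*X' - X*Z'*PX          % reproduces the matrix A in Example 1
res = (eye(3) - X*X')*A^2*X     % equals [0 0; 0 0; 1 -1], hence nonzero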

The following example demonstrates that the curve \(Y(t)=e^{tA}X\) is not a geodesic under the canonical metric for some skew-symmetric matrices A.

Example 2

Let

$$\begin{aligned} X=\left[ {\begin{array}{*{20}r} 1 &{}\quad 0 \\ 0 &{}\quad 1 \\ 0&{}\quad 0\\ 0&{}\quad 0\\ \end{array}} \right] \in \mathrm{St}(2,4),~~ A=\left[ {\begin{array}{*{20}r} 0 &{}\quad -1&{}\quad -1 &{}\quad -1\\ 1 &{}\quad 0 &{}\quad -1 &{}\quad -1\\ 1&{}\quad 1&{}\quad 0&{}\quad -1 \\ 1&{}\quad 1&{}\quad 1&{}\quad 0\\ \end{array}} \right] . \end{aligned}$$

Then

$$\begin{aligned} (I-XX^\top )A(I-XX^\top )AX= \left[ {\begin{array}{*{20}r} 0 &{}0 \\ 0&{}0 \\ -1&{}-1\\ 1&{}1\\ \end{array}} \right] . \end{aligned}$$

Thus, according to (10) and the invertibility of \(e^{tA}\), \(Y(t)=e^{tA}X\) does not satisfy the geodesic equation (4) for any t.

In general, computing matrix exponentials is not cheap. But for unit spheres, the curve \(y(t)=e^{tA}x\) with A defined by (2) is always a geodesic by Theorem 1 and thus has the following much simpler expression [1]

$$\begin{aligned} y(t)=x\cos (t||z||_2)+z\frac{\sin (t||z||_2) }{||z||_2}. \end{aligned}$$
(11)

Now we generalize this formula to any matrix A of the form \(A=ab^\top -ba^\top \) as follows.

Proposition 1

Let \(A\in \mathbb {R}^{n\times n} \) be a nonzero matrix of the form \(A=ab^\top -ba^\top \) and \(\theta =\sqrt{||a||_2^2||b||_2^2-(a^\top b)^2}\). Then

$$\begin{aligned} e^{tA}=I+\frac{\sin t\theta }{\theta }A+\frac{1-\cos t\theta }{\theta ^2}A^2, \end{aligned}$$
(12)

and

$$\begin{aligned} e^{tA}x=x+\alpha (t) a+\beta (t) b, \end{aligned}$$
(13)

where

$$\begin{aligned} \alpha (t)=\frac{\sin t\theta }{\theta }(b^\top x)+\frac{1-\cos t\theta }{\theta ^2}\left( (a^\top b)(b^\top x)-||b||_2^2(a^\top x)\right) , \end{aligned}$$
(14)

and

$$\begin{aligned} \beta (t)=-\frac{\sin t\theta }{\theta }(a^\top x)+\frac{1-\cos t\theta }{\theta ^2}\left( (a^\top b)(a^\top x)-||a||_2^2(b^\top x)\right) . \end{aligned}$$
(15)

Proof

Because A is nonzero, a and b are linearly independent. So we can find \(n-2\) vectors \(q_1,q_2,\ldots ,q_{n-2}\) such that the matrix \(Q=[a,b,q_1,\ldots ,q_{n-2}]\) is nonsingular. Then

$$\begin{aligned} \det (\lambda I-A)&=\det \left( \lambda I+[a,b][-b,a]^\top \right) \\&=\det \left( Q^{-1}\left( \lambda I+[a,b][-b,a]^\top \right) Q\right) \\&=\det \left( \lambda I+[e_1,e_2]\left[ -Q^\top b,Q^\top a\right] ^\top \right) \\&=\lambda ^{n-2}\left( (\lambda -a^\top b)(\lambda +a^\top b)+(a^\top a)(b^\top b)\right) \\&=\lambda ^{n-2}\left( \lambda ^2-(a^\top b)^2+||a||_2^2||b||_2^2\right) . \end{aligned}$$

Hence, A has eigenvalues \(\lambda _1=\theta \mathbf{i},~\lambda _2=-\theta \mathbf{i},~\lambda _3=\cdots =\lambda _n=0\), where \(\theta =\sqrt{||a||_2^2||b||_2^2-(a^\top b)^2}\).

Since A is skew-symmetric, A is also diagonalizable, and thus its minimal polynomial has no multiple roots. It then follows from Definition 1.4 in [9] that \(e^{tA}\) is equal to h(tA), where h is the unique polynomial that satisfies the interpolation conditions \(h(t\lambda _j)=e^{t\lambda _j},~~j=1,2,3\). Hence,

$$\begin{aligned} h(\lambda )&=\, e^{t\lambda _1}\frac{(\lambda -t\lambda _2)(\lambda -t\lambda _3)}{(t\lambda _1-t\lambda _2)(t\lambda _1-t\lambda _3)} +e^{t\lambda _2}\frac{(\lambda -t\lambda _1)(\lambda -t\lambda _3)}{(t\lambda _2-t\lambda _1)(t\lambda _2-t\lambda _3)}\\&\quad \,+e^{t\lambda _3}\frac{(\lambda -t\lambda _1)(\lambda -t\lambda _2)}{(t\lambda _3-t\lambda _1)(t\lambda _3-t\lambda _2)}\\&=\,-e^{t\theta \mathbf i}\frac{\lambda ^2+t\theta \mathbf{i}\lambda }{2t^2\theta ^2} -e^{-t\theta \mathbf i}\frac{\lambda ^2-t\theta \mathbf{i}\lambda }{2t^2\theta ^2 } +\frac{\lambda ^2+t^2\theta ^2}{t^2\theta ^2 }\\&=\frac{1-\cos t\theta }{t^2\theta ^2}\lambda ^2+\frac{\sin t\theta }{t\theta }\lambda +1. \end{aligned}$$

Then (12) follows immediately. Substituting \(A=ab^\top -ba^\top \) into (12), we also obtain (13)–(15). \(\square \)

Remark 1

It is interesting that the right hand side of (12) is a quadratic polynomial of A that resembles the second-order Taylor expansion of \(e^{tA}\) around \(t=0\).

Using the identity \(z=(I-xx^\top )z=z-x(z^\top x)\) for any tangent vector \(z\in T_xS^{n-1}\), it is easy to see that formulae (13)–(15) coincide with (11). However, Proposition 1 tells us more than a geodesic expression, because formulae (13)–(15) may generate many efficient schemes if we choose appropriate vectors a and b such that \(Ax=(ab^\top -ba^\top )x=z\) for a tangent vector z. For example, we can choose \(a=z+\tau x\) and \(b=x\), where \(\tau \) is a real number.
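As an illustration of this choice, the following MATLAB sketch (ours; the values of n, t and \(\tau \) are arbitrary) evaluates \(e^{tA}x\) through formulae (13)–(15) with \(a=z+\tau x\), \(b=x\), and compares the result with expm.

% Minimal illustrative sketch of Proposition 1 with a = z + tau*x, b = x.
n = 6;  t = 0.7;  tau = 0.3;
x = randn(n,1);  x = x / norm(x);
z = randn(n,1);  z = z - x*(x'*z);               % tangent vector: x'*z = 0
a = z + tau*x;   b = x;                          % then (a*b' - b*a')*x = z
A = a*b' - b*a';

theta = sqrt(norm(a)^2*norm(b)^2 - (a'*b)^2);    % as in Proposition 1
alpha = sin(t*theta)/theta*(b'*x) + ...
        (1-cos(t*theta))/theta^2*((a'*b)*(b'*x) - norm(b)^2*(a'*x));   % (14)
beta  = -sin(t*theta)/theta*(a'*x) + ...
        (1-cos(t*theta))/theta^2*((a'*b)*(a'*x) - norm(a)^2*(b'*x));   % (15)
y = x + alpha*a + beta*b;                        % formula (13)

fprintf('||y - expm(t*A)*x|| = %.2e\n', norm(y - expm(t*A)*x));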

4 Properties of approximations to matrix exponentials

Motivated by the results in the previous section, we now investigate the relation between approximations to matrix exponentials and retractions on the Stiefel manifold. Padé approximation is a powerful tool for computing functions of matrices; it underlies MATLAB's expm function for efficiently computing matrix exponentials [8, 9, 15].

Diagonal Padé approximants \(r_{m}(x)=p_{m}(x)/q_{m}(x)\) to exponentials are known explicitly as [3]

$$\begin{aligned} p_{m}(x)=\sum _{i=0}^m\frac{(2m-i)!m!}{(2m)!(m-i)!}\frac{x^i}{i!}, ~~q_{m}(x)=\sum _{i=0}^m\frac{(2m-i)!m!}{(2m)!(m-i)!}\frac{(-x)^i}{i!}. \end{aligned}$$
(16)

Replacing x with A, one can obtain the Padé approximants \(r_{m}(A)=q_{m}(A)^{-1}p_{m}(A)\) to the matrix exponential \(e^A\).

Lemma 1

For all skew-symmetric matrices A, \(p_m(A)\) and \(q_m(A)\) are nonsingular, and thus \(r_m(A)\) is well defined.

Proof

Let A be a skew-symmetric matrix. Then its eigenvalues are either zero or pure imaginary. Thus, owing to \(p_m(0)=1\) and \(q_m(0)=1\), \(p_m(A)\) or \(q_m(A)\) is singular if and only if there exists a conjugate pair of pure imaginary eigenvalues \(\pm \mu \mathbf{i}\) of A satisfying \(p_m(\pm \mu \mathbf{i})=0\) or \(q_m(\pm \mu \mathbf{i})=0\). It follows from (16) that diagonal Padé approximants are symmetric, i.e., \(p_m(-x)=q_m(x)\). Then \(q_m(\pm \mu \mathbf{i})=0\) if and only if \(p_m(\pm \mu \mathbf{i})=0\). Therefore the singularity of either \(p_m(A)\) or \(q_m(A)\) would contradict the fact that \(p_m\) and \(q_m\) are coprime [3]. \(\square \)

Lemma 2

Let \(A\in \mathbb {R}^{n\times n}\) be any skew-symmetric matrix. Then the curve \(Y(t)=r_m(tA)X\) satisfies \(Y(t)^\top Y(t)=X^\top X\), \(Y(0)=X\), and \(Y'(0)=AX\).

Proof

Using the skew-symmetry of A and the symmetry of diagonal Padé approximants, we have

$$\begin{aligned} Y(t)^\top Y(t)&=X^\top p_m(tA)^\top q_m(tA)^{-\top } q_m(tA)^{-1} p_m(tA)X\\&=X^\top p_m(-tA)q_m(-tA)^{-1} q_m(tA)^{-1}p_m(tA)X\\&=X^\top q_m(tA)p_m(tA)^{-1} q_m(tA)^{-1} p_m(tA)X\\&=X^\top q_m(tA)q_m(tA)^{-1} p_m(tA)^{-1} p_m(tA)X\\&=X^\top X. \end{aligned}$$

Since \(r_m(0\cdot A)=I\), we have \(Y(0)=X\). Differentiating both sides of \(q_m(tA)Y(t)=p_m(tA)X\) with respect to t, we have

$$\begin{aligned} \frac{\mathrm{d}}{{\mathrm{d}t}}q_m(tA)\cdot Y(t)+q_m(tA)\cdot Y'(t)= \frac{\mathrm{d}}{{\mathrm{d}t}}p_m(tA)X. \end{aligned}$$

Substituting \(t=0\) into this equality and using (16), we have \(-\frac{A}{2}X+Y'(0)=\frac{A}{2}X\), which yields \(Y'(0)=AX\) immediately. \(\square \)

Lemmas 1 and 2 imply the following result.

Theorem 2

Let A be the skew-symmetric matrix defined by (2) for any tangent vector \(Z\in T_{X}\mathrm{St}(p,n)\). Then the curve \(Y(t)=r_m(tA)X\) defines a retraction \(R: T\mathrm{St}(p,n) \rightarrow \mathrm{St}(p,n)\) which assigns to each pair (X, Z), where \(Z\in T_X \mathrm{St}(p,n)\), the point \(r_m(A)X \in \mathrm{St}(p,n)\).

Theorem 2 yields the Padé-form scheme

$$\begin{aligned} Y(t)=r_m(tA)X, \end{aligned}$$
(17)

where A is defined by (2). It is readily seen that (17) covers the Cayley transformation (6) as the special case \(m=1\). In fact, there is a well-known error expression for \(r_m(tA)\) (see, e.g., equation (10.25) in [9])

$$\begin{aligned} e^{tA}-r_m(tA)=\frac{(-1)^m}{(2m)!}t^{2m+1}A^{2m+1}q_m(tA)^{-1}\int _0^1 e^{stA}(1-s)^ms^m \,\mathrm{d}s, \end{aligned}$$

which means that \(r_m(tA)\) is a 2mth-order approximation to \(e^{tA}\) around zero. Therefore the mapping \(R: (X,Z)\mapsto r_m(A)X\) is a high-order retraction [2, 7]: under the canonical metric in general, and under the Euclidean metric in some special cases, according to Theorem 1. The projected polynomial method recently proposed by Gawlik and Leok [7] also has connections with Padé approximation, but our method and theirs differ in the tangent representation, the exponential representation, and therefore the final approximants to the exponential.
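A minimal MATLAB sketch of the scheme (17) (illustrative only; m, the dimensions and the step t are arbitrary choices of ours) is given below; it builds \(p_m\) and \(q_m\) from (16), checks the feasibility \(Y^\top Y=I\), and measures the distance to \(e^{tA}X\).

% Minimal illustrative sketch of the Pade-form retraction (17).
n = 8; p = 3; m = 2; t = 0.5;
X = orth(randn(n, p));
W = randn(n, p);  Z = W - X*(X'*W + W'*X)/2;
PX = eye(n) - X*X'/2;
A  = PX*Z*X' - X*Z'*PX;                       % formula (2)

% p_m(tA) and q_m(tA) = p_m(-tA), coefficients from (16)
Pm = eye(n); Qm = eye(n); Mi = eye(n);
for i = 1:m
    c  = factorial(2*m-i)*factorial(m) / ...
         (factorial(2*m)*factorial(m-i)*factorial(i));
    Mi = Mi * (t*A);                          % (tA)^i
    Pm = Pm + c*Mi;
    Qm = Qm + c*(-1)^i*Mi;
end
Y = Qm \ (Pm * X);                            % Y(t) = r_m(tA) X, scheme (17)

fprintf('||Y''*Y - I||        = %.2e\n', norm(Y'*Y - eye(p)));
fprintf('||Y - expm(t*A)*X|| = %.2e\n', norm(Y - expm(t*A)*X));

For \(m=1\) the loop produces \(p_1(tA)=I+\frac{t}{2}A\) and \(q_1(tA)=I-\frac{t}{2}A\), so Y coincides with the Cayley transformation (6) evaluated at the tangent vector tZ.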

Corollary 1

Let \({\overline{p}}_m(x)\) be a polynomial of degree m having the form \({\overline{p}}_m(x)=1+\frac{1}{2}x+O(x^2)\) and \({\overline{r}}_m(x)=\frac{{\overline{p}}_m(x)}{{\overline{p}}_m(-x)}\). Suppose that \({\overline{p}}_m(x)\) has no pure imaginary roots. Then the mapping \(\overline{R}: T\mathrm{St}(p,n) \rightarrow \mathrm{St}(p,n)\) which assigns to each pair (X, Z), where \(Z\in T_X \mathrm{St}(p,n)\), the point \({\overline{r}}_m(A)X \in \mathrm{St}(p,n)\) with A defined by (2) is a retraction.

Remark 2

The quasi-geodesic flow \(Y(t)=\left( I-\frac{t}{2m}A\right) ^{-m}\left( I+\frac{t}{2m}A\right) ^mX\) proposed in [16] also belongs to the category in Corollary 1.

Now, we consider efficient ways of computing (17) for large-scale problems. For low-rank problems, i.e., \(p\ll n\), the Cayley transformation (6) can be rewritten as

$$\begin{aligned} Y(t)=X+t U \left( I-\frac{t}{2}V^\top U\right) ^{-1} V^\top X, \end{aligned}$$

where \(U=[P_{X}Z,~X]\) and \(V=[X,~-P_{X}Z]\), so that \(A=UV^\top \); see [20]. This reduces the computational complexity to \(O(np^2)\). In general, however, (17) does not have such an easy-to-compute expression.
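The low-rank form can be checked numerically against (6); the following MATLAB sketch (ours, with arbitrary dimensions) does so.

% Minimal illustrative sketch: low-rank Cayley form vs. the full formula (6).
n = 500; p = 10; t = 0.8;
X = orth(randn(n, p));
W = randn(n, p);  Z = W - X*(X'*W + W'*X)/2;
PX = eye(n) - X*X'/2;

% full n-by-n formula (6) along tZ
A = PX*Z*X' - X*Z'*PX;
Yfull = (eye(n) - t*A/2) \ ((eye(n) + t*A/2) * X);

% low-rank form: A = U*V' with U = [P_X Z, X], V = [X, -P_X Z]
U = [PX*Z, X];  V = [X, -PX*Z];
Ylow = X + t*U*((eye(2*p) - t*(V'*U)/2) \ (V'*X));

fprintf('||Yfull - Ylow|| = %.2e\n', norm(Yfull - Ylow, 'fro'));

In a practical implementation only the 2p-by-2p system is solved, which is the source of the \(O(np^2)\) cost mentioned above.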

In what follows, we introduce a heuristic approach to tackle high-rank problems. This approach is based on Krylov subspace methods. The lth Krylov subspace of a nonzero matrix \(A\in \mathbb {R}^{n\times n}\) and a nonzero vector \(b\in \mathbb {R}^n\) is defined by

$$\begin{aligned} K_l(A,b)=\mathrm{span}\{b,Ab,\ldots ,A^{l-1}b\}. \end{aligned}$$

For any skew-symmetric matrix \(A\in \mathbb {R}^{n\times n}\), the Lanczos process generates the equation

$$\begin{aligned} AQ_l=Q_lT_l+\beta _{l+1}q_{l+1}e_l^\top , \end{aligned}$$
(18)

where \(Q_l=[q_1,\ldots ,q_l]\) forms an orthonormal basis of \(K_l(A,q_1)\), \(q_{l+1}\) is a unit vector orthogonal to \(Q_l\), \(T_l\) is an \(l\times l\) irreducible skew-symmetric tridiagonal matrix and \(e_l\) is the lth column of the identity matrix. It follows from (18) that \(Q_l^\top AQ_l=T_l\).
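A minimal MATLAB sketch of this Lanczos process for a skew-symmetric matrix (illustrative only; it omits reorthogonalization, which is usually adequate for small l) is as follows; it verifies (18) and the orthonormality of \(Q_l\).

% Minimal illustrative sketch: Lanczos process for a skew-symmetric A.
n = 50; l = 8;
B = randn(n);  A = B - B';                  % a skew-symmetric test matrix
q1 = randn(n, 1);  q1 = q1 / norm(q1);

Ql = zeros(n, l);  Ql(:, 1) = q1;
Tl = zeros(l, l);
beta = 0;  qprev = zeros(n, 1);
for j = 1:l
    w = A*Ql(:, j) + beta*qprev;            % A q_j = -beta_j q_{j-1} + beta_{j+1} q_{j+1}
    beta = norm(w);
    if j < l
        Tl(j+1, j) =  beta;
        Tl(j, j+1) = -beta;
        Ql(:, j+1) = w / beta;
    else
        qnext = w / beta;                   % the residual direction q_{l+1}
    end
    qprev = Ql(:, j);
end

el = zeros(l, 1);  el(l) = 1;
fprintf('||A*Ql - Ql*Tl - beta*qnext*el''|| = %.2e\n', ...
        norm(A*Ql - Ql*Tl - beta*qnext*el', 'fro'));      % relation (18)
fprintf('||Ql''*Ql - I|| = %.2e\n', norm(Ql'*Ql - eye(l), 'fro'));

Because A is skew-symmetric, the Arnoldi recurrence collapses to the three-term recurrence \(Aq_j=-\beta _jq_{j-1}+\beta _{j+1}q_{j+1}\) used above; for larger l, reorthogonalization may be needed in floating-point arithmetic.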

Our subspace variation of the update scheme (17) is

$$\begin{aligned} Y(t)=r_m(tQ_{l}T_{l}Q_{l}^\top )X. \end{aligned}$$
(19)

When \(m=1\), (19) reduces to

$$\begin{aligned} Y(t)=\left( I-\frac{t}{2}Q_{l}T_{l}Q_{l}^\top \right) ^{-1}\left( I+\frac{t}{2}Q_{l}T_{l}Q_{l}^\top \right) X. \end{aligned}$$
(20)

We prefer to leave the integer l to be chosen by the user rather than fixed by a certain rule. For example, l can be increased adaptively, with an upper bound, from iteration to iteration, similarly to some subspace optimization methods such as [21]. We do not specify a concrete strategy for choosing l in this paper, since doing so requires careful experimental verification, which we leave for future work.

Defining

$$\begin{aligned} {{\widetilde{Z}}}:=Q_{l}T_{l}Q_{l}^\top X, \end{aligned}$$
(21)

which is obviously a tangent vector at X, one can show as in Lemma 2 that (19) satisfies:

$$\begin{aligned} Y(t)^\top Y(t)=I,~~Y(0)=X,~~\text {and}~~Y'(0)={\widetilde{Z}}. \end{aligned}$$

This means (19) is a curve on \(\mathrm{St}(p,n)\) that passes through X with velocity \({\widetilde{Z}}\). Of course, since the initial velocity is \({\widetilde{Z}}\) rather than Z, this first-order geometric property indicates that (19) is no longer a retraction, which does affect the global convergence theory of Riemannian optimization algorithms. In practice, however, a distinct advantage of (19) over (17) is its lower computational complexity, owing to the low rank of \({\widetilde{Z}}\) in (21).

Proposition 2

The curve (19) is equivalent to

$$\begin{aligned} Y(t)=X-Q_{l}\left( I-r_m(tT_{l})\right) Q_{l}^\top X. \end{aligned}$$
(22)

In particular, (20) is equivalent to

$$\begin{aligned} Y(t)=X+t Q_{l}T_{l}\left( I-\frac{t}{2}T_{l}\right) ^{-1} Q_{l}^\top X. \end{aligned}$$
(23)

Proof

Let \({\widetilde{p}}_m(x)=p_m(x)-1\) and \({\widetilde{q}}_m(x)=q_m(x)-1\), where \(p_m\) and \(q_m\) are both defined by (16). Then

$$\begin{aligned} r_m(tQ_{l}T_{l}Q_{l}^\top )=\left( I+Q_{l}{\widetilde{q}}_m(tT_{l})Q_{l}^\top \right) ^{-1} \left( I+Q_{l}{\widetilde{p}}_m(tT_{l})Q_{l}^\top \right) . \end{aligned}$$

Defining \(U=Q_{l}{\widetilde{q}}_m(tT_{l})\), \(V=Q_{l}\), and \(W=Q_{l}{\widetilde{p}}_m(tT_{l})\) and applying the Sherman–Morrison–Woodbury formula, we obtain

$$\begin{aligned} Y(t)&=\left( I-U\left( I+V^\top U\right) ^{-1}V^\top \right) \left( I+WV^\top \right) X\\&=X-U\left( I+V^\top U\right) ^{-1}\left( I+V^\top W\right) V^\top X+WV^\top X\\&=X-Q_{l}{\widetilde{q}}_m(tT_{l})\left( I+{\widetilde{q}}_m(tT_{l})\right) ^{-1} \left( I+{\widetilde{p}}_m(tT_{l})\right) Q_{l}^\top X+ Q_{l}{\widetilde{p}}_m(tT_{l})Q_{l}^\top X\\&=X-Q_{l}\left( I-\left( I+{\widetilde{q}}_m(tT_{l})\right) ^{-1}\right) \left( I+{\widetilde{p}}_m(tT_{l})\right) Q_{l}^\top X+ Q_{l}{\widetilde{p}}_m(tT_{l})Q_{l}^\top X\\&=X-Q_{l}\left( I-r_m(tT_{l})\right) Q_{l}^\top X. \end{aligned}$$

If \(m=1\), then \(r_1(tT_{l})=(I-\frac{t}{2}T_{l})^{-1}(I+\frac{t}{2}T_{l})\) and thus

$$\begin{aligned} Y(t)&=X-Q_{l}\left( I-\left( I-\frac{t}{2}T_{l}\right) ^{-1}\left( I+\frac{t}{2}T_{l}\right) \right) Q_{l}^\top X\\&=X+t Q_{l}T_{l}\left( I-\frac{t}{2}T_{l}\right) ^{-1} Q_{l}^\top X. \end{aligned}$$

So the proof is complete. \(\square \)

Remark 3

Proposition 2 implies the displacement \(Y(t)-X\) lies in the subspace \(\{D\in \mathbb {R}^{n\times p}~|~D=Q_{l}U,~U\in \mathbb {R}^{l\times p}\}\).

Now we analyze the complexity of (22) and (23). The Lanczos process costs about \(2n^2l\) flops. Computing \(M_1=Q_{l}^\top X\) costs about \(2n^2l\) flops (recall that \(p\lessapprox n\)). Computing \(M_2=\left( I-\frac{t}{2}T_{l}\right) ^{-1}M_1\) costs about 4nl flops. The final assembly of Y(t) needs about \(2n^2l\) flops. Hence the overall cost of (23) is about \(6n^2l\) flops, and updating Y(t) in (23) for a different t costs about \(2n^2l\) flops. A similar argument shows that the overall cost of (22) and its cost of updating for a different t are the same as those of (23). The most commonly used qf retraction through the QR decomposition requires about \(2np^2\) flops. Obviously, if \(p\lessapprox n\) and \(l\ll n\), then \(3nl\ll p^2\) and therefore (22) and (23) are much cheaper than the qf retraction.
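The following MATLAB sketch (ours; the dimensions are arbitrary and the dense construction of A is only for clarity of the sketch) assembles (23) from l Lanczos steps and checks that the resulting curve stays on \(\mathrm{St}(p,n)\) and has velocity \({\widetilde{Z}}\).

% Minimal illustrative sketch of the Krylov subspace update (23).
n = 400; p = 395; l = 10; t = 0.5;
X = orth(randn(n, p));
W = randn(n, p);  Z = W - X*(X'*W + W'*X)/2;   % original tangent vector
PX = eye(n) - X*X'/2;
A  = PX*Z*X' - X*Z'*PX;                        % formula (2), dense for clarity

q1 = ones(n,1)/sqrt(n);                        % Lanczos starting vector
Ql = zeros(n,l); Ql(:,1) = q1; Tl = zeros(l);
beta = 0; qprev = zeros(n,1);
for j = 1:l
    w = A*Ql(:,j) + beta*qprev;
    beta = norm(w);
    if j < l
        Tl(j+1,j) = beta;  Tl(j,j+1) = -beta;
        Ql(:,j+1) = w/beta;
    end
    qprev = Ql(:,j);
end

M1 = Ql'*X;                                    % l-by-p, about 2nlp flops
Y  = X + t*Ql*(Tl*((eye(l) - t*Tl/2) \ M1));   % formula (23)
fprintf('||Y''*Y - I|| = %.2e\n', norm(Y'*Y - eye(p), 'fro'));

Ztilde = Ql*(Tl*M1);                           % formula (21)
h = 1e-6;                                      % finite-difference check of Y'(0)
Yh = X + h*Ql*(Tl*((eye(l) - h*Tl/2) \ M1));
fprintf('||(Y(h)-X)/h - Ztilde|| = %.2e\n', norm((Yh - X)/h - Ztilde, 'fro'));

For a genuinely large problem one would, of course, apply A to vectors without forming it explicitly.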

Although (19) has the low-complexity form (22), we still have to check whether (19) keeps the descent property. Consider gradient descent methods on the orthogonal group \(O_n\), which means the original tangent vector Z is the negative Riemannian gradient \(-\nabla f(X)\). Of course, the Riemannian gradient depends on the metric. Under the Euclidean metric, \(\nabla _e f(X)=(I-XX^\top )G+X\mathrm{skew}(X^\top G)\), and under the canonical metric, \(\nabla _c f(X)=G-XG^\top X\), where G is the ordinary gradient of f and \(\mathrm{skew}(M):=\frac{1}{2}(M-M^\top )\). Then we have the following result.

Proposition 3

Consider the case where the manifold is \(O_n\) and Z is chosen as \(-\nabla f(X)\). If X is a noncritical point and \(q_1\) is not an eigenvector of A corresponding to the zero eigenvalue, then (19) is a descent curve.

Proof

First note that the hypothesis implies that the matrix A in (2) is nonzero. Since \(Aq_1\ne 0\) and A has no nonzero real eigenvalues, \(q_1\) and \(Aq_1\) are linearly independent, and therefore \(T_l\) is nonzero. Differentiating f(Y(t)) with respect to t at \(t=0\), we obtain

$$\begin{aligned} \left. \frac{\mathrm{d}}{\mathrm{d}t}f(Y(t))\right| _{t=0}=\langle \nabla f(X),{\widetilde{Z}}\rangle =-\langle AX, Q_{l}T_{l}Q_{l}^\top X\rangle , \end{aligned}$$

which, using \(XX^\top =I\) on \(O_n\) and \(Q_l^\top AQ_l=T_l\), equals \(-\mathrm{tr}(T_l^\top T_l)=-||T_l||_F^2<0\) under the Euclidean metric (and \(-\frac{1}{2}||T_l||_F^2<0\) under the canonical metric). Hence Y(t) is indeed a descent curve. \(\square \)

Next we give a non-rigorous discussion of the descent property of (19) under the Euclidean metric for the case where p is less than but very close to n, i.e., \(p\lessapprox n\). Let \(B=A^\top Q_lT_lQ_l^\top \). According to the proof of Proposition 3, we have

$$\begin{aligned} \left. \frac{\mathrm{d}}{\mathrm{d}t}f(Y(t))\right| _{t=0}=\langle \nabla _e f(X),{\widetilde{Z}}\rangle _{e}=-\langle AX, Q_{l}T_{l}Q_{l}^\top X\rangle _e=-\mathrm{tr} (X^\top B X). \end{aligned}$$
(24)

Suppose \(\lambda _1\le \cdots \le \lambda _n\) are the eigenvalues of \(\mathrm{sym}(B)\), where \(\mathrm{sym}(B):=\frac{1}{2}(B+B^\top )\). Then \(\mathrm{tr} (X^\top B X)= \mathrm{tr}(X^\top \mathrm{sym}(B)X)\ge \sum \nolimits _{i=1}^p\lambda _i\) and \(\mathrm{tr}(B)=\mathrm{tr}(\mathrm{sym}(B))=\sum \nolimits _{i=1}^n\lambda _i\). Since B shares the same nonzero eigenvalues with \(Q_l^\top A^\top Q_lT_l=T_l^\top T_l\), the eigenvalues of B are nonnegative real numbers and \(\mathrm{tr}(B)>0\). It follows from Theorem 4.3.50 in [10] that the sequence of eigenvalues \((\lambda _1,\ldots ,\lambda _n)\) of \(\mathrm{sym}(B)\) majorizes the sequence of eigenvalues of B. So, if p is sufficiently close to n and the \(n-p\) largest eigenvalues \(\lambda _{p+1},\ldots ,\lambda _n\) of \(\mathrm{sym}(B)\) are not too large (more precisely, if \(\sum \nolimits _{i=p+1}^n\lambda _i<\mathrm{tr}(B)\)), then \(\mathrm{tr} (X^\top B X)\ge \mathrm{tr}(B)-\sum \nolimits _{i=p+1}^n\lambda _i>0\), which implies \(\left. \frac{\mathrm{d}}{\mathrm{d}t}f(Y(t))\right| _{t=0}<0\) by (24).

Finally, we give a simple example to show the potential benefit of using (22). Consider the minimization problem

$$\begin{aligned} \min \limits _{X\in \mathbb {R}^{1000\times 995}} \frac{1}{2}\mathrm{tr} (X^\top DX)~~\mathrm{s.t.}~~X^\top X=I, \end{aligned}$$
(25)

where \(D=\mathrm{diag}(1,2,\ldots ,1000)\). It is easy to see that the optimal value is 247,755 and the optimal solutions are the matrices X of the form \(X=[Q,0]^\top \), where Q is any \(995\times 995\) orthogonal matrix. In this problem, the objective function is \(f(X)=\frac{1}{2}\mathrm{tr}(X^\top DX)\). The corresponding Riemannian gradient is \(\nabla _{e/c} f(X)=(I-XX^\top )DX\), and the matrix \(A_{e/c}\) defined in (2) with \(Z=-\nabla _{e/c} f(X)\) is \(A_{e/c}=XX^\top D-DXX^\top \). The algorithms we adopted for solving (25) were two basic gradient descent methods, which use the qf (QR decomposition) retraction and the curve (22), respectively. They are formally described in the following two frames. Notice that in line 7 of the algorithms the initial step size for \(k\ge 1\) is chosen to be the BB step size [4]

$$\begin{aligned} \tau _{k+1}^\mathrm{BB} = \frac{||S_k||_F^2}{\left| \mathrm{tr}(S_k^\top W_k)\right| }, \end{aligned}$$
(26)

where \(S_k=X_{k+1}-X_k\) and \(W_k=\nabla f(X_{k+1})-\nabla f(X_k)\). Initially, we simply fixed a small constant initial step size, but one of the referees suggested that we use the BB step size, because a conservative constant initial step size penalizes the qf retraction more than the Krylov-subspace-based scheme. Numerical results showed that the BB step size (26) accelerated both algorithms remarkably in comparison with constant step sizes.

Our experiments were conducted in MATLAB R2015a on a laptop with an Intel Core i5-6300U CPU @ 2.40 GHz and 8.00 GB of RAM. The starting matrix \(X_0\) was generated randomly by the command

$$\begin{aligned} \mathtt{X0 = orth(randn(1000,995))}. \end{aligned}$$

Parameters were chosen as \(\epsilon =10^{-4}\), \(\rho =10^{-8}\), \(\delta =0.5\), \(\tau =1\). For Algorithm 2, the integers m and l were set as \(m=2\) and \(l=10\), and \(q_1=Q_l(:,1)\) was set as \(q_1=\frac{1}{\sqrt{1000}}(1,\ldots ,1)^\top \). The Euclidean metric was used for both algorithms. Numerical results are shown in Table 1, where the optimal value found, the number of iterations, the number of function evaluations, the number of gradient evaluations and the running time are reported. We can see that Algorithm 2 ran faster than Algorithm 1, although Algorithm 2 needed 13 more iterations. Problem (25) combines two features: the variable matrix has a very high rank, and the objective function and its gradient are easy to compute. These features make the time spent on computing retractions account for a large portion of the total running time. The results in Table 1 suggest that our Krylov subspace technique is potentially useful for problems like (25).

Table 1 Numerical results of Algorithms 1 and 2 on problem (25)
(Algorithm 1: gradient descent with the qf retraction. Algorithm 2: gradient descent with the Krylov subspace scheme (22).)
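Since the two algorithm frames are not reproduced above, the following MATLAB sketch is only a reconstruction under stated assumptions (an Armijo backtracking line search with sufficient-decrease parameter \(\rho \), backtracking factor \(\delta \), initial step \(\tau \), stopping tolerance \(\epsilon \) on the gradient norm, and the BB initial step (26) for \(k\ge 1\)); it is not the authors' exact Algorithm 1. A reduced problem size is used so the sketch runs quickly; the experiment in the paper uses n = 1000, p = 995.

% Reconstruction sketch (assumptions as stated above) of gradient descent
% with the qf retraction for a scaled-down version of problem (25).
n = 200; p = 195;
D = spdiags((1:n)', 0, n, n);
f    = @(X) 0.5*sum(sum(X.*(D*X)));
grad = @(X) D*X - X*(X'*(D*X));         % Riemannian gradient (I - XX')DX

epsilon = 1e-4; rho = 1e-8; delta = 0.5; tau = 1;
X = orth(randn(n, p));  gX = grad(X);  k = 0;

while norm(gX, 'fro') > epsilon && k < 5000
    Z = -gX;  t = tau;  fX = f(X);
    while true                          % Armijo backtracking (assumed form)
        [Q, R] = qr(X + t*Z, 0);
        Y = Q*diag(sign(diag(R)));      % qf retraction
        if f(Y) <= fX - rho*t*norm(Z, 'fro')^2, break; end
        t = delta*t;
    end
    S = Y - X;  Wk = grad(Y) - gX;      % BB quantities; using the Riemannian
    tau = norm(S, 'fro')^2 / abs(trace(S'*Wk));   % gradient here is our assumption
    X = Y;  gX = grad(X);  k = k + 1;
end
fprintf('iterations = %d, f(X) = %.2f (optimum %.2f)\n', k, f(X), 0.5*sum(1:p));

The qf retraction line is the only place where the two algorithms differ: Algorithm 2 would instead run the Lanczos sketch shown after (18) on the matrix \(A_{e/c}=XX^\top D-DXX^\top \) and evaluate the curve (22).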

5 Conclusions

In this paper, we investigate the geometric properties of matrix exponentials and their approximations related to optimization on the Stiefel manifold. We clarify the relation between matrix exponentials and geodesics or retractions under various conditions. Diagonal Padé approximation to matrix exponentials is found to be very useful for constructing a new class of retractions, which covers the Cayley transformation as a special case. We also propose a Krylov subspace technique to reduce the cost of the new retractions on high-rank Stiefel manifolds, and prove its descent property when implemented in gradient methods on the orthogonal group. A simple example shows the potential benefit of this Krylov subspace technique.