Abstract
In this paper we investigate the geometric properties of matrix exponentials and their approximations in relation to optimization on the Stiefel manifold. The relation between matrix exponentials and geodesics or retractions on the Stiefel manifold is discussed. Diagonal Padé approximation to matrix exponentials is used to construct new retractions. A Krylov subspace implementation of the new retractions is also considered for the orthogonal group and for the Stiefel manifold with a very high rank.
1 Introduction
In this paper, we discuss some geometric properties of matrix exponentials \(e^A=\sum \nolimits _{k=0}^{\infty }\frac{1}{k!}A^k\) and their approximations related to the following optimization problem
$$\begin{aligned} \min _{X\in \mathrm{St}(p,n)}~f(X), \end{aligned}$$(1)
where the feasible set \(\mathrm{St}(p,n)=\left\{ \left. X\in \mathbb {R}^{n\times p}\right| X^\top X=I\right\} \) is the well-known Stiefel manifold. When \(p=n\), \(\mathrm{St}(p,n)\) is just the orthogonal group \(O_n\). When \(p=1\), \(\mathrm{St}(p,n)\) reduces to the unit sphere \(S^{n-1}\). Problem (1) pervades many areas in science and engineering. For various concrete examples of applications of this problem, the reader is referred to [6, 14, 20] and references therein.
Prevalent optimization algorithms on Riemannian manifolds are built on the notion of retraction [1, 2], a tool stemming from the Riemannian exponential [5]. Searching along a geodesic is in general computationally expensive, but retraction provides a practical alternative while still guaranteeing convergence. In fact, a retraction can be viewed as a first-order approximation to the Riemannian exponential. Numerical linear algebra techniques furnish us with abundant ways of constructing retractions. For the Stiefel manifold, the most commonly used retractions are the Q-factor of the QR decomposition and the orthogonal part of the polar decomposition. A more recently introduced retraction is the Cayley transformation, proposed in [16] and developed in [20, 22].
Inspired by the idea that the Cayley transformation can be viewed as an approximate matrix exponential, we probe into the relation between matrix exponentials and geodesics or retractions, and proceed to create new retractions via approximation methods for matrix exponentials. There are two motivations behind the new retractions. One is to complete the theory of optimization on the Stiefel manifold with different tangent and exponential representations. The other is to make an attempt to tackle high-rank problems. By 'high-rank' we mean that p is very large, possibly close to n, i.e., \(p\lessapprox n\). To our knowledge, retractions with cheap cost exist in the literature only for low-rank problems [1, 6, 7, 14, 20, 22].
Outline. The rest of this paper is organized as follows. Basic knowledge of geometry and optimization on the Stiefel manifold is introduced in Sect. 2. Properties of matrix exponentials and their approximations are discussed in Sects. 3 and 4, respectively. Conclusions are drawn in Sect. 5.
2 Basic geometry and optimization on the Stiefel manifold
It is easy to see that the tangent space \(T_{X}\mathrm{St}(p,n)\) of the Stiefel manifold \(\mathrm{St}(p,n)\) can be expressed as
$$\begin{aligned} T_{X}\mathrm{St}(p,n)=\left\{ \left. Z\in \mathbb {R}^{n\times p}\right| X^\top Z+Z^\top X=0\right\} . \end{aligned}$$Indeed, this tangent space can also be represented as
$$\begin{aligned} T_{X}\mathrm{St}(p,n)=\left\{ \left. AX\right| A\in \mathbb {R}^{n\times n},~A^\top =-A\right\} . \end{aligned}$$On one hand, any Z of the form \(Z=AX\) with an n-by-n skew-symmetric matrix A satisfies \(X^\top Z+Z^\top X=0\). On the other hand, for any \(Z\in T_{X}\mathrm{St}(p,n)\), we can write \(Z=AX\) by defining
$$\begin{aligned} A=P_XZX^\top -XZ^\top P_X, \end{aligned}$$(2)where \(P_X:=I-\frac{1}{2}XX^\top \). There are two commonly known metrics for \(T_{X}\mathrm{St}(p,n)\), the Euclidean metric \(\langle Z_1,Z_2\rangle _e: =\mathrm{tr} (Z_1^\top Z_2)\) and the canonical metric \(\langle Z_1,Z_2\rangle _c: =\mathrm{tr} (Z_1^\top P_X Z_2)\), \(\forall ~Z_1,Z_2\in T_{X}\mathrm{St}(p,n)\).
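As a quick numerical sanity check, the tangent-space construction above can be sketched in NumPy. The sketch assumes the skew-symmetric choice \(A=P_XZX^\top -XZ^\top P_X\) with \(P_X=I-\frac{1}{2}XX^\top \) as in (2); all variable names are our own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3

# Random point on St(p, n) via the Q-factor of a QR decomposition.
X, _ = np.linalg.qr(rng.standard_normal((n, p)))

# Random tangent vector: project an arbitrary W onto T_X St(p, n),
# i.e. remove the symmetric part of X^T W.
W = rng.standard_normal((n, p))
XtW = X.T @ W
Z = W - X @ (XtW + XtW.T) / 2

# Skew-symmetric A with AX = Z, our assumed formula (2).
P = np.eye(n) - 0.5 * X @ X.T
A = P @ Z @ X.T - X @ Z.T @ P

assert np.allclose(A, -A.T)        # A is skew-symmetric
assert np.allclose(A @ X, Z)       # AX recovers the tangent vector

# The two metrics on T_X St(p, n); the canonical one is the smaller.
euclid = np.trace(Z.T @ Z)
canon = np.trace(Z.T @ P @ Z)
assert euclid >= canon > 0
```

The final inequality reflects that \(P_X\) has eigenvalues 1 and \(\frac{1}{2}\), so the canonical metric discounts the component of a tangent vector lying in the span of X.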
According to equations (2.7) and (2.41) in [6], geodesics on the Stiefel manifold \(\text {St}(p,n)\) are characterized as
$$\begin{aligned} Y''(t)+Y(t)Y'(t)^\top Y'(t)=0 \end{aligned}$$(3)for the Euclidean metric, and
$$\begin{aligned} Y''(t)+Y'(t)Y'(t)^\top Y(t)+Y(t)\left( (Y(t)^\top Y'(t))^2+Y'(t)^\top Y'(t)\right) =0 \end{aligned}$$(4)for the canonical metric.
Optimization schemes on Riemannian manifolds have the following form
$$\begin{aligned} x_{k+1}=R_{x_k}(t_k\xi _k), \end{aligned}$$(5)
where the key constituent is the retraction R which (smoothly) maps tangent vectors \(\xi \in T_{x}{\mathcal {M}}\) to points on the manifold \({\mathcal {M}}\). A formal definition of retraction is given as follows.
Definition 1
[1] A retraction on a manifold \({\mathcal {M}}\) is a smooth mapping R from the tangent bundle \(T{\mathcal {M}}\) onto \({\mathcal {M}}\) with the following properties. Let \(R_x\) denote the restriction of R to \(\{x\}\times T_x {\mathcal {M}}\).
- 1. \(R_x(0)=x\), where 0 denotes the zero element of \(T_x{\mathcal {M}}\).
- 2. With the canonical identification \(T_{0}T_x{\mathcal {M}} \simeq T_x{\mathcal {M}}\), \(R_x\) satisfies
$$\begin{aligned} \text {D}R_x(0)=\mathrm{id}_{T_x{\mathcal {M}}}, \end{aligned}$$where \(\text {D}R_x(0)\) denotes the differential of \(R_x\) at 0 and \(\mathrm{id}_{T_x{\mathcal {M}}}\) the identity mapping on \(T_x {\mathcal {M}}\).
Definition 1 shows that the curve \(\gamma _{\xi }: t\mapsto R_x(t\xi )\) satisfies \(\gamma _{\xi }(0)=x\) and \(\gamma '_{\xi }(0)=\xi \), and is thus a first-order approximation to the geodesic starting at x with initial velocity \(\xi \).
Now we introduce three commonly used retractions on the Stiefel manifold. The QR decomposition retraction is given by \(R_X(Z)=\mathrm{qf}(X+Z)\), where qf denotes the Q-factor of the QR decomposition with its R-factor having strictly positive diagonal elements. The polar decomposition retraction is given by \(R_X(Z)=(X+Z)(I+Z^\top Z)^{-1/2}\). The Cayley transformation retraction is given by
$$\begin{aligned} R_X(Z)=\left( I-\frac{1}{2}A\right) ^{-1}\left( I+\frac{1}{2}A\right) X, \end{aligned}$$(6)
where A is defined by (2). It is shown in [22] that retraction (6) induces a nonexpansive and an isometric vector transport, which are important tools in Riemannian conjugate gradient methods [17,18,19, 22] and Riemannian quasi-Newton methods [11,12,13].
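For concreteness, the three retractions can be sketched in NumPy and checked against the two properties of Definition 1, the second one by finite differences. The Cayley variant below assumes the skew-symmetric matrix \(A=P_XZX^\top -XZ^\top P_X\) from (2); helper names are ours:

```python
import numpy as np

def qf_retraction(X, Z):
    # Q-factor, with a sign fix so the R-factor has positive diagonal.
    Q, R = np.linalg.qr(X + Z)
    s = np.sign(np.diag(R)); s[s == 0] = 1.0
    return Q * s

def polar_retraction(X, Z):
    # (X + Z)(I + Z^T Z)^{-1/2} via an eigendecomposition of I + Z^T Z.
    w, V = np.linalg.eigh(np.eye(Z.shape[1]) + Z.T @ Z)
    return (X + Z) @ V @ np.diag(w ** -0.5) @ V.T

def cayley_retraction(X, Z):
    # (I - A/2)^{-1}(I + A/2) X with the assumed formula (2) for A.
    n = X.shape[0]
    P = np.eye(n) - 0.5 * X @ X.T
    A = P @ Z @ X.T - X @ Z.T @ P
    return np.linalg.solve(np.eye(n) - A / 2, (np.eye(n) + A / 2) @ X)

rng = np.random.default_rng(1)
n, p = 7, 3
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
W = rng.standard_normal((n, p)); XtW = X.T @ W
Z = W - X @ (XtW + XtW.T) / 2      # random tangent vector at X

for R in (qf_retraction, polar_retraction, cayley_retraction):
    Y = R(X, Z)
    assert np.allclose(Y.T @ Y, np.eye(p))   # lands on St(p, n)
    assert np.allclose(R(X, 0 * Z), X)       # R_X(0) = X
    t = 1e-6                                 # DR_X(0) = id, by finite diff
    assert np.allclose((R(X, t * Z) - X) / t, Z, atol=1e-4)
```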
3 Properties of matrix exponentials
Consider the matrix exponential mapping
$$\begin{aligned} e_A: T_X\mathrm{St}(p,n)\rightarrow \mathrm{St}(p,n),~~Z\mapsto e^AX, \end{aligned}$$(7)
where A is defined by (2), and its corresponding curve \(Y(t)=e^{tA}X\). Do not confuse (7) with the Riemannian exponential because they are not always the same. The following theorem is the main result of this section.
Theorem 1
If A is an n-by-n skew-symmetric matrix, then the following statements are true.
- (i) Under the Euclidean metric, \(Y(t)=e^{tA}X\) is a geodesic on \(O_n\) and a curve with constant length of tangent vectors on \(\mathrm{St}(p,n)\). (Note that a geodesic always has constant length of tangent vectors, but not vice versa.)
- (ii) Under the canonical metric, \(Y(t)=e^{tA}X\) is a geodesic on \(O_n\) and on \(\mathrm{St}(p,n)\) with \(p=n-1\).

In particular, if A is the matrix defined by (2), then the following statements are true.

- (iii) The matrix exponential mapping (7) is always a retraction on \(\mathrm{St}(p,n)\).
- (iv) Under the Euclidean metric, \(Y(t)=e^{tA}X\) is a geodesic on \(\mathrm{St}(p,n)\) if the tangent vector Z satisfies \(Z^\top X=0\).
- (v) Under the Euclidean metric, the matrix exponential mapping (7) is the same as the Riemannian exponential on \(O_n\) and \(S^{n-1}\).
- (vi) Under the canonical metric, the matrix exponential mapping (7) is the same as the Riemannian exponential on \(\mathrm{St}(p,n)\).
Proof
- (i) The skew-symmetry of A yields \((e^{tA}X)^\top e^{tA}X= X^\top e^{-tA} e^{tA} X=X^\top X=I\). Substituting
$$\begin{aligned} Y(t)=e^{tA}X,~~Y'(t)=Ae^{tA}X,~~Y''(t)=A^2e^{tA}X, \end{aligned}$$(8)into the left hand side of (3) and using the skew-symmetry of A, we obtain
$$\begin{aligned} Y''(t)+Y(t)Y'(t)^\top Y'(t)=e^{tA}(I-XX^\top )A^2X. \end{aligned}$$(9)If \(p=n\), then X is orthogonal and \(Y(t)=e^{tA}X\) solves the geodesic equation (3). Using the second equation in (8) and the skew-symmetry of A, we have
$$\begin{aligned} \langle Y'(t),Y'(t)\rangle _e=\langle Ae^{tA}X,Ae^{tA}X\rangle _e\equiv \mathrm{tr}(X^\top A^\top A X), \end{aligned}$$which means the curve \(Y(t)=e^{tA}X\) has constant length of tangent vectors.
- (ii) Substituting (8) into the left hand side of (4) and using the skew-symmetry of A, we have
$$\begin{aligned}&Y''(t)+Y'(t)Y'(t)^\top Y(t)+Y(t)\left( (Y(t)^\top Y'(t))^2+Y'(t)^\top Y'(t)\right) \nonumber \\&\quad =e^{tA}(I-XX^\top )A(I-XX^\top )AX. \end{aligned}$$(10)When \(p=n\), \(Y(t)=e^{tA}X\) solves the geodesic equation (4). If \(p=n-1\), we can expand X to an orthogonal matrix [X, q]. Then \((I-XX^\top )A(I-XX^\top )=qq^\top A qq^\top =0\) and \(Y(t)=e^{tA}X\) is still a solution to (4).
- (iii) With A defined by (2), where X is the foot of the tangent vector Z, \(e_A\) in (7) is obviously a smooth mapping. Using \(Y(0)=e^{0\cdot A}X=X\) and \(Y'(0)=Ae^{0\cdot A}X=AX=Z\), we see that \(e_A\) is a retraction on \(\mathrm{St}(p,n)\) by the explanation below Definition 1.
- (iv) If the tangent vector Z satisfies \(Z^\top X=0\), it follows from (2) that \(A=ZX^\top -XZ^\top \). Then
$$\begin{aligned} (I-XX^\top )A^2X=(I-XX^\top )AZ=-(I-XX^\top )XZ^\top Z=0, \end{aligned}$$which means by (9) that \(Y(t)=e^{tA}X\) is a solution to (3).
- (v) This conclusion follows immediately from statements (i), (iii), (iv) and the fact that the tangent space of the unit sphere reduces to \(T_x S^{n-1}=\{\left. z\in \mathbb {R}^n\right| x^\top z=0\}\).
- (vi) This conclusion has already been proved in [16], but to make this paper self-contained, we show it below by using the geodesic equation (4) directly. Applying (2), we have
$$\begin{aligned} (I-XX^\top )A(I-XX^\top )=(I-XX^\top )(ZX^\top -XZ^\top )(I-XX^\top )=0, \end{aligned}$$which means by (10) that \(Y(t)=e^{tA}X\) is a solution to (4). Then the result follows immediately. \(\square \)
The following example demonstrates that the mapping (7) with A defined by (2) is indeed not the same as the Riemannian exponential under the Euclidean metric.
Example 1
Let
Obviously, \(Z^\top X\ne 0\). According to (2), we have
Thus, according to (9) and the invertibility of \(e^{tA}\), \(Y(t)=e^{tA}X\) is incompatible with (3) for all t.
The following example demonstrates that the curve \(Y(t)=e^{tA}X\) is not a geodesic under the canonical metric for some skew-symmetric matrices A.
Example 2
Let
Then
Thus, according to (10) and the invertibility of \(e^{tA}\), \(Y(t)=e^{tA}X\) is incompatible with (4) for all t.
In general, computing matrix exponentials is not cheap. But for unit spheres, the curve \(y(t)=e^{tA}x\) with A defined by (2) is always a geodesic by Theorem 1 and thus has the following much simpler expression [1]
$$\begin{aligned} y(t)=x\cos (\Vert z\Vert _2 t)+\frac{z}{\Vert z\Vert _2}\sin (\Vert z\Vert _2 t). \end{aligned}$$(11)
Now we generalize this formula for any matrix A of the form \(A=ab^\top -ba^\top \) as follows.
Proposition 1
Let \(A\in \mathbb {R}^{n\times n} \) be a nonzero matrix of the form \(A=ab^\top -ba^\top \) and \(\theta =\sqrt{||a||_2^2||b||_2^2-(a^\top b)^2}\). Then
$$\begin{aligned} e^{tA}=I+\frac{\sin (\theta t)}{\theta }A+\frac{1-\cos (\theta t)}{\theta ^2}A^2, \end{aligned}$$(12)
and
where
and
Proof
Because A is nonzero, a and b are linearly independent. So we can find \(n-2\) vectors \(q_1,q_2,\ldots ,q_{n-2}\) such that the matrix \(Q=[a,b,q_1,\ldots ,q_{n-2}]\) is nonsingular. Then
Hence, A has eigenvalues \(\lambda _1=\theta \mathbf{i},~\lambda _2=-\theta \mathbf{i},~\lambda _3=\cdots =\lambda _n=0\), where \(\theta =\sqrt{||a||_2^2||b||_2^2-(a^\top b)^2}\).
Since A is skew-symmetric, A is also diagonalizable, and thus its minimal polynomial has no multiple roots. It then follows from Definition 1.4 in [9] that \(e^{tA}\) is equal to h(tA), where h is the unique polynomial that satisfies the interpolation conditions \(h(t\lambda _j)=e^{t\lambda _j},~~j=1,2,3\). Hence,
Then (12) follows immediately. Substituting \(A=ab^\top -ba^\top \) into (12), we also obtain (13)–(15). \(\square \)
Remark 1
It is interesting that the right hand side of (12) is a quadratic polynomial of A that resembles the second-order Taylor expansion of \(e^{tA}\) around \(t=0\).
Using the identity \(z=(I-xx^\top )z=z-xz^\top x\) for any tangent vector \(z\in T_xS^{n-1}\), it is easy to see that formulae (13)–(15) coincide with (11). However, Proposition 1 tells us more than a geodesic expression, because formulae (13)–(15) may generate many efficient schemes if we choose appropriate vectors a and b such that \(Ax=(ab^\top -ba^\top )x=z\) for a tangent vector z. For example, we can choose \(a=z+\tau x\) and \(b=x\), where \(\tau \) is a real number.
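Proposition 1 is easy to verify numerically. The sketch below assumes the Rodrigues-type reading of (12), \(e^{tA}=I+\frac{\sin (\theta t)}{\theta }A+\frac{1-\cos (\theta t)}{\theta ^2}A^2\), compares it with an eigendecomposition-based matrix exponential, and checks the sphere specialization with \(a=z\), \(b=x\) (i.e. \(\tau =0\)):

```python
import numpy as np

def expm_skew(A, t=1.0):
    # e^{tA} for skew-symmetric A via the Hermitian eigendecomposition of iA.
    w, U = np.linalg.eigh(1j * A)
    return (U @ np.diag(np.exp(-1j * t * w)) @ U.conj().T).real

rng = np.random.default_rng(2)
n = 6
a, b = rng.standard_normal(n), rng.standard_normal(n)
A = np.outer(a, b) - np.outer(b, a)
theta = np.sqrt((a @ a) * (b @ b) - (a @ b) ** 2)

def exp_rank2(t):
    # Our reading of formula (12): a quadratic polynomial in A.
    return (np.eye(n) + np.sin(theta * t) / theta * A
            + (1 - np.cos(theta * t)) / theta ** 2 * (A @ A))

for t in (0.3, 1.0, 2.7):
    assert np.allclose(exp_rank2(t), expm_skew(A, t))

# Sphere specialization: a = z, b = x with x^T z = 0 recovers the
# geodesic y = x cos(||z||) + (z / ||z||) sin(||z||) at t = 1.
x = np.eye(n)[:, 0]
z = rng.standard_normal(n); z -= (x @ z) * x
th = np.linalg.norm(z)
y = expm_skew(np.outer(z, x) - np.outer(x, z)) @ x
assert np.allclose(y, x * np.cos(th) + z * np.sin(th) / th)
```

The verification rests on the minimal-polynomial identity \(A^3=-\theta ^2A\) for rank-two skew-symmetric A, which collapses the exponential series to the quadratic above.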
4 Properties of approximations to matrix exponentials
Motivated by the results in the previous section, we now investigate the relation between approximations to matrix exponentials and retractions on the Stiefel manifold. Padé approximation is a powerful tool for computing functions of matrices, which is implemented in MATLAB’s function expm for efficiently computing matrix exponentials [8, 9, 15].
Diagonal Padé approximants \(r_{m}(x)=p_{m}(x)/q_{m}(x)\) to the exponential function are known explicitly as [3]
$$\begin{aligned} p_m(x)=\sum _{j=0}^m\frac{(2m-j)!\,m!}{(2m)!\,j!\,(m-j)!}x^j,~~~q_m(x)=p_m(-x). \end{aligned}$$(16)
Replacing x with A, one can obtain the Padé approximants \(r_{m}(A)=q_{m}(A)^{-1}p_{m}(A)\) to the matrix exponential \(e^A\).
Lemma 1
For all skew-symmetric matrices A, \(p_m(A)\) and \(q_m(A)\) are nonsingular, and thus \(r_m(A)\) is well defined.
Proof
Let A be a skew-symmetric matrix. Then its eigenvalues are either zero or purely imaginary. Thus, owing to \(p_m(0)=1\) and \(q_m(0)=1\), \(p_m(A)\) or \(q_m(A)\) is singular if and only if there exists a conjugate pair of purely imaginary eigenvalues \(\pm \mu \mathbf{i}\) of A satisfying \(p_m(\pm \mu \mathbf{i})=0\) or \(q_m(\pm \mu \mathbf{i})=0\). It follows from (16) that diagonal Padé approximants are symmetric, i.e., \(p_m(-x)=q_m(x)\). Then \(q_m(\pm \mu \mathbf{i})=0\) if and only if \(p_m(\pm \mu \mathbf{i})=0\). Therefore the singularity of either \(p_m(A)\) or \(q_m(A)\) would contradict the fact that \(p_m\) and \(q_m\) are coprime [3]. \(\square \)
Lemma 2
Let \(A\in \mathbb {R}^{n\times n}\) be any skew-symmetric matrix. Then the curve \(Y(t)=r_m(tA)X\) satisfies \(Y(t)^\top Y(t)=X^\top X\), \(Y(0)=X\), and \(Y'(0)=AX\).
Proof
Using the skew-symmetry of A and the symmetry of diagonal Padé approximants, we have
Since \(r_m(0\cdot A)=I\), we have \(Y(0)=X\). Differentiating both sides of \(q_m(tA)Y(t)=p_m(tA)X\) with respect to t, we have
Substituting \(t=0\) into this equality and using (16), we have \(-\frac{A}{2}X+Y'(0)=\frac{A}{2}X\), which yields \(Y'(0)=AX\) immediately. \(\square \)
Lemmas 1 and 2 imply the following result.
Theorem 2
Let A be the skew-symmetric matrix defined by (2) for any tangent vector \(Z\in T_{X}\mathrm{St}(p,n)\). Then the curve \(Y(t)=r_m(tA)X\) defines a retraction \(R: T\mathrm{St}(p,n) \rightarrow \mathrm{St}(p,n)\) which assigns to each pair (X, Z) where \(Z\in T_X \mathrm{St}(p,n)\) a point \(r_m(A)X \in \mathrm{St}(p,n)\).
Theorem 2 yields the Padé-form update scheme
$$\begin{aligned} Y(t)=r_m(tA)X, \end{aligned}$$(17)
where A is defined by (2). It is easy to see that (17) covers the Cayley transformation (6) as the special case \(m=1\). In fact, there is a well-known error expression for \(r_m(tA)\) (see equation (10.25) in [9] for example)
$$\begin{aligned} e^{tA}-r_m(tA)=(-1)^{m}\frac{(m!)^2}{(2m)!\,(2m+1)!}(tA)^{2m+1}+O(t^{2m+2}), \end{aligned}$$which means that \(r_m(tA)\) is a 2mth-order approximation to \(e^{tA}\) around zero. Therefore the mapping \(R: (X,Z)\mapsto r_m(A)X\) is a high-order retraction [2, 7] under the canonical metric in general, and under the Euclidean metric in some special cases, according to Theorem 1. The projected polynomial method proposed very recently by Gawlik and Leok [7] is also connected with Padé approximation, but our method and theirs differ in the tangent representation, the exponential representation, and therefore in the final approximants to the exponential.
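A small NumPy sketch of the Padé-form scheme (17): it builds \(p_m\) and \(q_m\) from the explicit coefficients \(c_j=\frac{(2m-j)!\,m!}{(2m)!\,j!\,(m-j)!}\) of (16), confirms feasibility for several m (Lemma 2), and observes the \(O(t^{2m+1})\) error decay for \(m=2\); the matrix A is again built from our assumed reading of formula (2):

```python
import numpy as np
from math import factorial

def pade_pq(m, M):
    # p_m(M) and q_m(M) = p_m(-M) from the coefficients in (16).
    c = [factorial(2*m - j) * factorial(m)
         / (factorial(2*m) * factorial(j) * factorial(m - j))
         for j in range(m + 1)]
    P = sum(cj * np.linalg.matrix_power(M, j) for j, cj in enumerate(c))
    Q = sum(cj * np.linalg.matrix_power(-M, j) for j, cj in enumerate(c))
    return P, Q

def pade_retract(X, A, m, t=1.0):
    P, Q = pade_pq(m, t * A)
    return np.linalg.solve(Q, P @ X)        # r_m(tA) X

def expm_skew(A):
    w, U = np.linalg.eigh(1j * A)           # iA is Hermitian for skew A
    return (U @ np.diag(np.exp(-1j * w)) @ U.conj().T).real

rng = np.random.default_rng(3)
n, p = 6, 2
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
W = rng.standard_normal((n, p)); XtW = X.T @ W
Z = W - X @ (XtW + XtW.T) / 2
Pm = np.eye(n) - 0.5 * X @ X.T
A = Pm @ Z @ X.T - X @ Z.T @ Pm             # assumed formula (2)

for m in (1, 2, 3):
    Y = pade_retract(X, A, m)
    assert np.allclose(Y.T @ Y, np.eye(p))  # feasible for every m

# 2m-th order: for m = 2 the error behaves like t^5, so halving t
# should shrink it by roughly 2^5 = 32.
err = lambda t: np.linalg.norm(pade_retract(X, A, 2, t) - expm_skew(t * A) @ X)
ratio = err(0.1) / err(0.05)
assert 20 < ratio < 48
```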
Corollary 1
Let \({\overline{p}}_m(x)\) be a polynomial of degree m having the form \({\overline{p}}_m(x)=1+\frac{1}{2}x+O(x^2)\) and \({\overline{r}}_m(x)=\frac{{\overline{p}}_m(x)}{{\overline{p}}_m(-x)}\). Suppose that \({\overline{p}}_m(x)\) has no pure imaginary roots. Then the mapping \(\overline{R}: T\mathrm{St}(p,n) \rightarrow \mathrm{St}(p,n)\) which assigns to each pair (X, Z) where \(Z\in T_X \mathrm{St}(p,n)\) a point \({\overline{r}}_m(A)X \in \mathrm{St}(p,n)\) with A defined by (2) is a retraction.
Remark 2
The quasi-geodesic flow \(Y(t)=\left( I-\frac{t}{2m}A\right) ^{-m}\left( I+\frac{t}{2m}A\right) ^mX\) proposed in [16] also belongs to the category in Corollary 1.
Now, we consider efficient ways of computing (17) in large-scale problems. For low-rank problems, i.e., \(p\ll n\), the Cayley transformation (6) can be rewritten as
where \(U=[X,P_{X}Z]\) and \(V=[P_{X}Z,-X]\), according to [20]. This reduces the computational complexity to \(O(np^2)\). But in general, (17) does not have such an easy-to-compute expression.
In what follows, we introduce a heuristic approach to tackling high-rank problems, based on Krylov subspace methods. The lth Krylov subspace of a nonzero matrix \(A\in \mathbb {R}^{n\times n}\) and a nonzero vector \(b\in \mathbb {R}^n\) is defined by
$$\begin{aligned} K_l(A,b)=\mathrm{span}\left\{ b,Ab,\ldots ,A^{l-1}b\right\} . \end{aligned}$$
For any skew-symmetric matrix \(A\in \mathbb {R}^{n\times n}\), the Lanczos process generates the relation
$$\begin{aligned} AQ_l=Q_lT_l+\beta _lq_{l+1}e_l^\top , \end{aligned}$$(18)where \(Q_l=[q_1,\ldots ,q_l]\) forms an orthonormal basis of \(K_l(A,q_1)\), \(q_{l+1}\) is a unit vector orthogonal to \(Q_l\), \(\beta _l\) is a real number, \(T_l\) is an \(l\times l\) irreducible skew-symmetric tridiagonal matrix, and \(e_l\) is the lth column of the identity matrix. It follows from (18) that \(Q_l^\top AQ_l=T_l\).
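A minimal Lanczos sketch for skew-symmetric matrices, with full reorthogonalization for numerical robustness (a design choice of ours, not specified by the paper); it checks that \(Q_l\) is orthonormal and that \(T_l=Q_l^\top AQ_l\) is skew-symmetric and tridiagonal:

```python
import numpy as np

def skew_lanczos(A, q1, l):
    """Build an orthonormal basis Q_l of K_l(A, q1) and T_l = Q_l^T A Q_l
    for skew-symmetric A, using full reorthogonalization."""
    n = A.shape[0]
    Q = np.zeros((n, l))
    Q[:, 0] = q1 / np.linalg.norm(q1)
    for j in range(l - 1):
        r = A @ Q[:, j]
        # Two Gram-Schmidt passes against all previous vectors.
        r -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ r)
        r -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ r)
        Q[:, j + 1] = r / np.linalg.norm(r)
    return Q, Q.T @ A @ Q

rng = np.random.default_rng(4)
n, l = 50, 6
M = rng.standard_normal((n, n))
A = M - M.T                       # skew-symmetric test matrix
Q, T = skew_lanczos(A, rng.standard_normal(n), l)

assert np.allclose(Q.T @ Q, np.eye(l))   # orthonormal basis
assert np.allclose(T, -T.T)              # T_l skew-symmetric, zero diagonal
# Tridiagonal: entries beyond the first off-diagonals vanish.
assert np.allclose(np.triu(T, 2), 0) and np.allclose(np.tril(T, -2), 0)
```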
Our subspace variation of the update scheme (17) is
$$\begin{aligned} Y(t)=r_m\left( tQ_lT_lQ_l^\top \right) X. \end{aligned}$$(19)When \(m=1\), (19) reduces to
$$\begin{aligned} Y(t)=\left( I-\frac{t}{2}Q_lT_lQ_l^\top \right) ^{-1}\left( I+\frac{t}{2}Q_lT_lQ_l^\top \right) X. \end{aligned}$$(20)
The integer l is preferably chosen by users rather than fixed by a certain rule. For example, l can be increased adaptively, with an upper bound, from iteration to iteration, similarly to some subspace optimization methods such as [21]. We do not specify a concrete strategy for choosing l in this paper, since doing so requires cogent experimental verification in the future.
Defining
$$\begin{aligned} {\widetilde{Z}}=Q_lT_lQ_l^\top X, \end{aligned}$$(21)which is obviously a tangent vector at X, one can show as in Lemma 2 that (19) satisfies
$$\begin{aligned} Y(t)^\top Y(t)=X^\top X,~~Y(0)=X,~~Y'(0)={\widetilde{Z}}. \end{aligned}$$
This means that (19) is a curve on \(\mathrm{St}(p,n)\) that passes through X with velocity \({\widetilde{Z}}\). Of course, this first-order geometric property involves \({\widetilde{Z}}\) rather than the original tangent vector Z, so (19) is no longer a retraction, which does affect the global convergence of Riemannian optimization algorithms in theory. But in practice, a distinct advantage of (19) over (17) is its lower computational complexity, owing to the low rank of \({\widetilde{Z}}\) in (21).
Proposition 2
The curve (19) is equivalent to
$$\begin{aligned} Y(t)=X+Q_l\left( r_m(tT_l)-I\right) Q_l^\top X. \end{aligned}$$(22)In particular, (20) is equivalent to
$$\begin{aligned} Y(t)=X+tQ_lT_l\left( I-\frac{t}{2}T_l\right) ^{-1}Q_l^\top X. \end{aligned}$$(23)
Proof
Let \({\widetilde{p}}_m(x)=p_m(x)-1\) and \({\widetilde{q}}_m(x)=q_m(x)-1\), where \(p_m\) and \(q_m\) are both defined by (16). Then
$$\begin{aligned} r_m\left( tQ_{l}T_{l}Q_{l}^\top \right) =\left( I+Q_{l}{\widetilde{q}}_m(tT_{l})Q_{l}^\top \right) ^{-1}\left( I+Q_{l}{\widetilde{p}}_m(tT_{l})Q_{l}^\top \right) . \end{aligned}$$Defining \(U=Q_{l}{\widetilde{q}}_m(tT_{l})\), \(V=Q_{l}\), and \(W=Q_{l}{\widetilde{p}}_m(tT_{l})\) and applying the Sherman–Morrison–Woodbury formula, we obtain
$$\begin{aligned} Y(t)=\left( I-U\left( I+V^\top U\right) ^{-1}V^\top \right) \left( I+WV^\top \right) X=X+Q_{l}\left( r_m(tT_{l})-I\right) Q_{l}^\top X, \end{aligned}$$which is (22). If \(m=1\), then \(r_1(tT_{l})=(I-\frac{t}{2}T_{l})^{-1}(I+\frac{t}{2}T_{l})\) and thus
$$\begin{aligned} r_1(tT_{l})-I=tT_{l}\left( I-\frac{t}{2}T_{l}\right) ^{-1}, \end{aligned}$$which gives (23).
So the proof is complete. \(\square \)
Remark 3
Proposition 2 implies the displacement \(Y(t)-X\) lies in the subspace \(\{D\in \mathbb {R}^{n\times p}~|~D=Q_{l}U,~U\in \mathbb {R}^{l\times p}\}\).
Now we analyze the complexity of (22) and (23). The Lanczos process costs \(2n^2l\) flops. Computing \(M_1=Q_{l}^\top X\) costs \(2n^2l\) flops. Computing \(M_2=\left( I-\frac{t}{2}T_{l}\right) ^{-1}M_1\) costs 4nl flops. The final assembly of Y(t) needs \(2n^2l\) flops. The overall computational cost of (23) is thus \(6n^2l\) flops, and updating Y(t) in (23) for a different t costs \(2n^2l\) flops. A similar argument shows that the overall cost of (22), and its update cost for a different t, are the same as those of (23). The most commonly used qf retraction through the QR decomposition requires \(2np^2\) flops. Obviously, if \(p\lessapprox n\) and \(l\ll n\), then \(3nl\ll p^2\), and therefore (22) and (23) are much cheaper than the qf retraction.
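The cost saving is easy to see in code. Under our reading of Proposition 2, (19) collapses to the low-rank update \(Y(t)=X+Q_l\left( r_m(tT_l)-I\right) Q_l^\top X\), which only requires \(l\times l\) solves. The sketch below, with a random orthonormal \(Q_l\) standing in for a Lanczos basis, confirms that the full and low-cost forms agree and that feasibility is preserved:

```python
import numpy as np
from math import factorial

def rm(M, m, t):
    # Diagonal Pade approximant r_m(tM); m = 1 is the Cayley transform.
    c = [factorial(2*m - j) * factorial(m)
         / (factorial(2*m) * factorial(j) * factorial(m - j))
         for j in range(m + 1)]
    num = sum(cj * np.linalg.matrix_power(t * M, j) for j, cj in enumerate(c))
    den = sum(cj * np.linalg.matrix_power(-t * M, j) for j, cj in enumerate(c))
    return np.linalg.solve(den, num)

rng = np.random.default_rng(5)
n, p, l, t = 40, 35, 5, 0.7
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
M = rng.standard_normal((n, n)); A = M - M.T
Ql, _ = np.linalg.qr(rng.standard_normal((n, l)))   # stand-in for a Lanczos basis
Tl = Ql.T @ A @ Ql                                  # skew-symmetric l x l

# Full n x n evaluation of (19) ...
Y_full = rm(Ql @ Tl @ Ql.T, 2, t) @ X
# ... versus the low-cost form: only l x l matrices are inverted.
Y_low = X + Ql @ ((rm(Tl, 2, t) - np.eye(l)) @ (Ql.T @ X))

assert np.allclose(Y_full, Y_low)
assert np.allclose(Y_low.T @ Y_low, np.eye(p))      # still on St(p, n)
```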
Although (19) has a low-complexity form (22), we still have to check whether (19) keeps the descent property. Consider gradient descent methods on the orthogonal group \(O_n\), which means the original tangent vector Z is the negative Riemannian gradient \(-\nabla f(X)\). Of course, the Riemannian gradient differs according to the metric. Under the Euclidean metric, \(\nabla _e f(X)=(I-XX^\top )G+X\mathrm{skew}(X^\top G)\), and under the canonical metric, \(\nabla _c f(X)=G-XG^\top X\), where G is the ordinary gradient of f and \(\mathrm{skew}(M):=\frac{1}{2}(M-M^\top )\). Then we have the following result.
Proposition 3
Consider the case where the manifold is \(O_n\) and Z is chosen as \(-\nabla f(X)\). If X is a noncritical point and \(q_1\) is not an eigenvector of A corresponding to the zero eigenvalue, then (19) is a descent curve.
Proof
First note that the hypothesis implies that the matrix A in (2) is nonzero. Since \(Aq_1\ne 0\) and A has no nonzero real eigenvalues, \(q_1\) and \(Aq_1\) are linearly independent over \(\mathbb {R}\) and therefore \(T_l\) is nonzero. Then differentiating f(Y(t)) with respect to t at \(t=0\), we obtain
$$\begin{aligned} \left. \frac{\mathrm{d}}{\mathrm{d}t}f(Y(t))\right| _{t=0}=-\mathrm{tr}\left( X^\top A^\top Q_lT_lQ_l^\top X\right) =-\mathrm{tr}\left( T_l^\top T_l\right) <0, \end{aligned}$$
which means Y(t) is indeed a descent curve. \(\square \)
Next we give an informal discussion of the descent property of (19) under the Euclidean metric for the case that p is less than but very close to n, i.e., \(p\lessapprox n\). Let \(B=A^\top Q_lT_lQ_l^\top \). According to the proof of Proposition 3, we have
$$\begin{aligned} \left. \frac{\mathrm{d}}{\mathrm{d}t}f(Y(t))\right| _{t=0}=-\mathrm{tr}\left( X^\top BX\right) . \end{aligned}$$(24)
Suppose \(\lambda _1\le \cdots \le \lambda _n\) are the eigenvalues of \(\mathrm{sym}(B)\), where \(\mathrm{sym}(B):=\frac{1}{2}(B+B^\top )\). Then \(\mathrm{tr} (X^\top B X)= \mathrm{tr}(X^\top \mathrm{sym}(B)X)\ge \sum \nolimits _{i=1}^p\lambda _i\) and \(\mathrm{tr}(B)=\mathrm{tr}(\mathrm{sym}(B))=\sum \nolimits _{i=1}^n\lambda _i\). Since B shares the same nonzero eigenvalues with \(Q_l^\top A^\top Q_lT_l=T_l^\top T_l\), the eigenvalues of B are nonnegative real numbers and \(\mathrm{tr}(B)>0\). It follows from Theorem 4.3.50 in [10] that the sequence of eigenvalues \((\lambda _1,\ldots ,\lambda _n)\) of \(\mathrm{sym}(B)\) majorizes the sequence of eigenvalues of B. So, if p is sufficiently close to n and the \(n-p\) largest eigenvalues \(\lambda _{p+1},\ldots ,\lambda _n\) of \(\mathrm{sym}(B)\) are not too large, we probably have \(\mathrm{tr} (X^\top B X)>0\), which implies \(\left. \frac{\mathrm{d}}{\mathrm{d}t}f(Y(t))\right| _{t=0}<0\) by (24).
Finally, we give a simple example to show the potential benefit of using (22). Consider the minimization problem
$$\begin{aligned} \min _{X\in \mathrm{St}(995,1000)}~\frac{1}{2}\mathrm{tr}\left( X^\top DX\right) , \end{aligned}$$(25)
where \(D=\mathrm{diag}(1,2,\ldots ,1000)\). It is easy to see that the optimal value is 247,755 and that the optimal solutions are the matrices X of the form \(X=[Q,0]^\top \), where Q is any \(995\times 995\) orthogonal matrix. In this problem, the objective function is \(f(X)=\frac{1}{2}\mathrm{tr}(X^\top DX)\). The corresponding Riemannian gradient is \(\nabla _{e/c} f(X)=(I-XX^\top )DX\), and the matrix \(A_{e/c}\) defined in (2) with \(Z=-\nabla _{e/c} f(X)\) is \(A_{e/c}=XX^\top D-DXX^\top \). The algorithms we adopted for solving (25) were two basic gradient descent methods that use the QR decomposition for the retraction and (22), respectively. They are formally described in the following two frames. Notice that in line 7 of the algorithms, the initial step size for \(k\ge 1\) is chosen to be the BB step size [4]
$$\begin{aligned} \tau _{k}=\frac{\mathrm{tr}\left( S_{k-1}^\top S_{k-1}\right) }{\left| \mathrm{tr}\left( S_{k-1}^\top W_{k-1}\right) \right| }, \end{aligned}$$(26)where \(S_k=X_{k+1}-X_k\) and \(W_k=\nabla f(X_{k+1})-\nabla f(X_k)\). In the beginning, we simply fixed a small initial step size, but one of the referees suggested we use the BB step size, because a conservative constant initial step size penalizes the qf retraction more than the one based on the Krylov subspace. Numerical results showed that the BB step size (26) accelerated both algorithms remarkably in comparison with constant step sizes.
Our experiments were conducted in MATLAB R2015a on a laptop with an Intel Core i5-6300U CPU @ 2.40 GHz and 8.00 GB of RAM. The starting matrix \(X_0\) was generated randomly by the command
Parameters were chosen as \(\epsilon =10^{-4}\), \(\rho =10^{-8}\), \(\delta =0.5\), \(\tau =1\). For Algorithm 2, the integers m and l were set as \(m=2\) and \(l=10\), and \(q_1=Q_l(:,1)\) was set as \(q_1=\frac{1}{\sqrt{1000}}(1,\ldots ,1)^\top \). The Euclidean metric was used for both algorithms. Numerical results are shown in Table 1, where the optimal value, number of iterations, number of function evaluations, number of gradient evaluations and running time are reported. We can see that Algorithm 2 ran faster than Algorithm 1, although Algorithm 2 needed 13 more iterations. Problem (25) combines two features: the variable matrix has a very high rank, and the objective function and its gradient are easy to compute. These features make the time spent computing retractions occupy a big part of the whole running time of the algorithm. The results in Table 1 suggest that our Krylov subspace technique is potentially useful for problems like (25).
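The experiment is straightforward to imitate at a smaller scale. The sketch below runs plain gradient descent with the qf retraction and a fixed step size on a scaled-down analogue of (25) (our own sizes \(n=30\), \(p=25\); no BB rule, no Algorithm 2) and checks that it reaches the known optimal value \(\frac{1}{2}\sum _{i=1}^{p}i\):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 30, 25                     # scaled-down stand-ins for n=1000, p=995
d = np.arange(1.0, n + 1)         # D = diag(1, ..., n)
f = lambda X: 0.5 * np.einsum('ij,i,ij->', X, d, X)
optimal = 0.5 * d[:p].sum()       # attained on the p smallest diagonal entries

def grad(X):
    # Riemannian gradient (I - X X^T) D X, as stated for problem (25).
    DX = d[:, None] * X
    return DX - X @ (X.T @ DX)

def qf(M):
    # Q-factor with a sign fix so the R-factor has positive diagonal.
    Q, R = np.linalg.qr(M)
    s = np.sign(np.diag(R)); s[s == 0] = 1.0
    return Q * s

X = qf(rng.standard_normal((n, p)))
t = 1e-2                          # fixed step, well below the stability limit
for _ in range(2000):
    X = qf(X - t * grad(X))       # gradient step + qf retraction

assert np.allclose(X.T @ X, np.eye(p))
assert f(X) - optimal < 1e-2      # close to the known optimal value
```

Swapping `qf` for the Krylov-based update (22) reproduces the comparison in Table 1 in spirit, though at this small size the timing advantage is not visible.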
5 Conclusions
In this paper, we investigate the geometric properties of matrix exponentials and their approximations related to optimization on the Stiefel manifold. We clarify the relation between matrix exponentials and geodesics or retractions under various conditions. Diagonal Padé approximation to matrix exponentials is found to be very useful for constructing a new class of retractions, which covers the Cayley transformation as a special case. We also propose a Krylov subspace technique to reduce the cost of the new retractions for the Stiefel manifold with a very high rank, and prove its descent property when implemented in gradient methods on the orthogonal group. A simple example shows the potential benefit of this Krylov subspace technique.
References
Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ (2008)
Absil, P.-A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22, 135–158 (2012)
Baker Jr., G.A.: Essentials of Padé Approximants. Academic Press, London (1975)
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8, 141–148 (1988)
do Carmo, M.P.: Riemannian Geometry. Birkhäuser Boston Inc., Boston, MA (1992). (Translated from the second Portuguese edition by Francis Flaherty. Mathematics: Theory & Applications)
Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20, 303–353 (1999)
Gawlik, E.S., Leok, M.: High-order retractions on matrix manifolds using projected polynomials. SIAM J. Matrix Anal. Appl. 39, 801–828 (2018)
Higham, N.J.: The scaling and squaring method for the matrix exponential revisited. SIAM J. Matrix Anal. Appl. 26, 1179–1193 (2005)
Higham, N.J.: Functions of Matrices: Theory and Computation. SIAM, Philadelphia, PA (2008)
Horn, R.A., Johnson, C.R.: Matrix Analysis, 2nd edn. Cambridge University Press, New York, NY (2013)
Huang, W., Gallivan, K.A., Absil, P.-A.: A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM J. Optim. 25, 1660–1685 (2015)
Huang, W., Absil, P.-A., Gallivan, K.A.: Intrinsic representation of tangent vectors and vector transports on matrix manifolds. Numer. Math. 136, 523–543 (2017)
Huang, W., Absil, P.-A., Gallivan, K.A.: A Riemannian BFGS method without differentiated retraction for nonconvex optimization problems. SIAM J. Optim. 28, 470–495 (2018)
Jiang, B., Dai, Y.: A framework of constraint preserving update schemes for optimization on Stiefel manifold. Math. Program. 153, 535–575 (2015)
Moler, C.B., Van Loan, C.F.: Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Rev. 45, 3–49 (2003)
Nishimori, Y., Akaho, S.: Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing 67, 106–135 (2005)
Ring, W., Wirth, B.: Optimization methods on Riemannian manifolds and their application to shape space. SIAM J. Optim. 22, 596–627 (2012)
Sato, H.: A Dai–Yuan-type Riemannian conjugate gradient method with the weak Wolfe conditions. Comput. Optim. Appl. 64, 101–118 (2016)
Sato, H., Iwai, T.: A new, globally convergent Riemannian conjugate gradient method. Optimization 64, 1011–1031 (2015)
Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. 142, 397–434 (2013)
Yuan, Y.: Subspace methods for large scale nonlinear equations and nonlinear least squares. Optim. Eng. 10, 207–218 (2009)
Zhu, X.: A Riemannian conjugate gradient method for optimization on the Stiefel manifold. Comput. Optim. Appl. 67, 73–110 (2017)
Acknowledgements
The authors are very grateful to the two anonymous referees for their valuable suggestions and comments on this paper. The first author is supported by National Natural Science Foundation of China (Nos. 11601317 and 11526135). The second author is supported by National Natural Science Foundation of China (No. 71701153) and the International Postdoctoral Exchange Fellowship Program (No. 20160087).
Zhu, X., Duan, C. On matrix exponentials and their approximations related to optimization on the Stiefel manifold. Optim Lett 13, 1069–1083 (2019). https://doi.org/10.1007/s11590-018-1341-z