
1 Introduction

In this paper we consider the following optimization problem with orthogonality constraints:

$$\begin{aligned} \min _{X\in \mathbb {R}^{n\times p}}\mathcal {F}(X) \quad \text {s.t.} \quad X^{\top }X = I_p, \end{aligned}$$
(1)

where \({\mathcal {F}}: \mathbb {R}^{n\times p}\rightarrow \mathbb {R}\) is a differentiable function and \(I_p\in \mathbb {R}^{p\times p}\) denotes the identity matrix. The feasible set \( Stf (n,p) :=\{ X\in \mathbb {R}^{n\times p} \,|\, X^{\top }X=I_p \} \) is known as the "Stiefel manifold". This manifold reduces to the unit sphere when \(p=1\), and in the case \(p=n\) it is called the "orthogonal group". The Stiefel manifold can be seen as an embedded submanifold of \(\mathbb {R}^{n\times p}\) of dimension \(np- \frac{1}{2}p(p+1)\), see [1].

Problem (1) has many applications, such as the linear eigenvalue problem [14], sparse principal component analysis [4], Kohn-Sham total energy minimization [16], the orthogonal Procrustes problem [5], the weighted orthogonal Procrustes problem [6], the nearest low-rank correlation matrix problem [7, 12], and joint diagonalization (blind source separation) [8], among others. In addition, problems such as PCA, LDA, multidimensional scaling, and orthogonal neighborhood preserving projection can be formulated as problem (1) [9].

On the other hand, the Stiefel manifold is a compact set, which ensures that (1) has at least one global optimum. However, this manifold is not a convex set, which makes (1) a hard optimization problem. For example, the quadratic assignment problem (QAP) and the leakage interference minimization problem are NP-hard [10].

In this paper we propose a new method based on projections onto the Stiefel manifold. In particular, we study two algorithms to solve problem (1). At each iteration of the algorithms, we project the corresponding update onto the Stiefel manifold using the singular value decomposition (SVD), which guarantees a feasible sequence. Although the SVD is computationally expensive, it is less expensive than computing a geodesic. In the literature we can find other feasible methods that solve problem (1); for example, those based on retractions use projections that involve the QR factorization, the polar decomposition, the Gram-Schmidt process, or the SVD [1].

This paper is organized as follows. In Subsect. 2.1 we present some standard notation, and in Subsect. 2.2 we give the optimality conditions of problem (1). Subsect. 2.3 describes the proposed update scheme, where we present a monotone line-search algorithm and a globally convergent non-monotone algorithm for solving problem (1). Subsect. 2.4 shows different strategies to choose the step size: an Armijo-Wolfe-like condition, and a non-monotone search using the Barzilai-Borwein step size. Some theoretical results are presented in Sect. 3. Section 4 is dedicated to numerical experiments that demonstrate the efficiency and robustness of the proposed algorithms.

2 Algorithms

In the first two subsections, we introduce some standard notation and the optimality conditions of problem (1), respectively. The remaining subsections introduce our proposed method.

2.1 Notation

We say that a matrix \(W\in \mathbb {R}^{n\times n}\) is skew-symmetric if \(W = -W^{\top }\). The trace of X is defined as the sum of its diagonal elements and is denoted by Tr[X]. The Euclidean inner product of two matrices \(A,B\in \mathbb {R}^{m\times n}\) is defined as \(\langle A, B\rangle := \sum _{i,j}A_{i,j}B_{i,j} = Tr[A^{\top }B]\). The Frobenius norm is defined via this inner product, i.e., \(||A||_F = \sqrt{\langle A,A\rangle }\). Let \(\mathcal {F}:\mathbb {R}^{n\times p} \rightarrow \mathbb {R}\) be a differentiable function; the derivative of \(\mathcal {F}\) with respect to X is denoted by \( G := \mathcal {DF}(X):=\big ( \frac{\partial \mathcal {F}(X)}{\partial X_{ij}} \big )\), and the derivative of \(\mathcal {F}\) at X in the direction Z is defined as:

$$\begin{aligned} \mathcal {DF}(X)[Z] := \frac{\partial \mathcal {F}(X+t Z)}{\partial t} \Big |_{t = 0} = \lim _{t\rightarrow 0}\frac{\mathcal {F}(X+tZ) - \mathcal {F}(X)}{t} = \langle \mathcal {DF}(X),Z \rangle . \end{aligned}$$
(2)

2.2 Optimality Conditions

The Lagrangian function associated to the optimization problem (1) is given by:

$$\begin{aligned} \mathcal {L}(X,\varLambda )= \mathcal {F}(X) - \frac{1}{2} Tr[\,\varLambda (X^{\top }X-I_p)\,], \end{aligned}$$
(3)

where \(I_p\) is the identity matrix and \(\varLambda \) is the matrix of Lagrange multipliers, which is symmetric because \(X^{\top }X\) is symmetric. The Lagrangian function leads to the first-order optimality conditions for problem (1):

$$\begin{aligned} G - X\varLambda = 0 \end{aligned}$$
(4a)
$$\begin{aligned} X^{\top }X - I_p = 0. \end{aligned}$$
(4b)

Lemma 1

(cf. Wen and Yin [15]). Suppose that X is a local minimizer of problem (1). Then X satisfies the first-order optimality conditions (4a) and (4b) with the associated Lagrange multiplier \(\varLambda = G^{\top }X\). Define \(\nabla \mathcal {F}(X) :=G-XG^{\top }X\) and \(A:=GX^{\top }-XG^{\top }\); then \(\nabla \mathcal {F}(X) = AX\). Moreover, \(\nabla \mathcal {F}(X) = 0\) if and only if \(A=0\).

Proof

See [15].

Lemma 1 establishes an equivalence with conditions (4a) and (4b): if \(X\in Stf (n,p) \) satisfies \(\nabla \mathcal {F}(X) = 0\), then X also satisfies (4a) and (4b). We can therefore use this result as a stopping criterion for our algorithms.
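To make the stopping criterion concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper) using the test function \(\mathcal {F}(X) = Tr[X^{\top }MX]\) with M symmetric, for which \(\mathcal {DF}(X) = 2MX\):

```python
import numpy as np

def stationarity_residual(X, G):
    """||grad F(X)||_F with grad F(X) = G - X G^T X (Lemma 1), used as stopping test."""
    grad = G - X @ (G.T @ X)
    A = G @ X.T - X @ G.T                  # skew-symmetric matrix of Lemma 1
    assert np.allclose(grad, A @ X)        # Lemma 1: grad F(X) = A X on Stf(n, p)
    return np.linalg.norm(grad, 'fro')

# Example with F(X) = Tr[X^T M X], M symmetric, so that G = DF(X) = 2 M X.
n, p = 8, 3
M = np.random.randn(n, n); M = M + M.T
X, _ = np.linalg.qr(np.random.randn(n, p))  # feasible point: X^T X = I_p
print(stationarity_residual(X, 2 * M @ X))  # stop when this falls below epsilon
```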

2.3 Update Schemes

In this subsection we present an algorithm based on a linear combination of descent directions. Since the new iterate produced by our proposal does not necessarily belong to the Stiefel manifold, we use a projection operator in order to enforce feasibility of the new iterate. Specifically, we use the classical projection operator defined as \(\pi (X) := \text {arg} \min _{Q\in Stf (n,p)} ||X - Q||_F^{2}\). It is known that the solution of this problem is given by \(\pi (X) = UI_{n,p}V^{\top }\), where \(X = U\varSigma V^{\top }\) is the SVD of X; for a proof of this result see [11].
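A minimal NumPy sketch of this projection (the function name is ours); with the thin SVD, \(\pi (X) = UI_{n,p}V^{\top }\) reduces to \(UV^{\top }\), assuming X has full column rank:

```python
import numpy as np

def proj_stiefel(X):
    """pi(X): the point of Stf(n, p) closest to X, via the thin SVD X = U S V^T."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt                           # equals U I_{n,p} V^T for the full SVD

Y = np.random.randn(10, 4)
Q = proj_stiefel(Y)
print(np.linalg.norm(Q.T @ Q - np.eye(4)))  # ~ 1e-15: Q is feasible
```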

In our updating formula, we use the previous result to obtain a new point that satisfies the constraints of problem (1). Specifically, if \( Y_k(\tau )\) is obtained from our proposal, i.e., the linear combination scheme, then the new trial point is:

$$\begin{aligned} X_{k + 1} := Z_k(\tau ) := \pi (Y_k(\tau )). \end{aligned}$$
(5)

In the next subsections we explain in more detail our updating formula \(Y_k(\tau )\).

A Scheme Based on a Linear Combination. Our proposal uses the following update formula:

$$\begin{aligned} Y^{CL}_{k}(\tau ) := X_k - \tau \,(\, \lambda B_kL + \mu C_kR \,), \end{aligned}$$
(6)

where \(G_k = \mathcal {DF}(X_k)\), \(B_k = G_kL^{\top } - LG_k^{\top }\), \(C_k = G_kR^{\top } - RG_k^{\top }\), \(L,R\in \mathbb {R}^{n\times p}\), \(\tau \) is the step size and (\(\lambda \), \(\mu \)) are any two scalars satisfying:

$$\begin{aligned} \lambda ||B_k||_F^2 + \mu ||C_k||_F^2 \,>\,0. \end{aligned}$$

The following lemma shows that the curve \(Y^{CL}_{k}(\cdot )\) defined by Eq. (6) is a descent curve at \(\tau =0\).

Lemma 2

Let \(Y^{CL}_{k}(\tau )\) be defined by Eq. (6), then \(Y^{CL}_{k}(\tau )\) is a descent curve at \(\tau = 0\), i.e.,

$$\begin{aligned} \mathcal {D F}(X_k) [\dot{Y}^{CL}_k(0)] = -\frac{\lambda }{2}||B_k||_F^2 - \frac{\mu }{2}||C_k||_F^2 < 0. \end{aligned}$$
(7)

Proof

The proof is straightforward; it follows from trace properties and Eq. (2), as sketched below.
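For completeness, the key computation is the following: since \(\dot{Y}^{CL}_k(0) = -(\lambda B_kL + \mu C_kR)\),

$$\begin{aligned} \mathcal {DF}(X_k)[\dot{Y}^{CL}_k(0)] = -\lambda \, Tr[G_k^{\top }B_kL] - \mu \, Tr[G_k^{\top }C_kR], \end{aligned}$$

and the cyclic property of the trace gives \(Tr[G_k^{\top }B_kL] = Tr[G_k^{\top }G_kL^{\top }L] - Tr[(G_k^{\top }L)^2] = \tfrac{1}{2}||B_k||_F^2\) (and analogously for \(C_k\) and R), from which (7) follows whenever \(\lambda ||B_k||_F^2 + \mu ||C_k||_F^2 > 0\).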

Remark 1

Note that in the updating formula (6) we can select any matrices L and R; in particular, one can use matrices L, R with random entries. The parameters (\(\lambda ,\mu \)) can be selected appropriately, for example both positive. This ensures that the method descends and may eventually converge to a local minimum. In our implementation, we select \(L = X_k\), \(R = X_{k-1}\) and \((\lambda ,\mu ) = (2/3,1/3)\).
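The following NumPy sketch shows one trial point of the resulting scheme, i.e., formula (6) followed by the projection (5), with the choices of Remark 1; grad_F denotes a user-supplied routine returning \(\mathcal {DF}(X)\) and proj_stiefel is the projection sketched above (both names are ours):

```python
def linear_combination_step(Xk, Xk_prev, grad_F, tau, lam=2.0/3.0, mu=1.0/3.0):
    """One trial point: Y_k(tau) from Eq. (6), projected onto Stf(n, p) as in Eq. (5)."""
    Gk = grad_F(Xk)                          # G_k = DF(X_k)
    L, R = Xk, Xk_prev                       # choices suggested in Remark 1
    Bk = Gk @ L.T - L @ Gk.T
    Ck = Gk @ R.T - R @ Gk.T
    Yk = Xk - tau * (lam * Bk @ L + mu * Ck @ R)   # Eq. (6)
    return proj_stiefel(Yk)                  # Eq. (5): Z_k(tau) = pi(Y_k(tau))
```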

2.4 Strategies to Select the Step Size

From now on, \(Y_k(\tau )\) represents our proposal, i.e., the update based on the linear combination scheme.

A Descent Condition. In our method, we choose the largest step size \(\tau \) that satisfies the following condition:

$$\begin{aligned} \mathcal {F}(Z_k(\tau )) \le \mathcal {F}(X_k) + \sigma \tau \, Tr[G_k^{\top }\dot{Y}_k(0)], \end{aligned}$$
(8)

with \(0<\sigma <1.\)

Note that Eq. (8) is not exactly the classic "Armijo condition", since we use \(\dot{Y}_k(0)\) instead of \(\dot{Z}_k(0)\). However, if we use only condition (8) to compute the step size, it still ensures descent of the objective function, as long as the directional derivative \(Tr[\mathcal {DF}(X_k)^{\top }\dot{Y}_k(0)]\) is negative. In this work, we also study the behavior of our algorithms when the step size is chosen to satisfy (8).
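A backtracking sketch of this strategy, reusing the routines above; taking \(\delta \) as the shrinking factor and trying the initial step first (so the largest tested step satisfying (8) is returned) is our reading of the procedure, not the paper's exact pseudocode:

```python
def armijo_like_step(Xk, Xk_prev, F, grad_F, tau0=1e-2, sigma=1e-4, delta=0.1,
                     max_backtracks=50):
    """Backtrack from tau0 until Z_k(tau) satisfies the descent condition (8)."""
    Gk = grad_F(Xk)
    Bk = Gk @ Xk.T - Xk @ Gk.T
    Ck = Gk @ Xk_prev.T - Xk_prev @ Gk.T
    Ydot0 = -((2.0/3.0) * Bk @ Xk + (1.0/3.0) * Ck @ Xk_prev)   # Y'_k(0)
    slope = np.trace(Gk.T @ Ydot0)                              # < 0 by Lemma 2
    Fk, tau = F(Xk), tau0
    for _ in range(max_backtracks):
        Z = linear_combination_step(Xk, Xk_prev, grad_F, tau)   # Z_k(tau)
        if F(Z) <= Fk + sigma * tau * slope:                    # condition (8)
            break
        tau *= delta                                            # shrink and retry
    return tau, Z
```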

Nonmonotone Search with Barzilai-Borwein Step Size. It is known that the Barzilai-Borwein (BB) step size, see [2], can sometimes improve the performance of line-search algorithms such as the steepest descent method without adding much computational cost. This technique considers the classic steepest descent method and proposes using either of the following step sizes:

$$\begin{aligned} \alpha _k^{BB1} = \frac{||S_k||_F^2}{Tr[S_k^{\top }R_k]}\quad \text {and} \quad \alpha _k^{BB2} = \frac{Tr[S_k^{\top }R_k]}{||R_k||_F^2}, \end{aligned}$$
(9)

where \(S_k = X_{k+1}-X_k\), \(R_k = \mathcal {DF}(X_{k+1}) - \mathcal {DF}(X_k)\), and the matrix \(B(\alpha ) = (\alpha I)^{-1}\) is considered an approximation of the Hessian of the objective function. For more details see [2, 13].
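A direct NumPy transcription of (9) (the function name is ours):

```python
def bb_step_sizes(Xk, Xk1, Gk, Gk1):
    """Barzilai-Borwein step sizes of Eq. (9), with Gk = DF(X_k), Gk1 = DF(X_{k+1})."""
    Sk, Rk = Xk1 - Xk, Gk1 - Gk
    sr = np.trace(Sk.T @ Rk)
    return np.linalg.norm(Sk, 'fro')**2 / sr, sr / np.linalg.norm(Rk, 'fro')**2
```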

Since the quantities \(\alpha _k^{BB1}\), \(\alpha _k^{BB2}\) can be negative, the absolute value of these step sizes is usually considered. On the other hand, the BB steps do not necessarily guarantee descent of the objective function at each iteration, which may prevent the method from converging. In order to address this issue, we use a technique that guarantees global convergence, see Refs. [3, 13] for details. In particular, we use a non-monotone line-search algorithm, see [17], combined with the BB step in order to select the step size, see Algorithm 1.

[Algorithm 1: non-monotone line-search algorithm with Barzilai-Borwein step size; pseudocode figure not reproduced here]

Note that when \(\eta =0\), Algorithm 1 reduces to a monotone algorithm that generates points satisfying the descent condition (8).
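Since Algorithm 1 is given as a figure, the following NumPy sketch shows one plausible realization of it: the descent condition (8) relaxed with the non-monotone reference value of Zhang and Hager [17], plus a BB step for the next trial step size. The initialization, safeguards and the specific way these ingredients are combined are our assumptions, not the paper's exact pseudocode:

```python
def nonmonotone_bb_solver(X0, X1, F, grad_F, eta=0.85, sigma=1e-4, delta=0.1,
                          tau0=1e-2, epsilon=1e-4, max_iter=8000):
    """Sketch of a non-monotone (Zhang-Hager) search with BB steps, cf. Algorithm 1."""
    Xp, X = X0, X1                          # R = X_{k-1} needs two starting points
    Cref, Q = F(X), 1.0                     # non-monotone reference value and weight
    tau = tau0
    for _ in range(max_iter):
        G = grad_F(X)
        if np.linalg.norm(G - X @ (G.T @ X), 'fro') < epsilon:   # Lemma 1 test
            break
        B = G @ X.T - X @ G.T
        Cm = G @ Xp.T - Xp @ G.T
        D = (2.0/3.0) * B @ X + (1.0/3.0) * Cm @ Xp              # so Y'_k(0) = -D
        slope = -np.trace(G.T @ D)                               # < 0 by Lemma 2
        for _ in range(50):                 # backtracking on the relaxed condition (8)
            Z = proj_stiefel(X - tau * D)
            if F(Z) <= Cref + sigma * tau * slope:
                break
            tau *= delta
        Qn = eta * Q + 1.0                  # Zhang-Hager update of the reference value
        Cref = (eta * Q * Cref + F(Z)) / Qn
        Q = Qn
        S, Rg = Z - X, grad_F(Z) - G        # BB step (absolute value, safeguarded)
        sr = np.trace(S.T @ Rg)
        tau = abs(np.linalg.norm(S, 'fro')**2 / sr) if sr != 0 else tau0
        tau = min(max(tau, 1e-10), 1e10)
        Xp, X = X, Z
    return X
```

With \(\eta = 0\) the reference value Cref collapses to \(\mathcal {F}(X_k)\), so the sketch reduces to the monotone search of the previous paragraph.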

3 Theoretical Results

In this section we prove some convergence results for Algorithm 1 when it is used with \(\eta = 0\).

Lemma 3

Let \(\{X_k\}\) be an infinite sequence generated by Algorithm 1. Then \(\{\mathcal {F}(X_k)\}\) is a convergent sequence. Moreover, any accumulation point \(X_{*}\) of \(\{X_k\}\) is feasible, i.e., \(X_{*}^{\top }X_{*} = I_p\).

Proof

By the construction of Algorithm 1 we have

$$\begin{aligned} \mathcal {F}(X_{k+1}) \le \mathcal {F}(X_k) + \sigma \tau _k Tr[G_k^{\top }\dot{Y}_k(0)], \qquad \forall k \end{aligned}$$
(10)

or equivalently,

$$\begin{aligned} \mathcal {F}(X_k) - \mathcal {F}(X_{k+1}) \ge - \sigma \tau _k Tr[G_k^{\top }\dot{Y}_k(0)] > 0, \qquad \forall k \quad (\text {since } Y_k(\tau ) \text { is a descent curve at } \tau = 0\,), \end{aligned}$$

so \(\{\mathcal {F}(X_k)\}\) is a monotonically decreasing sequence. Now, since the Stiefel manifold is a compact set and \(\mathcal {F}\) is a continuous function, \(\mathcal {F}\) attains its maximum and minimum on Stf(n,p). Therefore \(\{\mathcal {F}(X_k)\}\) is bounded, and hence \(\{\mathcal {F}(X_k)\}\) is a convergent sequence.

On the other hand, let \(\{X_k\}_{k\in \mathcal {K}}\) be a convergent subsequence of \(\{X_k\}\) and suppose that it converges to \(X_{*}\), that is, \(\lim _{k\in \mathcal {K}}X_k = X_{*}\). Since \(X_k\) is a feasible point for all \(k\in \mathcal {K}\) and Stf(n,p) is a compact (hence closed) set, we have \(X_{*}\in Stf(n,p) \), i.e.,

$$\begin{aligned} X_{*}^{\top }X_{*} = I, \end{aligned}$$

therefore every accumulation point is feasible.

Theorem 1

Let \(\{X_k\}\) be an infinite sequence generated by Algorithm 1. Then any accumulation point \(X_{*}\) of \(\{X_k\}\) satisfies the first-order optimality conditions.

The proof of Theorem 1 follows the ideas of the proof of Theorem 4.3.1 in [1], with slight adaptations.

4 Numerical Experiments

In this section we analyze the performance of our method by solving several simulated experiments with the format of problem (1), for different objective functions and different problem sizes. We also compare our proposal with some state-of-the-art methods in order to measure the performance and efficiency of our algorithms.

4.1 Implementation Details

All our experiments were performed using Matlab R2013a on an Intel i3-380M processor, 2.53 GHz CPU, with 500 Gb HD and 8 Gb of RAM. For the different parameters of our two algorithms, we use the following values: initial step size \(\tau =\) 1e-2, \(\sigma = \) 1e-4, \(\eta = 0.85\), \(\delta = 0.1\). Moreover, since the convergence of first-order methods (methods using the first derivative of the objective function) can be very slow, we use several stopping criteria:

$$\begin{aligned} ||\nabla \mathcal {F}(X_k)||_F<\epsilon , \quad \text {and} \quad (tol_k^{x}< xtol\,\wedge \, tol_k^{f} < ftol), \end{aligned}$$
(11)

and a maximum of K iterations, where

$$\begin{aligned} tol_k^{x} := \frac{||X_{k+1} - X_k||_F}{\sqrt{n}}, \quad \text {and} \quad tol_k^{f} := \frac{\mathcal {F}(X_k) - \mathcal {F}(X_{k+1})}{|\mathcal {F}(X_k)| + 1}. \end{aligned}$$

In the experiments, we used the following default values: \(xtol = \) 1e-6, \(ftol = \) 1e-12, \(T = 5\) and \(\epsilon = \) 1e-4.
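A small sketch of these stopping tests (the function name is ours); treating the gradient test and the pair of tolerance tests as alternative criteria, i.e., stopping when either holds, is our reading of (11):

```python
def stop_now(Xk, Xk1, Fk, Fk1, nrmG, n, eps=1e-4, xtol=1e-6, ftol=1e-12):
    """Stopping tests (11): gradient norm, or both relative-change tolerances."""
    tol_x = np.linalg.norm(Xk1 - Xk, 'fro') / np.sqrt(n)
    tol_f = (Fk - Fk1) / (abs(Fk) + 1.0)
    return nrmG < eps or (tol_x < xtol and tol_f < ftol)
```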

In all experiments presented in the following subsections we use the following notation:

  • Nfe: The number of evaluations of the objective function.

  • Nitr: The number of iterations performed by the algorithm until convergence.

  • Time: The time (in seconds) used by the algorithm to converge.

  • NrmG: The gradient norm of the Lagrangian function with respect to the primal variables, evaluated at the estimated "optimal".

  • Fval: Evaluation of the objective function at the estimated "optimal".

  • Feasi: Corresponds to the error \(|| \hat{X}^{\top }\hat{X} - I_p ||_F\), where \(\hat{X}\) denotes the "optimal" estimated by the algorithm.

In addition, we denote by Steep-Dest the Steepest Descent method, by Trust-Reg the Trust-Region method and by Conj-Grad the Conjugate Gradient method from the "manopt" toolbox (see Footnote 1), and by PGST the algorithm presented in [6]. On the other hand, Linear-Co denotes our Algorithm 1.

4.2 Weighted Orthogonal Procrustes Problem (WOPP)

Let \(X\in \mathbb {R}^{m\times n}\), \(A\in \mathbb {R}^{p\times m}\), \(B\in \mathbb {R}^{p\times q}\) and \(C\in \mathbb {R}^{n\times q}\). The Weighted Orthogonal Procrustes Problem (WOPP) consists in solving the following constrained optimization problem:

$$\begin{aligned} \min _{X\in \mathbb {R}^{m\times n}}&\frac{1}{2}|| AXC - B ||_F^{2} \\ \text {s. t.}&X^{\top }X = I_n. \nonumber \end{aligned}$$
(12)

When C is the identity matrix of appropriate dimensions, this problem is known as the Unbalanced Orthogonal Procrustes Problem (UOPP); for more details see [1].
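For reference, a NumPy sketch of the WOPP objective and its Euclidean derivative, \(\mathcal {DF}(X) = A^{\top }(AXC - B)C^{\top }\) (a standard matrix-calculus identity; the function names are ours), which is all that the feasible methods above require:

```python
def wopp_objective(X, A, B, C):
    """F(X) = 0.5 * ||A X C - B||_F^2."""
    Res = A @ X @ C - B
    return 0.5 * np.linalg.norm(Res, 'fro')**2

def wopp_gradient(X, A, B, C):
    """Euclidean derivative DF(X) = A^T (A X C - B) C^T."""
    return A.T @ (A @ X @ C - B) @ C.T
```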

Experiments with WOPP Problems. The problems in this subsection were taken from [18]. In particular, we considered \(n = q\), \(p = m\), \(A = PSR^{\top }\) and \(C = Q \varLambda Q^{\top }\), where P, Q and R are randomly generated orthogonal matrices with \(Q \in \mathbb {R}^{n\times n}\), \(R,P\in \mathbb {R}^{m \times m}\), \(\varLambda \in \mathbb {R}^{n\times n}\) is a diagonal matrix with entries drawn from a uniform distribution on \([\frac{1}{2},2]\), and S is a diagonal matrix defined for each type of problem, see below for details. As the starting point \(X_0\in \mathbb {R}^{m\times n}\), we generated random matrices on the Stiefel manifold. When not specified, matrix entries were generated using a standard Gaussian distribution.

For comparison purposes, we created problems with a known solution \(Q_{*}\in \mathbb {R}^{m\times n}\) randomly selected on the Stiefel manifold. Then we built the matrix B as \(B = AQ_{*}C \). Finally, for the different tested problems, the diagonal matrix S is described below.

Problem 1: The diagonal elements of S were drawn at random from the interval [10, 12].

Problem 2: The diagonal of S is given by \(S_{ii} = i + 2r_i\), where \(r_i\) is a random number uniformly distributed in the interval [0, 1].
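A sketch of how such an instance can be generated (our own reconstruction of the setup described above; drawing the orthogonal factors via QR of Gaussian matrices, and using a uniform draw on [10, 12] for Problem 1, are assumptions):

```python
def make_wopp_instance(m, n, problem=1):
    """Generate A, B, C and the known solution Q_* (with p = m, q = n)."""
    P, _ = np.linalg.qr(np.random.randn(m, m))
    R, _ = np.linalg.qr(np.random.randn(m, m))
    Q, _ = np.linalg.qr(np.random.randn(n, n))
    Lam = np.diag(np.random.uniform(0.5, 2.0, n))
    if problem == 1:
        S = np.diag(np.random.uniform(10.0, 12.0, m))              # well conditioned
    else:
        S = np.diag(np.arange(1, m + 1) + 2 * np.random.rand(m))   # ill conditioned
    A = P @ S @ R.T
    C = Q @ Lam @ Q.T
    Qstar, _ = np.linalg.qr(np.random.randn(m, n))   # known solution on Stf(m, n)
    B = A @ Qstar @ C
    return A, B, C, Qstar
```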

For each experiment, a total of 300 WOPP problems were built, with the matrix S generated according to Problem 1 and Problem 2 respectively. The maximum number of iterations, for all methods, was \(K=8000\).

The results of the previous experiments are presented in Tables 1 and 2. We denote by Error the standard error with respect to the global solution \(Q_{*}\), i.e., \(|| \hat{X} - Q_{*}||_F\), where \( \hat{X}\) is the optimum estimated by the algorithms. Furthermore, min, mean and max denote the minimum, average and maximum obtained by each algorithm over the 300 runs.

According to Table 1, for well-conditioned problems, i.e., Problem 1, all the algorithms present similar results. Note that PGST obtained a lower number of iterations. In general, all the methods presented a similar performance for this type of problem. On the other hand, for ill-conditioned problems, i.e., Problem 2, we observe that all the methods arrived at the solution \(Q_{*}\), according to the NrmG, Fval and Error measures. Moreover, our Linear-Co procedure obtained results similar to those of the PGST algorithm when \(n<m\), and when \(m=n\) the Linear-Co method achieved better results than PGST, see Table 2.

Table 1. Performance of the methods for well conditioned WOPP problems (Problem 1)
Table 2. Performance of the methods for ill-conditioned WOPP problems (Problem 2)

5 Conclusions

In this paper we proposed a feasible method for solving optimization problems with orthogonality constraints. The method is quite general; it is based on a linear combination of descent directions within the same manifold framework, and we are currently exploring several variants of this procedure. In order to preserve feasibility, our proposal requires a projection onto the Stiefel manifold; in particular, we used the SVD at each iteration. We also presented some convergence results. Finally, in the numerical experiments, the proposed algorithms obtained a competitive performance compared with some state-of-the-art algorithms.