Abstract
Nonconvex and nonsmooth optimization problems with linear equation and generalized orthogonality constraints have wide applications. These problems are difficult to solve due to the nonsmooth objective function and nonconvex constraints. In this paper, by introducing an extended proximal alternating linearized minimization (EPALM) method, we propose a framework based on the augmented Lagrangian scheme (EPALMAL). We also show that the EPALMAL method has global convergence in the sense that every bounded sequence generated by the EPALMAL method has at least one convergent subsequence that converges to a Karush–Kuhn–Tucker point of the original problem. Experiments on a variety of applications, including compressed modes and multivariate data analysis, have demonstrated that the proposed method is efficient and achieves performance comparable with existing methods.
Notes
We did not compare with the ADMM method, although its convergence for compact constraints has been established (e.g., [48]), since the only difference between the SOC method and the ADMM method is that in the latter the augmented penalty parameters are generally the same for different constraints.
SULDA_admm and SULDA\(\_{\ell _1}\) can be found at http://web.bii.a-star.edu.sg/~zhangxw/Publications.html, SDA and GLOSS can be found at http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=5671 and http://www.hds.utc.fr/~grandval/dokuwiki/doku.php?id=en:code, respectively.
Brain, Lymphoma and Srbct are obtained from http://stat.ethz.ch/~dettling/bagboost.html.
ORL\(_{64\times 64}\) is from http://www.cl.cam.ac.uk/Reasearch/DTG/attarchive:pub/data/attfaces/tar.Z, and Palmprint is from http://www4.comp.polyu.edu.hk/~biometrics/.
All data sets were downloaded from http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download.
The MATLAB implementation of SCCA with acceleration is available at http://web.bii.a-star.edu.sg/~zhangxw/Publications.html.
References
Abrudan, T., Eriksson, J., Koivunen, V.: Steepest descent algorithms for optimization under unitary matrix constraint. IEEE Trans. Signal Process. 56(3), 1134–1147 (2008)
Abrudan, T., Eriksson, J., Koivunen, V.: Conjugate gradient algorithm for optimization under unitary matrix constraint. Signal Process. 89(9), 1704–1714 (2009)
Absil, P.A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2009)
Andreani, R., Birgin, E.G., Martínez, J.M., Schuverdt, M.L.: On augmented Lagrangian methods with general lower-level constraints. SIAM J. Optim. 18(4), 1286–1309 (2007)
Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka–Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)
Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program. 137(1–2), 91–129 (2013)
Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Academic Press, London (1982)
Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (1999)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)
Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30(7), 1145–1159 (1997)
Chen, W.: Wavelet frames on the sphere, high angular resolution diffusion imaging and \(l_1\)-regularized optimization on Stiefel manifolds. Ph.D. thesis, The National University of Singapore (2015)
Chen, W., Ji, H., You, Y.: An augmented Lagrangian method for \(\ell _1\)-regularized optimization problems with orthogonality constraints. SIAM J. Sci. Comput. 38(4), B570–B592 (2016)
Chu, D., Liao, L.Z., Ng, M.K., Zhang, X.: Sparse canonical correlation analysis: new formulation and algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 3050–3065 (2013)
Chu, M.T., Trendafilov, N.T.: The orthogonally constrained regression revisited. J. Comput. Graph. Stat. 10(4), 746–771 (2001)
Clarke, F.H., Ledyaev, Y.S., Stern, R.J., Wolenski, P.R.: Nonsmooth Analysis and Control Theory, vol. 178. Springer, Berlin (2008)
Clemmensen, L., Hastie, T., Witten, D., Ersbøll, B.: Sparse discriminant analysis. Technometrics 53(4), 406–413 (2011)
Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley, New York (2000)
Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20(2), 303–353 (1998)
Fawcett, T.: An introduction to ROC analysis. Pattern Recognit. Lett. 27(8), 861–874 (2006)
Francisco, J.B., Martínez, J.M., Martínez, L., Pisnitchenko, F.: Inexact restoration method for minimization problems arising in electronic structure calculations. Comput. Optim. Appl. 50(3), 555–590 (2011)
Grubišić, I., Pietersz, R.: Efficient rank reduction of correlation matrices. Linear Algebra Appl. 422(2), 629–653 (2007)
Hardoon, D.R., Shawe-Taylor, J.: Sparse canonical correlation analysis. Mach. Learn. 83(3), 331–353 (2011)
Hestenes, M.R.: Multiplier and gradient methods. J. Optim. Theory Appl. 4(5), 303–320 (1969)
Hotelling, H.: Relations between two sets of variates. Biometrika 28(3/4), 321–377 (1936)
Howland, P., Jeon, M., Park, H.: Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition. SIAM J. Matrix Anal. Appl. 25, 165–179 (2003)
Jiang, B., Dai, Y.H.: A framework of constraint preserving update schemes for optimization on Stiefel manifold. Math. Program. 153(2), 535–575 (2015)
Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: MT Summit, vol. 5, pp. 79–86. Citeseer (2005)
Kokiopoulou, E., Chen, J., Saad, Y.: Trace optimization and eigenproblems in dimension reduction methods. Numer. Linear Algebra Appl. 18(3), 565–602 (2011)
Kovnatsky, A., Glashoff, K., Bronstein, M.M.: MADMM: a generic algorithm for non-smooth optimization on manifolds. arXiv preprint arXiv:1505.07676 (2015)
Kurdyka, K.: On gradients of functions definable in o-minimal structures. Annales de l’institut Fourier 48(3), 769–783 (1998)
Lai, R., Osher, S.: A splitting method for orthogonality constrained problems. J. Sci. Comput. 58(2), 431–449 (2014)
Li, G., Pong, T.K.: Global convergence of splitting methods for nonconvex composite optimization. SIAM J. Optim. 25(4), 2434–2460 (2015)
Lojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles pp. 87–89 (1963)
Lu, Z., Zhang, Y.: An augmented Lagrangian approach for sparse principal component analysis. Math. Program. 135(1–2), 149–193 (2012)
Merchante, L., Grandvalet, Y., Govaert, G.: An efficient approach to sparse linear discriminant analysis. In: Proceedings of the 29th International Conference on Machine Learning (2012)
Mordukhovich, B.S.: Variational Analysis and Generalized Differentiation I: Basic Theory, vol. 330. Springer, Berlin (2006)
Mordukhovich, B.S., Shao, Y.: On nonconvex subdifferential calculus in Banach spaces. J. Convex Anal. 2(1/2), 211–227 (1995)
Moreau, J.J.: Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France 93, 273–299 (1965)
Nishimori, Y., Akaho, S.: Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing 67, 106–135 (2005)
Ozoliņš, V., Lai, R., Caflisch, R., Osher, S.: Compressed modes for variational problems in mathematics and physics. Proc. Natl. Acad. Sci. 110(46), 18368–18373 (2013)
Powell, M.J.: A method for non-linear constraints in minimization problems. UKAEA (1967)
Rockafellar, R.T.: Augmented Lagrange multiplier functions and duality in nonconvex programming. SIAM J. Control 12(2), 268–285 (1974)
Rockafellar, R.T., Wets, R.J.B.: Variational Analysis, vol. 317. Springer, Berlin (2009)
Savas, B., Lim, L.H.: Quasi-Newton methods on Grassmannians and multilinear approximations of tensors. SIAM J. Sci. Comput. 32(6), 3352–3393 (2010)
Sriperumbudur, B.K., Torres, D.A., Lanckriet, G.R.: A majorization-minimization approach to the sparse generalized eigenvalue problem. Mach. Learn. 85(1–2), 3–39 (2011)
Vinokourov, A., Cristianini, N., Shawe-Taylor, J.S.: Inferring a semantic representation of text via cross-language correlation analysis. In: Advances in Neural Information Processing Systems, pp. 1473–1480 (2002)
Voorhees, E.M.: The sixth text retrieval conference (TREC-6). Inf. Process. Manag. 36(1), 1–2 (2000)
Wang, Y., Yin, W., Zeng, J.: Global convergence of ADMM in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324 (2015)
Wen, Z., Yang, C., Liu, X., Zhang, Y.: Trace-penalty minimization for large-scale eigenspace computation. J. Sci. Comput. 66, 1175–1203 (2016)
Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. 142(1–2), 397–434 (2013)
Witten, D.M., Tibshirani, R., Hastie, T.: A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), 515–534 (2009)
Yang, C., Meza, J.C., Wang, L.W.: A trust region direct constrained minimization algorithm for the Kohn–Sham equation. SIAM J. Sci. Comput. 29(5), 1854–1875 (2007)
Yang, K., Cai, Z., Li, J., Lin, G.: A stable gene selection in microarray data analysis. BMC Bioinform. 7(1), 228 (2006)
Ye, J.: Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. J. Mach. Learn. Res. 6(4), 483–502 (2005)
Zhang, L., Li, R.: Maximization of the sum of the trace ratio on the Stiefel manifold, I: theory. Sci. China Math. 57(12), 2495–2508 (2014)
Zhang, L., Li, R.: Maximization of the sum of the trace ratio on the Stiefel manifold, II: computation. Sci. China Math. 58(7), 1549–1566 (2015)
Zhang, X.: Sparse dimensionality reduction methods: algorithms and applications. Ph.D. thesis, The National University of Singapore (2013)
Zhang, X., Chu, D.: Sparse uncorrelated linear discriminant analysis. In: Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 45–52 (2013)
Zhang, X., Chu, D., Tan, R.C.: Sparse uncorrelated linear discriminant analysis for undersampled problems. IEEE Trans. Neural Netw. Learn. Syst. 27(7), 1469–1485 (2015)
Zhao, Y., Karypis, G.: Empirical and theoretical comparisons of selected criterion functions for document clustering. Mach. Learn. 55(3), 311–331 (2004)
Acknowledgements
The authors would like to thank the two anonymous referees for their valuable comments and suggestions. The work of L.-Z. Liao was supported in part by grants from Hong Kong Baptist University (FRG) and General Research Fund (GRF) of Hong Kong.
Appendices
Appendix 1: Proof of Lemma 2
To prove Lemma 2, we need a few preliminary results regarding the limiting subdifferential of indicator functions. For any closed set \({\mathscr {X}}\), it is well known [37] that the limiting subdifferential of the indicator function \(\delta _{{\mathscr {X}}}\) at x is given by the normal cone to \({\mathscr {X}}\) at x, denoted by \(N_{{\mathscr {X}}}(x)\).
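In symbols, this standard identity reads

$$\begin{aligned} \partial \delta _{{\mathscr {X}}}(x) = N_{{\mathscr {X}}}(x), \quad \forall \, x\in {\mathscr {X}}. \end{aligned}$$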
Next, we consider the normal cones of two specific closed sets. Let \(\ell (X) = {\mathscr {A}}X : {\mathbb {R}}^{m\times n}\rightarrow {\mathbb {R}}^{\gamma \times n}\) be a linear mapping with \({\mathscr {A}} \in {\mathbb {R}}^{\gamma \times m}\), and define the closed set \({\mathscr {X}}:= \{X \in {\mathbb {R}}^{m\times n} \ | \ {\mathscr {A}}X = 0\}\), i.e., the null space of \(\ell \); its normal cone is easy to describe.
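Since \({\mathscr {X}}\) is a linear subspace, its normal cone at any \(X\in {\mathscr {X}}\) is the orthogonal complement of \({\mathscr {X}}\) under the trace inner product, namely the range of the adjoint map \({\varLambda }\mapsto {\mathscr {A}}^T{\varLambda }\):

$$\begin{aligned} N_{{\mathscr {X}}}(X) = \left\{ {\mathscr {A}}^T{\varLambda } \ | \ {\varLambda }\in {\mathbb {R}}^{\gamma \times n}\right\} , \quad \forall \, X\in {\mathscr {X}}. \end{aligned}$$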
Let \({\mathscr {M}} = \{X\in {\mathbb {R}}^{n\times q} \ |\ X^TMX = I_q, M\in S_+^n\}\) be the set of matrices satisfying the generalized orthogonality constraints. For any curve \(Y(t)\in {\mathscr {M}}\) with \(Y(0) = X\), differentiating the constraint gives \((Y'(t))^TMY(t) + Y(t)^TMY'(t) = 0\). Setting \(t = 0\) and noting that \(Y'(0)\in T_{{\mathscr {M}}}(X)\), we have
Let \(MX = {\bar{U}}{\bar{{\varSigma }}} {\bar{V}}^T\) be the reduced SVD of MX, and \((MX)_\perp \) be a column orthogonal matrix such that \([{\bar{U}} \ (MX)_\perp ]\) is an orthogonal matrix, then \(\eta \) can be written as
where \(\eta _1^T{\bar{{\varSigma }}}{\bar{V}}^T + {\bar{V}}{\bar{{\varSigma }}}\eta _1 = 0\), which means \({\bar{V}}{\bar{{\varSigma }}}\eta _1\) is skew-symmetric. Moreover,
where \((X^TM)^\dagger \) is the pseudo-inverse of \(X^TM\). Therefore, the tangent space of \({\mathscr {M}}\) at X can be characterized explicitly.
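When M is positive definite (so that MX has full column rank), this characterization is equivalent to the familiar description as the kernel of the differential of the constraint map \(X\mapsto X^TMX - I_q\):

$$\begin{aligned} T_{{\mathscr {M}}}(X) = \left\{ \eta \in {\mathbb {R}}^{n\times q} \ | \ \eta ^TMX + X^TM\eta = 0\right\} . \end{aligned}$$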
In addition, any \(\zeta \in N_{{{\mathscr {M}}}}(X)\) can be written as
Since \(\left\langle \zeta , \eta \right\rangle = 0\) for any \(\eta \in T_{{\mathscr {M}}}(X)\), we have
The first equality implies that
and thus \({\bar{V}}{\bar{{\varSigma }}}^{-1}\zeta _1\) must be symmetric. Hence, the normal cone of \({\mathscr {M}}\) at X can be described explicitly.
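In the notation above, writing \(S^q\) for the set of \(q\times q\) symmetric matrices, this description can be stated compactly as

$$\begin{aligned} N_{{\mathscr {M}}}(X) = \left\{ MXS \ | \ S\in S^q\right\} , \end{aligned}$$

which is consistent with the relation \({\varLambda }_2^* = 2MG^*{\varLambda }_3^*\) obtained in the proof of Lemma 2 below.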
Now, we are ready to prove Lemma 2.
Proof
Equalities
hold since \((X^*, Y^*, G^*)\) is feasible as a local minimizer. For convenience of analysis, we denote \(W := \begin{pmatrix} X\\ Y\\ G \end{pmatrix}\in {\mathbb {R}}^{(2n+m)\times q}\) and define \(g_1: {\mathbb {R}}^{(2n+m)\times q} \rightarrow {\mathbb {R}} ^{(l+ n)\times q}\) to collect the residuals of the linear constraints of problem (9).
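Consistent with the feasibility equalities above and the linear constraints of problem (9), \(g_1\) can be taken to stack the two linear residuals:

$$\begin{aligned} g_1(W) = \begin{pmatrix} AX + BY - C\\ X - G \end{pmatrix} \in {\mathbb {R}}^{(l+n)\times q}. \end{aligned}$$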
Let \({\varOmega } = \{W\in {\mathbb {R}}^{(2n+m)\times q}\ | \ g_1(W) = 0\}\), then problem (9) is equivalent to
Since \((X^*, Y^*, G^*)\) is a local minimizer, by the generalized Fermat's rule and the subdifferentiability property [15, 43], we have
As described at the beginning of this “Appendix 1”, the subdifferentials of the indicator functions \(\delta _{{\mathscr {M}}}\) and \(\delta _{{\varOmega }}\) are given by the normal cones (33) and (32), respectively. In particular,
and
Therefore, there exist \(v^*\in \partial f(X^*), \ w^*\in \partial g(Y^*), \ {\varLambda }_1^*\in {\mathbb {R}}^{l\times q}, \ {\varLambda }_2^*\in {\mathbb {R}}^{n\times q}, \ {\varLambda }_3^*\in S^q\) such that
which proves equality (12). Moreover, it yields that \({\varLambda }_2^* = 2MG^*{\varLambda }_3^*\). Substituting this into (12) and eliminating \(G^*\), we get equality (13). This completes the proof. \(\square \)
Appendix 2: Proof of Theorem 1
Proof
For any limit point \((X^*, Y^*, G^*)\) of the bounded sequence \(\{(X^k, Y^k, G^k)\}_{k\in {\mathbb {N}}}\), there exists an index set \({\mathscr {K}}\subset {\mathbb {N}}\) such that \(\{(X^k, Y^k, G^k)\}_{k\in {\mathscr {K}}}\) converges to \((X^*, Y^*, G^*)\). To prove that \((X^*, Y^*, G^*)\) is a KKT point, we first show that it is a feasible point. The equality \((G^*)^TMG^* = I_q\) is trivial to check since \((G^k)^TMG^k = I_q\) holds for any \(k\in {\mathbb {N}}\). If \(\{\rho _k\}\) is bounded, then by the updating rule of \(\rho _k\) in Algorithm 2, there exists a \(k_0\in {\mathbb {N}}\) such that
By the definition of \(R_j^k, j = 1, 2\), it holds that
for any \(k\ge k_0\). Thus
If \(\{\rho _k\}\) is unbounded, by the generalized Fermat rule, finding a solution satisfying the constraint (11) is equivalent to calculating a point \((X^k, Y^k, G^k)\) such that
for some \(v^k\in \partial f(X^k), w^k\in \partial g(Y^k), {\varLambda }_3^k \in {\mathscr {S}}^q\), and \(\epsilon _{k}\downarrow 0\) as \(k\rightarrow \infty \). Notice that \(\{{\bar{{\varLambda }}}_1^k\}\) and \(\{{\bar{{\varLambda }}}_2^k\}\) are bounded, and that \(\{v^k\}_{k\in {\mathscr {K}}}, \{w^k\}_{k\in {\mathscr {K}}}, \{\partial _X h(X^k, Y^k)\}\) and \(\{\partial _Y h(X^k, Y^k)\}\) are bounded under Assumption 1 (iii)–(iv). Letting \(k\in {\mathscr {K}}\) go to infinity, Eq. (34) implies that
Recalling that B has full row rank, we get \(AX^* + BY^* - C = 0\) and \(X^* - G^* = 0\). Therefore, in both cases, \((X^*, Y^*, G^*)\) is a feasible point.
Next, we show that there exist \({\varLambda }_1^*\in {\mathbb {R}}^{l\times q}, {\varLambda }_2^*\in {\mathbb {R}}^{n\times q}\) and \({\varLambda }_3^*\in {\mathscr {S}}^q\) such that \((X^*, Y^*, G^*; {\varLambda }_1^*, {\varLambda }_2^*, {\varLambda }_3^*)\) satisfies (12). Since \(\{X^k, Y^k, G^k\}_{k\in {\mathbb {N}}}\) is bounded, there exists an index set \({\mathscr {K}}\subseteq {\mathbb {N}}\) such that \(\lim _{k\in {\mathscr {K}}}(X^k, Y^k, G^k) = (X^*, Y^*, G^*)\). Since \(\{v^k\}_{k\in {\mathscr {K}}}\) is bounded, there exists a subsequence \({\mathscr {K}}_2\subseteq {\mathscr {K}}\) such that \(\lim _{k\in {\mathscr {K}}_2}v^k = v^*\). Moreover, by the closedness property of the limiting subdifferential, we get
Similarly, there exists a subsequence \({\mathscr {K}}_3\subseteq {\mathscr {K}}_2\) such that \(\lim _{k\in {\mathscr {K}}_3}w^k = w^*\), and
Combined with the updating formulas of \({\bar{{\varLambda }}}_1^k\) and \({\bar{{\varLambda }}}_2^k\) in Step 2 of Algorithm 2, (34) implies that there exists a \(\xi ^k\) with \(\Vert \xi ^k\Vert _\infty \le \frac{\epsilon _{k-1}}{\rho _{k-1}}\) such that
Define
we can rewrite (35) as
Since the columns of \({\varXi }^k\) are linearly independent, \(({\varXi }^k)^T({\varXi }^k)\) is nonsingular. Thus
Taking the limit in (36) as \(k\in {\mathscr {K}}_3\) goes to infinity, and noticing that \(\Vert \xi ^k\Vert _{\infty } \le \frac{\epsilon _{k-1}}{\rho _{k-1}}\) with \(\epsilon _k\downarrow 0\) as \(k\rightarrow \infty \), we have
where
has full column rank. From the definition of \({\varUpsilon }^k\), taking the limit as \(k\in {\mathscr {K}}_3\) goes to infinity on both sides of (35) yields
where \({\varLambda }_3^*\in {\mathscr {S}}^q\) since \({\varLambda }_3^k\in {\mathscr {S}}^q\) for any \(k\in {\mathbb {N}}\). According to Lemma 2, \((X^*, Y^*, G^*)\) is a KKT point of problem (9). Moreover, \((X^*, Y^*)\) is a KKT point of problem (1). \(\square \)
Appendix 3: Proof of Proposition 1
Proof
In this proof, we assume k is fixed and, for simplicity of notation, we let \(W := \begin{pmatrix} X\\ Y\\ G \end{pmatrix}\) and \(L_k(W): = L_{\rho _{k-1}}(X, Y, G; {\bar{{\varLambda }}}^{k-1})\).
1) By the first-order optimality conditions of the three subproblems in (15), there exist \(v^{k,j}\in \partial f_1^k(X^{k, j}), w^{k,j}\in \partial f_2^k(Y^{k, j}), \nu ^{k,j}\in \partial f_3^k(G^{k,j})\) such that
Combining the above relations, a direct calculation shows that \(A^{k,j}\) defined in (19) has the following properties:
Since \(H_k(W)\) is continuously differentiable, by the subdifferentiability property [5, 43],
which implies
To show \(\Vert A^{k,j}\Vert _\infty \rightarrow 0\), it suffices to show that \(L_k(W)\) satisfies the conditions in Assumption 2 and apply Lemma 1 c).
We first show that \(\{W^{k,j}\}_{j\in {\mathbb {N}}}\) generated by Algorithm 3 is bounded; we argue by contradiction. Notice that \({\tilde{L}}_k(W) = \rho _{k-1}L_k(W)\) is a coercive function, since \(\phi (X, Y) + \frac{\rho _0}{2}\Vert AX + BY - C\Vert _F^2\) is coercive by assumption, \(\{\rho _k\}_{k\in {\mathbb {N}}}\) is non-decreasing, \(\delta _{{\mathscr {M}}}(G)\) is coercive, and \(\frac{\rho _{k-1}}{2}\Vert X - G + {\bar{{\varLambda }}}_2^{k-1}\Vert _F^2 - \frac{1}{2\rho _{k-1}}\Vert {\bar{{\varLambda }}}^{k-1}\Vert _F^2\) is bounded from below. Suppose that \(\lim _{j\rightarrow \infty }\Vert W^{k,j}\Vert _\infty = +\infty \); then there must hold
On the other hand, we know from Lemma 1 a) that \(\{L_{k}(W^{k,j})\}_{j\in {\mathbb {N}}}\) is a decreasing sequence, thus \(\{{\tilde{L}}_{k}(W^{k,j})\}_{j\in {\mathbb {N}}}\) is non-increasing, which implies
This is a contradiction; hence \(\{W^{k,j}\}_{j\in {\mathbb {N}}}\) is bounded.
Now, we verify that \(L_k(W)\) satisfies the conditions in Assumption 2. By the definitions of \(L_k(W), H_k(W), f_i^k, i = 1, 2, 3\) and Assumption 1 (i), it is easy to see that for any given \({\bar{{\varLambda }}}^{k-1}\) and \(\rho _{k-1}\), the following results hold:
(a) \(f_i^k, i = 1, 2, 3\), are proper and lower semicontinuous functions satisfying \(\inf f_i^k > -\infty \), \(H_k\) is a \(C^1\) function, and
$$\begin{aligned} H_k(W)&= \frac{1}{\rho _{k-1}} h(X,Y) + \frac{1}{2}\Vert AX + BY - C + {\bar{{\varLambda }}}_1^{k-1} / \rho _{k-1}\Vert _F^2 \nonumber \\&\quad + \frac{1}{2}\Vert X - G + {\bar{{\varLambda }}}_2^{k-1}/\rho _{k-1}\Vert _F^2 - \frac{1}{2\rho _{k-1}^2}\Vert {\bar{{\varLambda }}}^{k-1}\Vert _F^2. \end{aligned}$$ (37)
Thus \(\inf _{W} H_k(W) \ge \frac{1}{\rho _{k-1}}\min _{X,Y}h(X,Y) - \frac{1}{2(\rho _{k-1})^2}\Vert {\bar{{\varLambda }}}^{k-1}\Vert _F^2 > -\infty \).
(b) Since \(H_k\) is a quadratic function with respect to \(G\), \(\nabla _G H_k\) is obviously Lipschitz continuous. Regarding the Lipschitz continuity of the partial derivatives \(\nabla _X h(X, Y)\) and \(\nabla _Y h(X, Y)\), we have
$$\begin{aligned} \Vert \nabla _{X} H_k(X, Y^{k,j-1}, G^{k,j-1}) - \nabla _{X} H_k({\tilde{X}}, Y^{k,j-1}, G^{k,j-1})\Vert \le L_1^{k,j-1} \Vert X - {\tilde{X}}\Vert , \end{aligned}$$
where \(L_1^{k,j-1} = \frac{L_1(Y^{k,j-1})}{\rho _{k-1}} + \Vert A^TA\Vert + 1\), and
$$\begin{aligned} \Vert \nabla _{Y} H_k(X^{k,j}, Y, G^{k,j-1}) - \nabla _{Y} H_k(X^{k,j}, {\tilde{Y}}, G^{k,j-1})\Vert \le L_2^{k,j-1}\Vert Y - {\tilde{Y}}\Vert , \end{aligned}$$
where \(L_2^{k,j-1} = \frac{L_2(X^{k,j})}{\rho _{k-1}} + \Vert B^TB\Vert + 1\). In addition, let \({\bar{L}}_1 = \sup _{Y}L_1(Y)\) and \({\bar{L}}_2 = \sup _XL_2(X)\), then the boundedness of \(\{W^{k,j}\}_{j\in {\mathbb {N}}}\) and Assumption 1 (iii) imply that \({\bar{L}}_1 < \infty \) and \({\bar{L}}_2 < \infty \), and we have
$$\begin{aligned}&\Vert A^TA\Vert + 1 \le L_1^{k, j-1} \le {\bar{L}}_1/\rho _0 + \Vert A^TA\Vert + 1,\quad \\&\quad \Vert B^TB\Vert + 1 \le L_2^{k, j-1} \le {\bar{L}}_2/\rho _0 + \Vert B^TB\Vert + 1, \end{aligned}$$
which imply
$$\begin{aligned} \inf _j\{L_1^{k, j-1}\} \ge \Vert A^TA\Vert + 1> -\infty , \quad \inf _j\{L_2^{k, j-1}\} \ge \Vert B^TB\Vert + 1 > -\infty , \end{aligned}$$
and
$$\begin{aligned}&\sup _j\left\{ L_1^{k, j-1}\right\} \le {\bar{L}}_1/\rho _0 + \Vert A^TA\Vert + 1< +\infty , \quad \sup _j\left\{ L_2^{k, j-1}\right\} \\&\quad \le {\bar{L}}_2/\rho _0 + \Vert B^TB\Vert + 1 < +\infty . \end{aligned}$$
Moreover, Assumption 2 (iii) holds by the definition of \(B_i^{k, j-1}, i = 1, 2\) and \(B_3^k\). Thus, Assumption 2 (i)–(iii) hold.
(c) Assumption 2 (iv) holds since h(X, Y) satisfies Assumption 1 (iii).
2) From the proof of 1), we know that \(\{W^{k,j}\}_{j\in {\mathbb {N}}}\) is bounded. It then remains to show that \(L_k(W)\) is a K–L function and to apply Lemma 1 d). Notice that
i.e., \(L_k(W)\) satisfies the K–L property, being a finite sum of functions satisfying the K–L property, and the result then holds directly. \(\square \)
Appendix 4: Outline of Algorithms for CMs, Sparse ULDA and Sparse CCA
In this section, we give the detailed derivations of the algorithms for CMs, sparse ULDA and sparse CCA used in Sect. 5.
1.1 Compressed Modes
The scaled augmented Lagrangian function associated with (21) is
where \({\mathscr {X}} = St(n,N)\) and
Applying Algorithm 3 with \(B_1^{k,j-1} = \gamma _1^k I\) and \(B_2^{k,j-1} = \gamma _2 I\), we get the following updating of \(({\varPsi }^{k, j}, X^{k, j})\) for any fixed \(k\in {\mathbb {N}}\)
where \(\mathbf{shrink }(x, \eta ) = \text{ sign }(x)\odot \max \{|x| - \eta , 0\}\) is the soft-shrinkage operator, \(\odot \) denotes the component-wise product, and \({\tilde{U}}{\tilde{{\varSigma }}}{\tilde{V}}^T = {\varPsi }^{k,j} + \frac{{\bar{{\varLambda }}}^{k-1}}{\rho _{k-1}} + (\gamma _2 - 1) X^{k,j-1}\) is the reduced SVD. An outline of the algorithm is given in Algorithm 4.
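For concreteness, here is a minimal NumPy sketch of the two building blocks appearing in these updates: the component-wise soft-shrinkage operator and the SVD-based step that restores orthogonality. It is assumed, as is standard for such splitting schemes, that the orthogonal factor is recovered as \({\tilde{U}}{\tilde{V}}^T\) from the reduced SVD given above; the function names are illustrative, and the exact thresholds and step-size scalings come from the update formulas for (21).

```python
import numpy as np

def shrink(x, eta):
    # Component-wise soft-shrinkage: sign(x) * max(|x| - eta, 0).
    return np.sign(x) * np.maximum(np.abs(x) - eta, 0.0)

def nearest_orthonormal(A):
    # Closest matrix with orthonormal columns in the Frobenius norm:
    # if A = U S V^T is the reduced SVD, the minimizer is U V^T.
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt
```

In the \({\varPsi }\)-update, shrink is applied entry-wise with a threshold determined by the \(\ell _1\) weight and the proximal parameters; in the X-update, nearest_orthonormal would be applied to \({\varPsi }^{k,j} + {\bar{{\varLambda }}}^{k-1}/\rho _{k-1} + (\gamma _2 - 1) X^{k,j-1}\) (a positive rescaling of this argument, such as the \(1/\gamma _2\) factor from the proximal step, does not change the projection).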
1.2 Sparse ULDA
By introducing an auxiliary variable \(Z = X\), the scaled augmented Lagrangian function associated with (25) is
where \({\mathscr {O}}\) denotes the set of orthogonal matrices and
Applying Algorithm 3 with \(B_1^{k,j-1} = \gamma _1 I, B_2^{k,j-1} = \gamma _2 I\) and \(B_3^{k} = \gamma _3 I\), we get the following updating of \((G^{k, j}, X^{k, j}, Z^{k, j})\) for any fixed \(k\in {\mathbb {N}}\)
where \({\tilde{U}}{\tilde{{\varSigma }}}{\tilde{V}}^T = X^{k,j} + \gamma _3 Z^{k,j} + {\bar{{\varLambda }}}_2^{k-1} /\rho _{k-1}\) is the reduced SVD.
Applying Algorithm 2 to problems (25) and (26) is the same except that in the latter case we compute \(G^{k,j}\) as follows: let \({\varDelta }_G^{k,j-1}:=G^{k, j-1} - \frac{1}{\gamma _2}{\mathscr {B}}^T({\mathscr {A}}X^{k, j} + {\mathscr {B}}G^{k, j-1} + \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}})\); then
where the subscript \(i,:\) denotes the i-th row.
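A minimal NumPy sketch of this row-wise update, assuming it takes the usual group soft-thresholding form; the threshold tau below stands in for the quantity determined by the group-sparsity weight and the proximal parameters, and is an illustrative argument rather than the exact expression in (26).

```python
import numpy as np

def row_group_shrink(Delta, tau):
    # Row-wise group soft-thresholding: scale row Delta[i, :] by
    # max(0, 1 - tau / ||Delta[i, :]||_2); rows with norm <= tau become zero.
    norms = np.linalg.norm(Delta, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, np.finfo(float).eps))
    return scale * Delta
```

Applied to \({\varDelta }_G^{k,j-1}\) defined above, such a step would produce \(G^{k,j}\) one row (group) at a time.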
Summarizing the above procedure leads to the sparse ULDA algorithm SULDAAL and its group sparse variant, both outlined in Algorithm 5.
1.3 Sparse CCA
By introducing auxiliary variables \(P = W_x, Q = W_y\), and denoting \({\mathscr {X}} = \{P\ |\ P^TX^TXP = I\}, {\mathscr {Y}} = \{Q\ |\ Q^TY^TYQ = I\}\), we can reformulate (28) as
The scaled augmented Lagrangian function associated with (28) is
where
Applying Algorithm 3 with \(B_1^{k,j-1} = \gamma _1 I, B_2^{k,j-1} = \gamma _2 I, B_3^{k,j} = \alpha ^kX^TX - I_{d_1} + \alpha ^k(I_{d_1} - U_1U_1^T)\) and \(B_4^{k,j} = \beta ^kY^TY - I_{d_2} + \beta ^k(I_{d_2} - V_1V_1^T)\), where \(U_1\) and \(V_1\) are obtained from the reduced SVD of X and Y, respectively, as in Eq. (29), we get the following updating of \((W_x^{k, j}, W_y^{k,j})\) for any fixed \(k\in {\mathbb {N}}\)
The resulting algorithm SCCAALN is outlined in Algorithm 6.
For problem (31), the associated scaled augmented Lagrangian function is
where
Applying Algorithm 3 with \(B_1^{k,j-1} = \gamma _1 I, B_2^{k,j-1} = \gamma _2 I\) and \(B_3^{k,j} = \gamma _3 I\), we get the following updating of \((W_x^{k, j}, W_y^{k,j}, W^{k, j})\) for any fixed \(k\in {\mathbb {N}}\)
where \({\varDelta }_x^{k,j} = \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}} + U_1^TW_x^{k, j}, {\varDelta }_y^{k,j} = \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}} + V_1^TW_y^{k, j}\) and \({\tilde{U}}{\tilde{{\varSigma }}}{\tilde{V}}^T = W^{k,j-1} + \frac{1}{\gamma _3}(P_1^T{\varSigma }_1^{-1}{\varDelta }_x^{k,j} + P_2^T{\varSigma }_2^{-1}{\varDelta }_y^{k,j})\) is the reduced SVD. The resulting algorithm WSCCAAL is outlined in Algorithm 7.
About this article
Cite this article
Zhu, H., Zhang, X., Chu, D. et al. Nonconvex and Nonsmooth Optimization with Generalized Orthogonality Constraints: An Approximate Augmented Lagrangian Method. J Sci Comput 72, 331–372 (2017). https://doi.org/10.1007/s10915-017-0359-1