
DC formulations and algorithms for sparse optimization problems

Full Length Paper, Mathematical Programming, Series B

Abstract

We propose a DC (Difference of two Convex functions) formulation approach for sparse optimization problems having a cardinality or rank constraint. With the largest-k norm, an exact DC representation of the cardinality constraint is provided. We then transform the cardinality-constrained problem into a penalty function form and derive exact penalty parameter values for some optimization problems, especially for quadratic minimization problems which often appear in practice. A DC Algorithm (DCA) is presented, where the dual step at each iteration can be efficiently carried out due to the accessible subgradient of the largest-k norm. Furthermore, we can solve each DCA subproblem in linear time via a soft thresholding operation if there are no additional constraints. The framework is extended to the rank-constrained problem as well as the cardinality- and the rank-minimization problems. Numerical experiments demonstrate the efficiency of the proposed DCA in comparison with existing methods which have other penalty terms.


Notes

  1. When a state-of-the-art solver is employed, (1) can be represented with the so-called Specially Ordered Sets of Type 1 (SOS-1), in which case one does not have to worry about the big-M constants. Note that the approach proposed in [45] is based on DCA, and the magnitude of M can affect the performance of the DCA.

  2. In general, for \(k\in [1,n]\), it holds that \(|||\varvec{w}|||_{k}=\min _c\left\{ kc+\sum _{i=1}^n[|w_{i}|-c ]^+\right\} \), which can be further rewritten as a linear program and solved in O(n) time. For further properties of the norm, see [5, 20, 42, 56].
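
    As an illustration (ours, not from the paper's code), the following NumPy sketch evaluates the largest-k norm both directly and via the variational formula above, using the fact that \(c=|w_{(k)}|\) attains the minimum; the example vector is arbitrary.

    import numpy as np

    def largest_k_norm(w, k):
        # sum of the k largest absolute entries of w
        a = np.abs(np.asarray(w, dtype=float))
        return np.sort(a)[-k:].sum()

    def largest_k_norm_variational(w, k):
        # evaluate k*c + sum_i [|w_i| - c]^+ at c = |w_(k)|, which attains the minimum above
        a = np.abs(np.asarray(w, dtype=float))
        c = np.sort(a)[-k]
        return k * c + np.maximum(a - c, 0.0).sum()

    w = [3.0, -1.0, 0.5, -4.0, 2.0]
    print(largest_k_norm(w, 2), largest_k_norm_variational(w, 2))  # both print 7.0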

  3. With this fact, for fixed k, a subgradient of the largest-k norm of \(\varvec{w}\in \mathbb {R}^{n}\) can be obtained in O(n) time by using a selection algorithm. Indeed, we only need the kth largest absolute value \(|w_{(k)}|\) and the list of all the elements whose ranks are smaller than k; we simply assign the value \(+1\) or \(-1\) to all the elements in this list. Finding the kth largest value in an array of size n takes only O(n) time [8], and the list can be built in O(n) time by scanning all the elements.
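
    A minimal NumPy sketch of this procedure (ours, not the authors' implementation), with np.argpartition playing the role of the linear-time selection algorithm:

    import numpy as np

    def largest_k_norm_subgradient(w, k):
        # a subgradient of |||.|||_k at w: +1 or -1 (the sign of w_i, with +1 for zeros)
        # on the k entries of largest absolute value, and 0 elsewhere; ties broken arbitrarily
        w = np.asarray(w, dtype=float)
        idx = np.argpartition(np.abs(w), -k)[-k:]  # k largest |w_i| in O(n) expected time
        s = np.zeros_like(w)
        s[idx] = np.where(w[idx] >= 0.0, 1.0, -1.0)
        return s

    print(largest_k_norm_subgradient([3.0, -1.0, 0.5, -4.0, 2.0], k=2))  # [ 1.  0.  0. -1.  0.]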

  4. In general, the proximal mapping of a function \(h: \mathbb {R}^n \rightarrow \mathbb {R}\) at \(\varvec{u}\in \mathbb {R}^{n}\) is defined as

    $$\begin{aligned} \mathrm{prox}_h({\varvec{u}}) := \mathop {{\text {argmin}}}\limits _{{\varvec{w}}} ~\frac{1}{2} \Vert \varvec{w}-{\varvec{u}}\Vert _2^2 +h(\varvec{w}). \end{aligned}$$

    For \(h(\varvec{w})=\beta \Vert {\varvec{w}}\Vert _1\), each element of \(\mathrm{prox}_h({\varvec{u}})\) is explicitly given by

    $$\begin{aligned} \mathrm{prox}_h({\varvec{u}})_i = \left\{ \begin{array}{ll} u_i - \beta , &{} \quad (u_i \ge \beta ),\\ 0, &{} \quad (-\beta \le u_i \le \beta ),\\ u_i + \beta , &{} \quad (u_i \le -\beta ). \end{array} \right. \end{aligned}$$
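
    A short NumPy sketch of this soft-thresholding formula (ours, for illustration only):

    import numpy as np

    def prox_l1(u, beta):
        # elementwise soft thresholding: shrink each u_i toward zero by beta
        u = np.asarray(u, dtype=float)
        return np.sign(u) * np.maximum(np.abs(u) - beta, 0.0)

    print(prox_l1([2.5, -0.3, 1.0, -4.0], beta=1.0))  # [ 1.5 -0.  0. -3.]
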
  5. Following [1, 41], the Ky Fan k norm of a matrix \(\varvec{W}\in \mathbb {R}^{m{\times }n}\) can be computed by solving the following semidefinite programming (SDP) problem:

    $$\begin{aligned} |||\varvec{W}|||_{k}= \min _{\varvec{W},\varvec{Z},c} \left\{ kc + \mathrm {Tr}(\varvec{Z}) : \varvec{Z} \succeq \begin{pmatrix} \varvec{O} &{} \varvec{W}^\top \\ \varvec{W} &{} \varvec{O} \end{pmatrix} -c \varvec{I},~\varvec{Z} \succeq \varvec{O} \right\} , \end{aligned}$$

    where \(\varvec{Z}\succeq \varvec{Y}\) denotes that \(\varvec{Z}-\varvec{Y}\) is positive semidefinite.
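
    As a cross-check, the following sketch (ours; CVXPY is our choice of modelling tool and is not used in the paper) solves this SDP and compares its value with the sum of the k largest singular values:

    import numpy as np
    import cvxpy as cp

    def ky_fan_k_norm_sdp(W, k):
        # the optimal value of the SDP above equals |||W|||_k
        m, n = W.shape
        Z = cp.Variable((m + n, m + n), symmetric=True)
        c = cp.Variable()
        A = np.block([[np.zeros((n, n)), W.T], [W, np.zeros((m, m))]])
        cons = [Z >> A - c * np.eye(m + n), Z >> 0]
        prob = cp.Problem(cp.Minimize(k * c + cp.trace(Z)), cons)
        prob.solve()
        return prob.value

    W = np.random.default_rng(0).standard_normal((4, 3))
    print(ky_fan_k_norm_sdp(W, 2))                       # SDP value
    print(np.linalg.svd(W, compute_uv=False)[:2].sum())  # sum of the 2 largest singular values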

  6. Problem (18) can be recast as the following problem:

    $$\begin{aligned} \begin{array}{cl} \underset{\varvec{W},\varvec{Z}_1,\varvec{Z}_2}{\text{ minimize }} &{} \displaystyle g(\varvec{W})+\frac{\rho }{2} (\mathrm {Tr}(\varvec{Z}_1)+\mathrm {Tr}(\varvec{Z}_2)) - \varvec{S}_{\varvec{W}}^{t-1} \bullet \varvec{W} \\ \text{ subject } \text{ to } &{} \varvec{W}\in {S},~ \begin{pmatrix} \varvec{Z}_1 &{} \varvec{W} \\ \varvec{W}^\top &{} \varvec{Z}_2 \end{pmatrix} \succeq \varvec{O}. \end{array} \end{aligned}$$

    If g is a linear function and S is given by a system of linear constraints on \(\mathbb {R}^{m\times n}\), the above problem can be solved by a standard (linear) SDP solver.
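
    A hedged CVXPY sketch of this reformulation (ours, not the authors' code): g is taken as an illustrative linear function \(g(\varvec{W})=\mathrm {Tr}(\varvec{G}^{\top }\varvec{W})\), an elementwise box plays the role of S, and a random matrix stands in for \(\varvec{S}_{\varvec{W}}^{t-1}\).

    import numpy as np
    import cvxpy as cp

    m, n, rho = 3, 4, 1.0
    rng = np.random.default_rng(0)
    G = rng.standard_normal((m, n))       # illustrative linear objective g(W) = Tr(G^T W)
    S_prev = rng.standard_normal((m, n))  # placeholder for the subgradient matrix S_W^{t-1}

    W = cp.Variable((m, n))
    Z1 = cp.Variable((m, m), symmetric=True)
    Z2 = cp.Variable((n, n), symmetric=True)
    objective = (cp.trace(G.T @ W) + (rho / 2) * (cp.trace(Z1) + cp.trace(Z2))
                 - cp.trace(S_prev.T @ W))
    constraints = [cp.bmat([[Z1, W], [W.T, Z2]]) >> 0,
                   W >= -1, W <= 1]       # elementwise box standing in for "W in S"
    cp.Problem(cp.Minimize(objective), constraints).solve()
    print(W.value)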

  7. All the numerical experiments in this section were performed on an Intel Core i7 2.9 GHz personal computer with 8GB of physical memory using Matlab (R2013a) with IBM ILOG CPLEX 12.

  8. In a similar manner to Corollary 4, we can provide a value of \(\rho \), above which (27) becomes an exact penalty formulation of the cardinality constraint. For a positive definite matrix \(\varvec{Q}\), an exact penalty is given as

    $$\begin{aligned} \rho > \max _{i}\left\{ \frac{8\Vert \varvec{q}\Vert _{2}}{\lambda _{\min }(\varvec{Q})}\left( |q_{i}|+\frac{2\Vert \varvec{Q}\varvec{e}_{i}\Vert _{2}\Vert \varvec{q}\Vert _{2}+\Vert \varvec{q}\Vert _{2}Q_{ii}}{\lambda _{\min }(\varvec{Q})}\right) \right\} . \end{aligned}$$

    Note that the above \(\rho \) for (27) is \(8\Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\) times as large as that for (6). See Appendix A.6 for the derivation of \(\rho \). As for \(M_j\), we have a naive bound \(M_j\ge 2\Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\) from the proof of Corollary 4, but to derive a penalty parameter value, we set \(M_j=4\Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\).
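
    For concreteness, a small NumPy sketch (ours) that evaluates this threshold for a given positive definite \(\varvec{Q}\) and vector \(\varvec{q}\):

    import numpy as np

    def penalty_threshold(Q, q):
        # right-hand side of the bound above; any rho strictly larger gives an exact penalty
        lam_min = np.linalg.eigvalsh(Q)[0]     # smallest eigenvalue of Q (assumed positive)
        qn = np.linalg.norm(q)
        col_norms = np.linalg.norm(Q, axis=0)  # ||Q e_i||_2 for each i
        inner = np.abs(q) + (2.0 * col_norms * qn + qn * np.diag(Q)) / lam_min
        return (8.0 * qn / lam_min) * inner.max()

    Q = np.array([[2.0, 0.3], [0.3, 1.5]])
    q = np.array([-1.0, 0.5])
    print(penalty_threshold(Q, q))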

  9. DCA for (27) repeats the following procedure:

    $$\begin{aligned} (\varvec{s}_{\varvec{w}}^{t-1},\varvec{v}_{u}^{t-1})&= (-\varvec{q},2\rho \varvec{u}^{t-1}),\\ (\varvec{w}^{t},\varvec{u}^{t})&\in \mathop {{\text {argmin}}}\limits _{\varvec{w},\varvec{u}}\left\{ \frac{1}{2}\varvec{w}^{\top }\varvec{Q}\varvec{w}-\varvec{w}^{\top }\varvec{s}_{\varvec{w}}^{t-1}-\varvec{u}^{\top }\varvec{v}_{u}^{t-1}: \begin{array}{l} -M_ju_j\le w_j\le M_ju_j,j=1,\ldots ,n,\\ \varvec{1}^{\top }\varvec{u}=K,\,\varvec{0}\le \varvec{u} \le \varvec{1} \end{array} \right\} . \end{aligned}$$
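
    A hedged CVXPY sketch of this iteration (ours; the starting point, the stopping rule, and the choice of the bounds M_j are illustrative, not the paper's settings):

    import numpy as np
    import cvxpy as cp

    def dca_for_27(Q, q, K, M, rho, max_iter=100, tol=1e-8):
        # Q: positive definite matrix, q: vector, M: vector of big-M bounds (M_1, ..., M_n)
        n = len(q)
        w_prev, u_prev = np.zeros(n), np.full(n, K / n)  # feasible starting point
        for _ in range(max_iter):
            s_w, v_u = -q, 2.0 * rho * u_prev            # dual step
            w, u = cp.Variable(n), cp.Variable(n)
            objective = 0.5 * cp.quad_form(w, Q) - s_w @ w - v_u @ u
            constraints = [w <= cp.multiply(M, u), -cp.multiply(M, u) <= w,
                           cp.sum(u) == K, u >= 0, u <= 1]
            cp.Problem(cp.Minimize(objective), constraints).solve()  # primal step: convex QP
            w_new, u_new = w.value, u.value
            converged = (np.linalg.norm(w_new - w_prev) + np.linalg.norm(u_new - u_prev) < tol)
            w_prev, u_prev = w_new, u_new
            if converged:
                break
        return w_prev, u_prev
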
  10. In the original program code of [53], the coefficient was 0.7, not 0.9. We slowed the decrease of \(\mu \) because the proximal gradient method without acceleration techniques requires many more iterations than the original code does.

  11. For the convex objective function \(f(\varvec{w})\), the rate of convergence to the optimal solution \(\varvec{w}^\star \) changes from \(f(\varvec{w}^t)-f(\varvec{w}^\star ) =O(1/t)\) to \(O(1/t^2)\).

References

  1. Alizadeh, F.: Interior point methods in semidefinite programming with applications to combinatorial optimization. SIAM J. Optim. 5(1), 13–51 (1995)

  2. Arthanari, T.S., Dodge, Y.: Mathematical Programming in Statistics, vol. 341. Wiley, New York (1981)

  3. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, Berlin (2011)

  4. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  5. Bertsimas, D., Pachamanova, D., Sim, M.: Robust linear optimization under general norms. Oper. Res. Lett. 32(6), 510–516 (2004)

  6. Bertsimas, D., King, A., Mazumder, R.: Best subset selection via a modern optimization lens. Ann. Stat. 44(2), 813–852 (2016)

  7. Blake, C.L., Merz, C.J.: UCI repository of machine learning databases, (1998). http://www.ics.uci.edu/~mlearn/MLRepository.html

  8. Blum, M., Floyd, R.W., Pratt, V., Rivest, R.L., Tarjan, R.E.: Time bounds for selection. J. Comput. Syst. Sci. 7, 448–461 (1973)

  9. Bradley, P.S., Mangasarian, O.L.: Feature selection via concave minimization and support vector machines. In: International Conference on Machine Learning, Volume 98, pp. 82–90. (1998)

  10. Brodie, J., Daubechies, I., De Mol, C., Giannone, D., Loris, I.: Sparse and stable Markowitz portfolios. Proc. Natl. Acad. Sci. 106(30), 12267–12272 (2009)

  11. Bruckstein, A.M., Donoho, D.L., Elad, M.: From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev. 51(1), 34–81 (2009)

  12. Cai, J., Candes, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)

  13. Cai, X., Nie, F., Huang, H.: Exact top-\(k\) feature selection via \(\ell _{2,0}\)-norm constraint. In: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (2013)

  14. Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)

  15. Candès, E.J., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)

  16. Donoho, D.L.: De-noising by soft thresholding. IEEE Trans. Inf. Theory 41(3), 613–627 (1995)

  17. Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)

  18. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)

  19. Gong, P., Zhang, C., Lu, Z., Huang, J., Ye, J.: A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In: Proceedings of International Conference on Machine Learning Volume 28, pp. 37–45. (2013)

  20. Gotoh, J., Uryasev, S.: Two pairs of families of polyhedral norms versus \(\ell _p\)-norms: proximity and applications in optimization. Math. Program. 156(1), 391–431 (2016)

  21. Gulpinar, N., Le Thi, H.A., Moeini, M.: Robust investment strategies with discrete asset choice constraints using DC programming. Optimization 59(1), 45–62 (2010)

  22. Hempel, A.B., Goulart, P.J.: A novel method for modelling cardinality and rank constraints. In: IEEE Conference on Decision and Control, pp. 4322–4327. Los Angeles, USA, December (2014). http://control.ee.ethz.ch/index.cgi?page=publications;action=details;id=4712

  23. Horst, R., Tuy, H.: Global Optimization: Deterministic Approaches, 3rd edn. Springer-Verlag, Berlin (1996)

  24. Hu, Y., Zhang, D., Ye, J., Li, X., He, X.: Fast and accurate matrix completion via truncated nuclear norm regularization. IEEE Trans. Pattern Anal. Mach. Intell. 35(9), 2117–2130 (2013)

  25. Le Thi, H.A., Pham Dinh, T.: The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Ann. Oper. Res. 133(1–4), 23–46 (2005)

  26. Le Thi, H.A., Pham Dinh, T.: DC approximation approaches for sparse optimization. Eur. J. Oper. Res. 244(1), 26–46 (2015)

  27. Le Thi, H.A., Pham Dinh, T., Muu, L.D.: Exact penalty in D.C. programming. Vietnam J. Math. 27(2), 169–178 (1999)

  28. Le Thi, H.A., Le, H.M., Nguyen, V.V., Pham Dinh, T.: A DC programming approach for feature selection in support vector machines learning. Adv. Data Anal. Classif. 2(3), 259–278 (2008)

  29. Le Thi, H.A., Pham Dinh, T., Yen, N.D.J.: Properties of two DC algorithms in quadratic programming. J. Glob. Optim. 49(3), 481–495 (2011)

  30. Le Thi, H.A., Pham Dinh, T., Ngai, H.V.J.: Exact penalty and error bounds in DC programming. J. Glob. Optim. 52(3), 509–535 (2012)

  31. Lu, C., Tang, J., Yan, S., Lin, Z.: Generalized nonconvex nonsmooth low-rank minimization. In: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 4130–4137. IEEE, (2014)

  32. Ma, S., Goldfarb, D., Chen, L.: Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program. 128(1), 321–353 (2011)

  33. Miyashiro, R., Takano, Y.: Mixed integer second-order cone programming formulations for variable selection in linear regression. Eur. J. Oper. Res. 247(3), 721–731 (2015a)

  34. Miyashiro, R., Takano, Y.: Subset selection by Mallow’s \(C_p\): a mixed integer programming approach. Expert Syst. Appl. 42(1), 325–331 (2015b)

  35. Moghaddam, B., Weiss, Y., Avidan, S.: Generalized spectral bounds for sparse LDA. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 641–648. ACM (2006)

  36. Natarajan, B.K.: Sparse approximate solutions to linear systems. SIAM J. Comput. 24(2), 227–234 (1995)

  37. Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Program. 140(1), 125–161 (2013)

  38. Nguyen, T.B.T., Le Thi, H.A., Le, H.M., Vo, X.T.: DC approximation approach for \(\ell _0\)-minimization in compressed sensing. In: Le Thi, H.A., Nguyen, N.T., Do, T.V. (eds.) Advanced Computational Methods for Knowledge Engineering, pp. 37–48. Springer, Berlin (2015)

  39. Nhat, P.D., Nguyen, M.C., Le Thi, H.A.: A DC programming approach for sparse linear discriminant analysis. In: Do, T.V., Le Thi, H.A., Nguyen, N.T. (eds.) Advanced Computational Methods for Knowledge Engineering, pp. 65–74. Springer, (2014)

  40. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, Berlin (2006)

  41. Overton, M.L., Womersley, R.S.: Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Math. Program. 62(1–3), 321–357 (1993)

  42. Pavlikov, K., Uryasev, S.: CVaR norm and applications in optimization. Optim. Lett. 8(7), 1999–2020 (2014)

  43. Pham Dinh, T., Le Thi, H.A.: Convex analysis approach to D.C. programming: theory, algorithms and applications. Acta Math. Vietnam. 22(1), 289–355 (1997)

  44. Pham Dinh, T., Le Thi, H.A.: A D.C. optimization algorithm for solving the trust-region subproblem. SIAM J. Optim. 8(2), 476–505 (1998)

  45. Pham Dinh, T., Le Thi, H.A.: Recent advances in DC programming and DCA. Trans. Comput. Collect. Intell. 8342, 1–37 (2014)

  46. Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010)

  47. Shevade, S.K., Keerthi, S.S.: A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19(17), 2246–2253 (2003)

  48. Smola, A.J., Vishwanathan, S.V.N., Hofmann, T.: Kernel methods for missing variables. In: Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pp. 325–332. (2005)

  49. Sriperumbudur, B.K., Lanckriet, G.R.G.: A proof of convergence of the concave-convex procedure using Zangwill’s theory. Neural Comput. 24(6), 1391–1407 (2012)

  50. Takeda, A., Niranjan, M., Gotoh, J., Kawahara, Y.: Simultaneous pursuit of out-of-sample performance and sparsity in index tracking portfolios. Comput. Manag. Sci. 10(1), 21–49 (2013)

  51. Thiao, M., Pham Dinh, T., Le Thi, H.A.: A DC programming approach for sparse eigenvalue problem. In: Proceedings of the 27th International Conference on Machine Learning, pp. 1063–1070. (2010)

  52. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)

  53. Toh, K.-C., Yun, S.: An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pac. J. Optim. 6(3), 615–640 (2010)

  54. Watson, G.A.: Linear best approximation using a class of polyhedral norms. Numer. Algorithms 2(3), 321–335 (1992)

  55. Watson, G.A.: On matrix approximation problems with Ky Fan \(k\) norms. Numer. Algorithms 5(5), 263–272 (1993)

  56. Wu, B., Ding, C., Sun, D.F., Toh, K.-C.: On the Moreau-Yosida regularization of the vector \(k\)-norm related functions. SIAM J. Optim. 24, 766–794 (2014)

  57. Zhang, C.H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)

  58. Zheng, X., Sun, X., Li, D., Sun, J.: Successive convex approximations to cardinality-constrained convex programs: a piecewise-linear DC approach. Comput. Optim. Appl. 59(1–2), 379–397 (2014)

  59. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Graph. Stat. 15(2), 265–286 (2006)


Acknowledgements

The research of the first author was supported by JSPS KAKENHI Grant Numbers 15K01204, 22510138, and 26242027. The research of the second author was supported by JST CREST Grant Number JPMJCR15K5, Japan. The authors are very grateful to the reviewers, whose comments enabled us to improve the readability of the paper.

Author information

Corresponding author

Correspondence to Akiko Takeda.

A Proofs for Derivation of Exact Penalty Parameter Values

A.1 Proof of Theorem 3

To prove the claim of Theorem 3, we use the following lemma:

Lemma 1

If \(f:\mathbb R^{n}\rightarrow \mathbb R\) is L-smooth, then for any \(\varvec{x},\varvec{y}\in \mathbb R^{n}\),

$$\begin{aligned} f(\varvec{y})\le f(\varvec{x})+\nabla f(\varvec{x})^{\top }(\varvec{y}-\varvec{x})+\frac{L}{2}\Vert \varvec{y}-\varvec{x}\Vert _{2}^{2}. \end{aligned}$$

For simplicity of notation, we use \(\varvec{w}^{\star }\) instead of \(\varvec{w}_\rho ^{\star }\) for an optimal solution of (6) with some \(\rho \). Assume by contradiction that \(\Vert \varvec{w}^{\star }\Vert _{0}>K\), which implies \(|w_{(K+1)}^{\star }| > 0\). We consider \({\tilde{\varvec{w}}}:=\varvec{w}^{\star }-w_{i}^{\star }\varvec{e}_{i}\), where i is the index of the \((K+1)\)-st largest element in absolute value and \(\varvec{e}_{i}\) denotes the unit vector in the i-th coordinate direction. Note that \({\tilde{\varvec{w}}}\) is a feasible solution to the given problem. Then from Lemma 1 and \(\Vert \varvec{w}^{\star }\Vert _{2}\le C\), we have

$$\begin{aligned}&f(\varvec{w}^{\star })+\rho (\Vert \varvec{w}^{\star }\Vert _{1}-|||\varvec{w}^{\star }|||_{K})-f({\tilde{\varvec{w}}})-\rho (\Vert {\tilde{\varvec{w}}}\Vert _{1}-|||{\tilde{\varvec{w}}}|||_{K})\\&\quad =f(\varvec{w}^{\star })-f(\varvec{w}^{\star }-w_{i}^{\star }\varvec{e}_{i})+\rho | w_{i}^{\star }|\\&\quad \ge \nabla f(\varvec{w}^{\star })^{\top }w_{i}^{\star }\varvec{e}_{i}-\frac{L}{2}w_{i}^{\star 2} +\rho | w_{i}^{\star }|\\&\quad \ge |w_{i}^{\star }|\left( \rho -\Vert \nabla f(\varvec{w}^{\star })\Vert _{2}-\frac{1}{2}LC\right) . \end{aligned}$$

By using the Lipschitz continuity of \(\nabla f\) again, we have

$$\begin{aligned} \Vert \nabla f(\varvec{w}^{\star })\Vert _{2}&\le \Vert \nabla f(\varvec{0})\Vert _{2}+\Vert \nabla f(\varvec{w}^{\star })-\nabla f(\varvec{0})\Vert _{2}\le \Vert \nabla f(\varvec{0})\Vert _{2}+LC. \end{aligned}$$

Thus we have

$$\begin{aligned}&f(\varvec{w}^{\star })+\rho (\Vert \varvec{w}^{\star }\Vert _{1}-|||\varvec{w}^{\star }|||_{K})-{f({\tilde{\varvec{w}}})}-\rho (\Vert {\tilde{\varvec{w}}}\Vert _{1}-|||{\tilde{\varvec{w}}}|||_{K})\\&\quad \ge |w_{i}^{\star }|\left( \rho -\Vert \nabla f(\varvec{0})\Vert _{2}-\frac{3}{2}LC\right) >0, \end{aligned}$$

which contradicts the optimality of \(\varvec{w}^\star \). \(\square \)

A.2 Proof of Corollary 3

Let \(\varvec{w}^\star \) be an optimal solution of (6) and assume by contradiction that \(\Vert \varvec{w}^{\star }\Vert _{0}>K\). We consider a feasible solution \({\tilde{\varvec{w}}}:=\varvec{w}^{\star }-w_{i}^{\star }(\varvec{e}_{i} - \varvec{e}_j)\), where i and j are the indices of the \((K+1)\)-st and K-th largest elements in absolute value and \(\varvec{e}_{i}\) and \(\varvec{e}_j\) denote the unit vectors in the corresponding coordinate directions. Then from Lemma 1, we have

$$\begin{aligned}&f(\varvec{w}^{\star })+\rho (\Vert \varvec{w}^{\star }\Vert _{1}-|||\varvec{w}^{\star }|||_{K})-f({\tilde{\varvec{w}}})-\rho (\Vert {\tilde{\varvec{w}}}\Vert _{1}-|||{\tilde{\varvec{w}}}|||_{K})\\&\quad =f(\varvec{w}^{\star })-f(\varvec{w}^{\star }-w_{i}^{\star }(\varvec{e}_{i}-\varvec{e}_j))+\rho | w_{i}^{\star }|\\&\quad \ge \nabla f(\varvec{w}^{\star })^{\top }w_{i}^{\star }(\varvec{e}_{i}-\varvec{e}_j)-\frac{L}{2}w_{i}^{\star 2}\Vert \varvec{e}_i-\varvec{e}_j\Vert _2^2 +\rho | w_{i}^{\star }|\\&\quad \ge |w_{i}^{\star }|\left( \rho -\Vert \nabla f(\varvec{w}^{\star })\Vert _{2}\Vert \varvec{e}_i-\varvec{e}_j\Vert _2-L|w_i^\star |\right) \\&\quad \ge |w_{i}^{\star }|\left( \rho -\sqrt{2}\Vert \nabla f(\varvec{w}^{\star })\Vert _{2}-L\right) . \end{aligned}$$

Then in the same way as in the proof of Theorem 3, we have

$$\begin{aligned} f(\varvec{w}^{\star })+\rho (\Vert \varvec{w}^{\star }\Vert _{1}-|||\varvec{w}^{\star }|||_{K})-{f({\tilde{\varvec{w}}})}-\rho (\Vert {\tilde{\varvec{w}}}\Vert _{1}-|||{\tilde{\varvec{w}}}|||_{K})>0, \end{aligned}$$

which contradicts the optimality of \(\varvec{w}^\star \). \(\square \)

A.3 Proof of Theorem 4

For simplicity of notation, we use \(\varvec{w}^{\star }\) instead of \(\varvec{w}_\rho ^{\star }\) for an optimal solution of (6) with some \(\rho \). Assume that \(\Vert \varvec{w}^{\star }\Vert _{0}>K\) and construct a feasible solution \(\tilde{\varvec{w}}:=\varvec{w}^{\star }-w^{\star }_{j}\varvec{e}_{j}\), where j is the index of the \((K+1)\)-st largest element of \((|w^{\star }_{1}|,\ldots ,|w^{\star }_{n}|)^{\top }\). Then, from the Cauchy–Schwarz inequality, we have

$$\begin{aligned}&\frac{1}{2}(\varvec{w}^{\star })^{\top }\varvec{Q}\varvec{w}^{\star }+\varvec{q}^{\top }\varvec{w}^{\star }+\rho (\Vert \varvec{w}^{\star }\Vert _{1}-|||\varvec{w}^{\star }|||_{K}) \\&\qquad -\frac{1}{2}\varvec{\tilde{w}}^{\top }\varvec{Q}\varvec{\tilde{w}}-\varvec{q}^{\top }\varvec{\tilde{w}}-\rho (\Vert {\tilde{\varvec{w}}}\Vert _{1}-|||{\tilde{\varvec{w}}}|||_{K})\\&\quad = {w}^{\star }_{j}\varvec{e}_{j}^{\top }\varvec{Q}\varvec{w}^{\star }-\frac{1}{2}{w^{\star }_{j}}^{2}Q_{jj}+w^{\star }_{j}q_{j}+\rho |w^{\star }_{j}|\\&\quad \ge -|{w}^{\star }_{j}|\Vert \varvec{Q}\varvec{e}_{j}\Vert _{2}\Vert \varvec{w}^{\star }\Vert _{2}-\frac{1}{2}|w^{\star }_{j}|\Vert \varvec{w}^{\star }\Vert _{2}|Q_{jj}| \\&\qquad -|q_{j}||w^{\star }_{j}|+\rho |w^{\star }_{j}| ~~ \left( \because |w^{\star }_j|\le \Vert \varvec{w}^{\star }\Vert _2\right) \\&\quad \ge |{w}^{\star }_{j}|\left( \rho -C\Vert \varvec{Q}\varvec{e}_{j}\Vert _{2}-C|Q_{jj}|/2-|q_{j}|\right) >0, \end{aligned}$$

which contradicts the optimality of \(\varvec{w}^{\star }\). \(\square \)

A.4 Proof of Corollary 4

Let \(\varvec{w}^{\star }\) be an optimal solution of (6) and assume by contradiction that \(\Vert \varvec{w}^{\star }\Vert _{2}> 2 \Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\). We have

$$\begin{aligned} f(\varvec{w}^{\star })= & {} \frac{1}{2}(\varvec{w}^{\star })^{\top }\varvec{Q}\varvec{w}^{\star }+\varvec{q}^{\top }\varvec{w}^{\star }+\rho (\Vert \varvec{w}^{\star }\Vert _{1}-|||\varvec{w}^{\star }|||_{K}) \\\ge & {} \frac{1}{2}\lambda _{\min }(\varvec{Q})\Vert \varvec{w}^{\star }\Vert _{2}^{2}-\Vert \varvec{q}\Vert _{2}\Vert \varvec{w}^{\star }\Vert _{2}>0. \end{aligned}$$

On the other hand, since \(f(\varvec{0})=0\), the optimal value of (6) must be nonpositive. Hence it suffices to consider the case \(\Vert \varvec{w}^{\star }\Vert _{2}\le 2 \Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\). Applying \(C=2 \Vert \varvec{q}\Vert _{2}/{\lambda }_{\min }(\varvec{Q})\) to Theorem 4, we have the desired result. \(\square \)
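
For reference, substituting \(C=2\Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\) into the bound of Theorem 4 yields the corresponding exact penalty threshold (our restatement; cf. Corollary 4):

$$\begin{aligned} \rho > \max _{i}\left\{ |q_{i}|+\frac{2\Vert \varvec{q}\Vert _{2}}{\lambda _{\min }(\varvec{Q})}\left( \Vert \varvec{Q}\varvec{e}_{i}\Vert _{2}+\frac{|Q_{ii}|}{2}\right) \right\} . \end{aligned}$$

This is the quantity that footnote 8 refers to: the threshold given there for (27) is \(8\Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\) times as large.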

A.5 Proof of Theorem 5

For simplicity of notation, we use \(\varvec{W}^{\star }\) instead of \(\varvec{W}_\rho ^{\star }\). Assume by contradiction that \(\mathrm {rank}(\varvec{W}^{\star })>K\) and construct a feasible solution \(\varvec{\tilde{W}}:=\varvec{W}^{\star }-\varvec{W}^{\star }\varvec{y}_{K+1}\varvec{y}_{K+1}^{\top }\), where \(\varvec{y}_{K+1}\) is the \((K+1)\)-st leading eigenvector of \(\varvec{W}^{\star \top }\varvec{W}^{\star }\) with \(\Vert \varvec{y}_{K+1}\Vert _{2}=1\). Then, from the quadratic upper bound implied by the Lipschitz continuity of \(\nabla f\) (cf. Lemma 1), we have

$$\begin{aligned}&f(\varvec{W}^{\star })+\rho (\Vert \varvec{W}^{\star }\Vert _*-|||\varvec{W}^{\star }|||_{K})-f(\varvec{\tilde{W}})-\rho (\Vert \varvec{\tilde{W}}\Vert _*-|||\varvec{\tilde{W}}|||_{K})\\&\quad =f(\varvec{W}^{\star })-f(\varvec{W}^{\star }-\varvec{W}^{\star }\varvec{y}_{K+1}\varvec{y}_{K+1}^{\top })+\rho \sigma _{K+1}\\&\quad \ge \nabla f(\varvec{W}^{\star })\bullet (\varvec{W}^{\star }\varvec{y}_{K+1}\varvec{y}_{K+1}^{\top })-\frac{L}{2}\Vert \varvec{W}^{\star }\varvec{y}_{K+1}\varvec{y}_{K+1}^{\top }\Vert _{\mathrm {F}}^{2}+\rho \sigma _{K+1}\\&\quad \ge \rho \sigma _{K+1}-\Vert \nabla f(\varvec{W}^{\star })\Vert _{\mathrm {F}}\Vert \varvec{W}^{\star }\varvec{y}_{K+1}\varvec{y}_{K+1}^{\top }\Vert _{\mathrm {F}}-\frac{L}{2}\sigma _{K+1}^{2}\\&\quad \ge \rho \sigma _{K+1}-\sigma _{K+1}(\Vert \nabla f(\varvec{O})\Vert _{\mathrm {F}}+LC)-\frac{LC}{2}\sigma _{K+1}\\&\quad \ge \sigma _{K+1}\left( \rho -\Vert \nabla f(\varvec{O})\Vert _{\mathrm {F}}-\frac{3}{2}LC\right) >0, \end{aligned}$$

which contradicts the optimality of \(\varvec{W}^\star \). \(\square \)

A.6 Exact penalty parameter for the MIP-based formulation

Theorem 6

Let \((\varvec{w}^\star ,\varvec{u}^\star )\) be an arbitrary optimal solution of (27). Then \(\varvec{w}^\star \) is optimal to the corresponding cardinality-constrained problem if

$$\begin{aligned} \rho > \max _{i}\left\{ \frac{8\Vert \varvec{q}\Vert _{2}}{\lambda _{\min }(\varvec{Q})}\left( |q_{i}|+\frac{2\Vert \varvec{Q}\varvec{e}_{i}\Vert _{2}\Vert \varvec{q}\Vert _{2}+\Vert \varvec{q}\Vert _{2}Q_{ii}}{\lambda _{\min }(\varvec{Q})}\right) \right\} . \end{aligned}$$

Proof

In the same manner as in the proof of Corollary 4, \(\varvec{w}^{\star }\) is bounded by \(\Vert \varvec{w}^{\star }\Vert _{2}\le 2\Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\). Assume by contradiction that \(\varvec{1}^{\top }\varvec{u}^{\star }-(\varvec{u}^{\star })^{\top }\varvec{u}^{\star }>0\).

  • If there exists j such that \(0<u^{\star }_{j}\le 1/2\), by constructing a feasible solution \((\tilde{\varvec{w}},\tilde{\varvec{u}})=(\varvec{w}^{\star }-w^{\star }_{j}\varvec{e}_{j},\varvec{u}^{\star }-u^{\star }_{j}\varvec{e}_{j})\), we have

    $$\begin{aligned}&\frac{1}{2}(\varvec{w}^{\star })^{\top }\varvec{Q}\varvec{w}^{\star }+\varvec{q}^{\top }\varvec{w}^{\star }+\rho (\varvec{1}^{\top }\varvec{u}^{\star }-(\varvec{u}^{\star })^{\top }\varvec{u}^\star ) -\frac{1}{2}\varvec{\tilde{w}}^{\top }\varvec{Q}\varvec{\tilde{w}}-\varvec{q}^{\top }\varvec{\tilde{w}}-\rho (\varvec{1}^{\top }\varvec{\tilde{u}}-\varvec{\tilde{u}}^{\top }\varvec{\tilde{u}})\\&\quad =w^{\star }_{j}\varvec{e}_{j}^{\top }\varvec{Q}\varvec{w}^{\star }-\frac{1}{2}{w^{\star }_{j}}^{2}Q_{jj}+w^{\star }_{j}q_{j}+\rho (u^{\star }_{j}-(u^{\star }_{j})^{2})\\&\quad \ge w^{\star }_{j}\varvec{e}_{j}^{\top }\varvec{Q}\varvec{w}^{\star }-\frac{1}{2}{w^{\star }_{j}}^{2}Q_{jj}+w^{\star }_{j}q_{j}+\frac{1}{2}\rho u^{\star }_{j}\quad (\because x-x^{2}\ge x/2 \ \text{ for } \ 0<x\le 1/2)\\&\quad \ge \frac{1}{2}\rho u^{\star }_{j}-|w^{\star }_{j}|\Vert \varvec{Q}\varvec{e}_{j}\Vert _{2}\Vert \varvec{w}^{\star } \Vert _{2}-\frac{1}{2}|w^{\star }_{j}|\Vert \varvec{w}^{\star }\Vert _{2}Q_{jj}-|w^{\star }_{j}||q_{j}|\\&\quad \ge \frac{\rho }{2M}|w^{\star }_{j}|-\frac{2|w^{\star }_{j}|\Vert \varvec{Q}\varvec{e}_{j}\Vert _{2}\Vert \varvec{q}\Vert _{2}}{\lambda _{\min }(\varvec{Q})}-\frac{|w^{\star }_{j}|\Vert \varvec{q}\Vert _{2}Q_{jj}}{\lambda _{\min }(\varvec{Q})}-|w^{\star }_{j}||q_{j}|\\&\quad \ge \frac{|w^{\star }_{j}|}{2M}\left[ \rho -\frac{8\Vert \varvec{q}\Vert _{2}}{\lambda _{\min }(\varvec{Q})}\left( |q_{j}|+\frac{2\Vert \varvec{Q}\varvec{e}_{j}\Vert _{2}\Vert \varvec{q}\Vert _{2}+\Vert \varvec{q}\Vert _{2}Q_{jj}}{\lambda _{\min }(\varvec{Q})}\right) \right] >0. \end{aligned}$$
  • Otherwise, if \(u^{\star }_{j}\notin \{0,1\}\), then \(1/2< u^{\star }_{j}<1\).

    • Case: \(\varvec{1}^{\top }\varvec{u}^{\star }<K\). We can take \(\varepsilon >0\) such that \(\varvec{1}^{\top }\varvec{u}^{\star }+\varepsilon \le K\) and \(u^{\star }_{j}+\varepsilon \le 1\). Then by putting \((\tilde{\varvec{w}},\tilde{\varvec{u}})=(\varvec{w}^{\star },\varvec{u}^{\star }+\varepsilon \varvec{e}_{j})\), we have

      $$\begin{aligned}&f(\varvec{w}^{\star },\varvec{u}^{\star })-f(\tilde{\varvec{w}},\tilde{\varvec{u}}) \\&\quad =\rho [\varvec{1}^{\top }\varvec{u}^{\star }-(\varvec{u}^{\star })^{\top }\varvec{u}^{\star }-\varvec{1}^{\top }(\varvec{u}^{\star }+\varepsilon \varvec{e}_{j})+(\varvec{u}^{\star }+\varepsilon \varvec{e}_{j})^{\top }(\varvec{u}^{\star }+\varepsilon \varvec{e}_{j})]\\&\quad =\rho \left( -\varepsilon +2\varepsilon u^{\star }_j+\varepsilon ^{2}\right) >0. \end{aligned}$$
    • Case: \(\varvec{1}^{\top }\varvec{u}^{\star }=K\). There exists \(i\ (\ne j)\) such that \(1/2< u^{\star }_{i}<1\). Assume \(u^{\star }_{i}\ge u^{\star }_{j}\) without loss of generality. If we choose \(\varepsilon >0\) such that \(u^{\star }_{i}+\varepsilon \le 1\) and \(u^{\star }_{j}-\varepsilon \ge 1/2\), a solution \((\tilde{\varvec{w}},\tilde{\varvec{u}})=(\varvec{w}^{\star },\varvec{u}^{\star }+\varepsilon \varvec{e}_{i}-\varepsilon \varvec{e}_{j})\) is feasible, since \(M(u^{\star }_{j}-\varepsilon )\ge M/2\ge |w_{j}^\star |\). Then we have

      $$\begin{aligned} f(\varvec{w}^{\star },\varvec{u}^{\star })-f(\tilde{\varvec{w}},\tilde{\varvec{u}})&=\rho (-(\varvec{u}^{\star })^{\top }\varvec{u}^{\star }+(\varvec{u}^{\star }+\varepsilon \varvec{e}_{i}-\varepsilon \varvec{e}_{j})^{\top }(\varvec{u}^{\star }+\varepsilon \varvec{e}_{i}-\varepsilon \varvec{e}_{j}))\\&=\rho (2\varepsilon u^{\star }_{i}-2\varepsilon u^{\star }_{j}+2\varepsilon ^{2})>0. \end{aligned}$$

\(\square \)


Cite this article

Gotoh, Jy., Takeda, A. & Tono, K. DC formulations and algorithms for sparse optimization problems. Math. Program. 169, 141–176 (2018). https://doi.org/10.1007/s10107-017-1181-0
