
DC formulations and algorithms for sparse optimization problems

Full Length Paper, Mathematical Programming, Series B

Abstract

We propose a DC (Difference of two Convex functions) formulation approach for sparse optimization problems having a cardinality or rank constraint. With the largest-k norm, an exact DC representation of the cardinality constraint is provided. We then transform the cardinality-constrained problem into a penalty function form and derive exact penalty parameter values for some optimization problems, especially for quadratic minimization problems which often appear in practice. A DC Algorithm (DCA) is presented, where the dual step at each iteration can be efficiently carried out due to the accessible subgradient of the largest-k norm. Furthermore, we can solve each DCA subproblem in linear time via a soft thresholding operation if there are no additional constraints. The framework is extended to the rank-constrained problem as well as the cardinality- and the rank-minimization problems. Numerical experiments demonstrate the efficiency of the proposed DCA in comparison with existing methods which have other penalty terms.


Notes

  1. When a state-of-the-art solver is employed, (1) can be represented with the so-called Specially Ordered Sets of Type 1 (SOS-1), in which case one does not have to worry about the big-M constants. Note that the approach proposed in [45] is based on DCA, and the magnitude of M can affect the performance of the DCA.

  2. In general, for \(k\in [1,n]\), it holds that \(|||\varvec{w}|||_{k}=\min _c\left\{ kc+\sum _{i=1}^n[|w_{i}|-c ]^+\right\} \), which can be further rewritten as a linear program and solved in O(n) time. For further properties of the norm, see [5, 20, 42, 56].
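
    As an illustration (ours, not from the paper's code), the following NumPy sketch evaluates the largest-k norm both directly and via the variational formula above, using the fact that \(c=|w_{(k)}|\) attains the minimum; the example vector is arbitrary.

    import numpy as np

    def largest_k_norm(w, k):
        # sum of the k largest absolute entries of w
        a = np.abs(np.asarray(w, dtype=float))
        return np.sort(a)[-k:].sum()

    def largest_k_norm_variational(w, k):
        # evaluate k*c + sum_i [|w_i| - c]^+ at c = |w_(k)|, which attains the minimum above
        a = np.abs(np.asarray(w, dtype=float))
        c = np.sort(a)[-k]
        return k * c + np.maximum(a - c, 0.0).sum()

    w = [3.0, -1.0, 0.5, -4.0, 2.0]
    print(largest_k_norm(w, 2), largest_k_norm_variational(w, 2))  # both print 7.0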

  3. With this fact, for fixed k, a subgradient of the largest-k norm of \(\varvec{w}\in \mathbb {R}^{n}\) can be obtained in O(n) time by using a selection algorithm. Indeed, we only need the kth largest absolute value \(|w_{(k)}|\) and the list of all the elements whose ranks are smaller than k; we simply assign the value \(+1\) or \(-1\) to all the elements in this list. Finding the kth largest value in an array of size n takes only O(n) time [8], and the list can be built in O(n) time by scanning all the elements.
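
    A minimal NumPy sketch of this procedure (ours, not the authors' implementation), with np.argpartition playing the role of the linear-time selection algorithm:

    import numpy as np

    def largest_k_norm_subgradient(w, k):
        # a subgradient of |||.|||_k at w: +1 or -1 (the sign of w_i, with +1 for zeros)
        # on the k entries of largest absolute value, and 0 elsewhere; ties broken arbitrarily
        w = np.asarray(w, dtype=float)
        idx = np.argpartition(np.abs(w), -k)[-k:]  # k largest |w_i| in O(n) expected time
        s = np.zeros_like(w)
        s[idx] = np.where(w[idx] >= 0.0, 1.0, -1.0)
        return s

    print(largest_k_norm_subgradient([3.0, -1.0, 0.5, -4.0, 2.0], k=2))  # [ 1.  0.  0. -1.  0.]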

  4. In general, the proximal mapping of a function \(h: \mathbb {R}^n \rightarrow \mathbb {R}\) at \(\varvec{u}\in \mathbb {R}^{n}\) is defined as

    $$\begin{aligned} \mathrm{prox}_h({\varvec{u}}) := \mathop {{\text {argmin}}}\limits _{{\varvec{w}}} ~\frac{1}{2} \Vert \varvec{w}-{\varvec{u}}\Vert _2^2 +h(\varvec{w}). \end{aligned}$$

    For \(h(\varvec{w})=\beta \Vert {\varvec{w}}\Vert _1\), each element of \(\mathrm{prox}_h({\varvec{u}})\) is explicitly given by

    $$\begin{aligned} \mathrm{prox}_h({\varvec{u}})_i = \left\{ \begin{array}{ll} u_i - \beta , &{} \quad (u_i \ge \beta ),\\ 0, &{} \quad (-\beta \le u_i \le \beta ),\\ u_i + \beta , &{} \quad (u_i \le -\beta ). \end{array} \right. \end{aligned}$$
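
    A short NumPy sketch of this soft-thresholding formula (ours, for illustration only):

    import numpy as np

    def prox_l1(u, beta):
        # elementwise soft thresholding: shrink each u_i toward zero by beta
        u = np.asarray(u, dtype=float)
        return np.sign(u) * np.maximum(np.abs(u) - beta, 0.0)

    print(prox_l1([2.5, -0.3, 1.0, -4.0], beta=1.0))  # [ 1.5 -0.  0. -3.]
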
  5. Following [1, 41], the Ky Fan k norm of a matrix \(\varvec{W}\in \mathbb {R}^{m{\times }n}\) can be computed by solving the following semidefinite programming (SDP) problem:

    $$\begin{aligned} |||\varvec{W}|||_{k}= \min _{\varvec{W},\varvec{Z},c} \left\{ kc + \mathrm {Tr}(\varvec{Z}) : \varvec{Z} \succeq \begin{pmatrix} \varvec{O} &{} \varvec{W}^\top \\ \varvec{W} &{} \varvec{O} \end{pmatrix} -c \varvec{I},~\varvec{Z} \succeq \varvec{O} \right\} , \end{aligned}$$

    where \(\varvec{Z}\succeq \varvec{Y}\) denotes that \(\varvec{Z}-\varvec{Y}\) is positive semidefinite.
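
    As a cross-check, the following sketch (ours; CVXPY is our choice of modelling tool and is not used in the paper) solves this SDP and compares its value with the sum of the k largest singular values:

    import numpy as np
    import cvxpy as cp

    def ky_fan_k_norm_sdp(W, k):
        # the optimal value of the SDP above equals |||W|||_k
        m, n = W.shape
        Z = cp.Variable((m + n, m + n), symmetric=True)
        c = cp.Variable()
        A = np.block([[np.zeros((n, n)), W.T], [W, np.zeros((m, m))]])
        cons = [Z >> A - c * np.eye(m + n), Z >> 0]
        prob = cp.Problem(cp.Minimize(k * c + cp.trace(Z)), cons)
        prob.solve()
        return prob.value

    W = np.random.default_rng(0).standard_normal((4, 3))
    print(ky_fan_k_norm_sdp(W, 2))                       # SDP value
    print(np.linalg.svd(W, compute_uv=False)[:2].sum())  # sum of the 2 largest singular values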

  6. Problem (18) can be recast as the following problem:

    $$\begin{aligned} \begin{array}{cl} \underset{\varvec{W},\varvec{Z}_1,\varvec{Z}_2}{\text{ minimize }} &{} \displaystyle g(\varvec{W})+\frac{\rho }{2} (\mathrm {Tr}(\varvec{Z}_1)+\mathrm {Tr}(\varvec{Z}_2)) - \varvec{S}_{\varvec{W}}^{t-1} \bullet \varvec{W} \\ \text{ subject } \text{ to } &{} \varvec{W}\in {S},~ \begin{pmatrix} \varvec{Z}_1 &{} \varvec{W} \\ \varvec{W}^\top &{} \varvec{Z}_2 \end{pmatrix} \succeq \varvec{O}. \end{array} \end{aligned}$$

    If g is a linear function and S is given by a system of linear constraints on \(\mathbb {R}^{m\times n}\), the above problem can be solved by a standard (linear) SDP solver.
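
    A hedged CVXPY sketch of this reformulation (ours, not the authors' code): g is taken as an illustrative linear function \(g(\varvec{W})=\mathrm {Tr}(\varvec{G}^{\top }\varvec{W})\), an elementwise box plays the role of S, and a random matrix stands in for \(\varvec{S}_{\varvec{W}}^{t-1}\).

    import numpy as np
    import cvxpy as cp

    m, n, rho = 3, 4, 1.0
    rng = np.random.default_rng(0)
    G = rng.standard_normal((m, n))       # illustrative linear objective g(W) = Tr(G^T W)
    S_prev = rng.standard_normal((m, n))  # placeholder for the subgradient matrix S_W^{t-1}

    W = cp.Variable((m, n))
    Z1 = cp.Variable((m, m), symmetric=True)
    Z2 = cp.Variable((n, n), symmetric=True)
    objective = (cp.trace(G.T @ W) + (rho / 2) * (cp.trace(Z1) + cp.trace(Z2))
                 - cp.trace(S_prev.T @ W))
    constraints = [cp.bmat([[Z1, W], [W.T, Z2]]) >> 0,
                   W >= -1, W <= 1]       # elementwise box standing in for "W in S"
    cp.Problem(cp.Minimize(objective), constraints).solve()
    print(W.value)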

  7. All the numerical experiments in this section were performed on an Intel Core i7 2.9 GHz personal computer with 8GB of physical memory using Matlab (R2013a) with IBM ILOG CPLEX 12.

  8. In a similar manner to Corollary 4, we can provide a value of \(\rho \), above which (27) becomes an exact penalty formulation of the cardinality constraint. For a positive definite matrix \(\varvec{Q}\), an exact penalty is given as

    $$\begin{aligned} \rho > \max _{i}\left\{ \frac{8\Vert \varvec{q}\Vert _{2}}{\lambda _{\min }(\varvec{Q})}\left( |q_{i}|+\frac{2\Vert \varvec{Q}\varvec{e}_{i}\Vert _{2}\Vert \varvec{q}\Vert _{2}+\Vert \varvec{q}\Vert _{2}Q_{ii}}{\lambda _{\min }(\varvec{Q})}\right) \right\} . \end{aligned}$$

    Note that the above \(\rho \) for (27) is \(8\Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\) times as large as that for (6). See Appendix A.6 for the derivation of \(\rho \). As for \(M_j\), we have a naive bound \(M_j\ge 2\Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\) from the proof of Corollary 4, but to derive a penalty parameter value, we set \(M_j=4\Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\).
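
    For concreteness, a small NumPy sketch (ours) that evaluates this threshold for a given positive definite \(\varvec{Q}\) and vector \(\varvec{q}\):

    import numpy as np

    def penalty_threshold(Q, q):
        # right-hand side of the bound above; any rho strictly larger gives an exact penalty
        lam_min = np.linalg.eigvalsh(Q)[0]     # smallest eigenvalue of Q (assumed positive)
        qn = np.linalg.norm(q)
        col_norms = np.linalg.norm(Q, axis=0)  # ||Q e_i||_2 for each i
        inner = np.abs(q) + (2.0 * col_norms * qn + qn * np.diag(Q)) / lam_min
        return (8.0 * qn / lam_min) * inner.max()

    Q = np.array([[2.0, 0.3], [0.3, 1.5]])
    q = np.array([-1.0, 0.5])
    print(penalty_threshold(Q, q))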

  9. DCA for (27) repeats the following procedure:

    $$\begin{aligned} (\varvec{s}_{\varvec{w}}^{t-1},\varvec{v}_{u}^{t-1})&= (-\varvec{q},2\rho \varvec{u}^{t-1}),\\ (\varvec{w}^{t},\varvec{u}^{t})&\in \mathop {{\text {argmin}}}\limits _{\varvec{w},\varvec{u}}\left\{ \frac{1}{2}\varvec{w}^{\top }\varvec{Q}\varvec{w}-\varvec{w}^{\top }\varvec{s}_{\varvec{w}}^{t-1}-\varvec{u}^{\top }\varvec{v}_{u}^{t-1}: \begin{array}{l} -M_ju_j\le w_j\le M_ju_j,j=1,\ldots ,n,\\ \varvec{1}^{\top }\varvec{u}=K,\,\varvec{0}\le \varvec{u} \le \varvec{1} \end{array} \right\} . \end{aligned}$$
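
    A hedged CVXPY sketch of this iteration (ours; the starting point, the stopping rule, and the choice of the bounds M_j are illustrative, not the paper's settings):

    import numpy as np
    import cvxpy as cp

    def dca_for_27(Q, q, K, M, rho, max_iter=100, tol=1e-8):
        # Q: positive definite matrix, q: vector, M: vector of big-M bounds (M_1, ..., M_n)
        n = len(q)
        w_prev, u_prev = np.zeros(n), np.full(n, K / n)  # feasible starting point
        for _ in range(max_iter):
            s_w, v_u = -q, 2.0 * rho * u_prev            # dual step
            w, u = cp.Variable(n), cp.Variable(n)
            objective = 0.5 * cp.quad_form(w, Q) - s_w @ w - v_u @ u
            constraints = [w <= cp.multiply(M, u), -cp.multiply(M, u) <= w,
                           cp.sum(u) == K, u >= 0, u <= 1]
            cp.Problem(cp.Minimize(objective), constraints).solve()  # primal step: convex QP
            w_new, u_new = w.value, u.value
            converged = (np.linalg.norm(w_new - w_prev) + np.linalg.norm(u_new - u_prev) < tol)
            w_prev, u_prev = w_new, u_new
            if converged:
                break
        return w_prev, u_prev
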
  10. In the original program code of [53], the coefficient was 0.7, not 0.9. We slowed the decrease of \(\mu \) because the proximal gradient method without acceleration techniques requires many more iterations than the original code does.

  11. For the convex objective function \(f(\varvec{w})\), the rate of convergence to the optimal solution \(\varvec{w}^\star \) changes from \(f(\varvec{w}^t)-f(\varvec{w}^\star ) =O(1/t)\) to \(O(1/t^2)\).

References

  1. Alizadeh, F.: Interior point methods in semidefinite programming with applications to combinatorial optimization. SIAM J. Optim. 5(1), 13–51 (1995)

  2. Arthanari, T.S., Dodge, Y.: Mathematical Programming in Statistics, vol. 341. Wiley, New York (1981)

  3. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, Berlin (2011)

  4. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  5. Bertsimas, D., Pachamanova, D., Sim, M.: Robust linear optimization under general norms. Oper. Res. Lett. 32(6), 510–516 (2004)

  6. Bertsimas, D., King, A., Mazumder, R.: Best subset selection via a modern optimization lens. Ann. Stat. 44(2), 813–852 (2016)

  7. Blake, C.L., Merz, C.J.: UCI repository of machine learning databases, (1998). http://www.ics.uci.edu/~mlearn/MLRepository.html

  8. Blum, M., Floyd, R.W., Pratt, V., Rivest, R.L., Tarjan, R.E.: Time bounds for selection. J. Comput. Syst. Sci. 7, 448–461 (1973)

  9. Bradley, P.S., Mangasarian, O.L.: Feature selection via concave minimization and support vector machines. In: International Conference on Machine Learning, Volume 98, pp. 82–90. (1998)

  10. Brodie, J., Daubechies, I., De Mol, C., Giannone, D., Loris, I.: Sparse and stable Markowitz portfolios. Proc. Natl. Acad. Sci. 106(30), 12267–12272 (2009)

  11. Bruckstein, A.M., Donoho, D.L., Elad, M.: From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev. 51(1), 34–81 (2009)

  12. Cai, J., Candes, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)

  13. Cai, X., Nie, F., Huang, H.: Exact top-\(k\) feature selection via \(\ell _{2,0}\)-norm constraint. In: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (2013)

  14. Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)

  15. Candès, E.J., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)

  16. Donoho, D.L.: De-noising by soft thresholding. IEEE Trans. Inf. Theory 41(3), 613–627 (1995)

  17. Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)

  18. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)

  19. Gong, P., Zhang, C., Lu, Z., Huang, J., Ye, J.: A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In: Proceedings of International Conference on Machine Learning Volume 28, pp. 37–45. (2013)

  20. Gotoh, J., Uryasev, S.: Two pairs of families of polyhedral norms versus \(\ell _p\)-norms: proximity and applications in optimization. Math. Program. 156(1), 391–431 (2016)

  21. Gulpinar, N., Le Thi, H.A., Moeini, M.: Robust investment strategies with discrete asset choice constraints using DC programming. Optimization 59(1), 45–62 (2010)

  22. Hempel, A.B., Goulart, P.J.: A novel method for modelling cardinality and rank constraints. In: IEEE Conference on Decision and Control, pp. 4322–4327. Los Angeles, USA, December (2014). http://control.ee.ethz.ch/index.cgi?page=publications;action=details;id=4712

  23. Horst, R., Tuy, H.: Global Optimization: Deterministic Approaches, 3rd edn. Springer-Verlag, Berlin (1996)

  24. Hu, Y., Zhang, D., Ye, J., Li, X., He, X.: Fast and accurate matrix completion via truncated nuclear norm regularization. IEEE Trans. Pattern Anal. Mach. Intell. 35(9), 2117–2130 (2013)

  25. Le Thi, H.A., Pham Dinh, T.: The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Ann. Oper. Res. 133(1–4), 23–46 (2005)

  26. Le Thi, H.A., Pham Dinh, T.: DC approximation approaches for sparse optimization. Eur. J. Oper. Res. 244(1), 26–46 (2015)

  27. Le Thi, H.A., Pham Dinh, T., Muu, L.D.: Exact penalty in D.C. programming. Vietnam J. Math. 27(2), 169–178 (1999)

  28. Le Thi, H.A., Le, H.M., Nguyen, V.V., Pham Dinh, T.: A DC programming approach for feature selection in support vector machines learning. Adv. Data Anal. Classif. 2(3), 259–278 (2008)

  29. Le Thi, H.A., Pham Dinh, T., Yen, N.D.J.: Properties of two DC algorithms in quadratic programming. J. Glob. Optim. 49(3), 481–495 (2011)

  30. Le Thi, H.A., Pham Dinh, T., Ngai, H.V.J.: Exact penalty and error bounds in DC programming. J. Glob. Optim. 52(3), 509–535 (2012)

  31. Lu, C., Tang, J., Yan, S., Lin, Z.: Generalized nonconvex nonsmooth low-rank minimization. In: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 4130–4137. IEEE, (2014)

  32. Ma, S., Goldfarb, D., Chen, L.: Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program. 128(1), 321–353 (2011)

  33. Miyashiro, R., Takano, Y.: Mixed integer second-order cone programming formulations for variable selection in linear regression. Eur. J. Oper. Res. 247(3), 721–731 (2015a)

  34. Miyashiro, R., Takano, Y.: Subset selection by Mallow’s \(C_p\): a mixed integer programming approach. Expert Syst. Appl. 42(1), 325–331 (2015b)

  35. Moghaddam, B., Weiss, Y., Avidan, S.: Generalized spectral bounds for sparse LDA. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 641–648. ACM (2006)

  36. Natarajan, B.K.: Sparse approximate solutions to linear systems. SIAM J. Comput. 24(2), 227–234 (1995)

  37. Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Program. 140(1), 125–161 (2013)

  38. Nguyen, T.B.T., Le Thi, H.A., Le, H.M., Vo, X.T.: DC approximation approach for \(\ell _0\)-minimization in compressed sensing. In: Le Thi, H.A., Nguyen, N.T., Do, T.V. (eds.) Advanced Computational Methods for Knowledge Engineering, pp. 37–48. Springer, Berlin (2015)

  39. Nhat, P.D., Nguyen, M.C., Le Thi, H.A.: A DC programming approach for sparse linear discriminant analysis. In: Do, T.V., Le Thi, H.A., Nguyen, N.T. (eds.) Advanced Computational Methods for Knowledge Engineering, pp. 65–74. Springer, (2014)

  40. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, Berlin (2006)

  41. Overton, M.L., Womersley, R.S.: Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Math. Program. 62(1–3), 321–357 (1993)

  42. Pavlikov, K., Uryasev, S.: CVaR norm and applications in optimization. Optim. Lett. 8(7), 1999–2020 (2014)

  43. Pham Dinh, T., Le Thi, H.A.: Convex analysis approach to D.C. programming: theory, algorithms and applications. Acta Math. Vietnam. 22(1), 289–355 (1997)

  44. Pham Dinh, T., Le Thi, H.A.: A D.C. optimization algorithm for solving the trust-region subproblem. SIAM J. Optim. 8(2), 476–505 (1998)

  45. Pham Dinh, T., Le Thi, H.A.: Recent advances in DC programming and DCA. Trans. Comput. Collect. Intell. 8342, 1–37 (2014)

  46. Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010)

  47. Shevade, S.K., Keerthi, S.S.: A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19(17), 2246–2253 (2003)

  48. Smola, A.J., Vishwanathan, S.V.N., Hofmann, T.: Kernel methods for missing variables. In: Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pp. 325–332. (2005)

  49. Sriperumbudur, B.K., Lanckriet, G.R.G.: A proof of convergence of the concave-convex procedure using Zangwill’s theory. Neural Comput. 24(6), 1391–1407 (2012)

  50. Takeda, A., Niranjan, M., Gotoh, J., Kawahara, Y.: Simultaneous pursuit of out-of-sample performance and sparsity in index tracking portfolios. Comput. Manag. Sci. 10(1), 21–49 (2013)

  51. Thiao, M., Pham Dinh, T., Le Thi, H.A.: A DC programming approach for sparse eigenvalue problem. In: Proceedings of the 27th International Conference on Machine Learning, pp. 1063–1070. (2010)

  52. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)

  53. Toh, K.-C., Yun, S.: An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pac. J. Optim. 6(3), 615–640 (2010)

  54. Watson, G.A.: Linear best approximation using a class of polyhedral norms. Numer. Algorithms 2(3), 321–335 (1992)

  55. Watson, G.A.: On matrix approximation problems with Ky Fan \(k\) norms. Numer. Algorithms 5(5), 263–272 (1993)

  56. Wu, B., Ding, C., Sun, D.F., Toh, K.-C.: On the Moreau-Yosida regularization of the vector \(k\)-norm related functions. SIAM J. Optim. 24, 766–794 (2014)

  57. Zhang, C.H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)

  58. Zheng, X., Sun, X., Li, D., Sun, J.: Successive convex approximations to cardinality-constrained convex programs: a piecewise-linear DC approach. Comput. Optim. Appl. 59(1–2), 379–397 (2014)

  59. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Graph. Stat. 15(2), 265–286 (2006)


Acknowledgements

The research of the first author was supported by JSPS KAKENHI Grant Numbers 15K01204, 22510138, and 26242027. The research of the second author was supported by JST CREST Grant Number JPMJCR15K5, Japan. The authors are very grateful to the reviewers, whose comments enabled us to improve the readability of the paper.

Author information

Corresponding author

Correspondence to Akiko Takeda.

A Proofs for Derivation of Exact Penalty Parameter Values

A.1 Proof of Theorem 3

To prove the claim of Theorem 3, we use the following lemma:

Lemma 1

If \(f:\mathbb R^{n}\rightarrow \mathbb R\) is L-smooth, then for any \(\varvec{x},\varvec{y}\in \mathbb R^{n}\),

$$\begin{aligned} f(\varvec{y})\le f(\varvec{x})+\nabla f(\varvec{x})^{\top }(\varvec{y}-\varvec{x})+\frac{L}{2}\Vert \varvec{y}-\varvec{x}\Vert _{2}^{2}. \end{aligned}$$

For simplicity of notation, we use \(\varvec{w}^{\star }\) instead of \(\varvec{w}_\rho ^{\star }\) for an optimal solution of (6) with some \(\rho \). Assume by contradiction that \(\Vert \varvec{w}^{\star }\Vert _{0}>K\), which implies \(|w_{(K+1)}^{\star }| > 0\). We consider \({\tilde{\varvec{w}}}:=\varvec{w}^{\star }-w_{i}^{\star }\varvec{e}_{i}\), where i is the index of the \((K+1)\)-st largest element in absolute value and \(\varvec{e}_{i}\) denotes the unit vector in the i-th coordinate direction. Note that \({\tilde{\varvec{w}}}\) is a feasible solution to the given problem. Then from Lemma 1 and \(\Vert \varvec{w}^{\star }\Vert _{2}\le C\), we have

$$\begin{aligned}&f(\varvec{w}^{\star })+\rho (\Vert \varvec{w}^{\star }\Vert _{1}-|||\varvec{w}^{\star }|||_{K})-f({\tilde{\varvec{w}}})-\rho (\Vert {\tilde{\varvec{w}}}\Vert _{1}-|||{\tilde{\varvec{w}}}|||_{K})\\&\quad =f(\varvec{w}^{\star })-f(\varvec{w}^{\star }-w_{i}^{\star }\varvec{e}_{i})+\rho | w_{i}^{\star }|\\&\quad \ge \nabla f(\varvec{w}^{\star })^{\top }w_{i}^{\star }\varvec{e}_{i}-\frac{L}{2}w_{i}^{\star 2} +\rho | w_{i}^{\star }|\\&\quad \ge |w_{i}^{\star }|\left( \rho -\Vert \nabla f(\varvec{w}^{\star })\Vert _{2}-\frac{1}{2}LC\right) . \end{aligned}$$

By using the Lipschitz continuity of \(\nabla f\) again, we have

$$\begin{aligned} \Vert \nabla f(\varvec{w}^{\star })\Vert _{2}&\le \Vert \nabla f(\varvec{0})\Vert _{2}+\Vert \nabla f(\varvec{w}^{\star })-\nabla f(\varvec{0})\Vert _{2}\le \Vert \nabla f(\varvec{0})\Vert _{2}+LC. \end{aligned}$$

Thus we have

$$\begin{aligned}&f(\varvec{w}^{\star })+\rho (\Vert \varvec{w}^{\star }\Vert _{1}-|||\varvec{w}^{\star }|||_{K})-{f({\tilde{\varvec{w}}})}-\rho (\Vert {\tilde{\varvec{w}}}\Vert _{1}-|||{\tilde{\varvec{w}}}|||_{K})\\&\quad \ge |w_{i}^{\star }|\left( \rho -\Vert \nabla f(\varvec{0})\Vert _{2}-\frac{3}{2}LC\right) >0, \end{aligned}$$

which contradicts the optimality of \(\varvec{w}^\star \). \(\square \)

A.2 Proof of Corollary 3

Let \(\varvec{w}^\star \) be an optimal solution of (6) and assume by contradiction that \(\Vert \varvec{w}^{\star }\Vert _{0}>K\). We consider a feasible solution \({\tilde{\varvec{w}}}:=\varvec{w}^{\star }-w_{i}^{\star }(\varvec{e}_{i} - \varvec{e}_j)\), where i and j are the indices of the \((K+1)\)-st and K-th largest elements in absolute value and \(\varvec{e}_{i}\) and \(\varvec{e}_j\) denote the unit vectors in the corresponding coordinate directions. Then from Lemma 1, we have

$$\begin{aligned}&f(\varvec{w}^{\star })+\rho (\Vert \varvec{w}^{\star }\Vert _{1}-|||\varvec{w}^{\star }|||_{K})-f({\tilde{\varvec{w}}})-\rho (\Vert {\tilde{\varvec{w}}}\Vert _{1}-|||{\tilde{\varvec{w}}}|||_{K})\\&\quad =f(\varvec{w}^{\star })-f(\varvec{w}^{\star }-w_{i}^{\star }(\varvec{e}_{i}-\varvec{e}_j))+\rho | w_{i}^{\star }|\\&\quad \ge \nabla f(\varvec{w}^{\star })^{\top }w_{i}^{\star }(\varvec{e}_{i}-\varvec{e}_j)-\frac{L}{2}w_{i}^{\star 2}\Vert \varvec{e}_i-\varvec{e}_j\Vert _2^2 +\rho | w_{i}^{\star }|\\&\quad \ge |w_{i}^{\star }|\left( \rho -\Vert \nabla f(\varvec{w}^{\star })\Vert _{2}\Vert \varvec{e}_i-\varvec{e}_j\Vert _2-L|w_i^\star |\right) \\&\quad \ge |w_{i}^{\star }|\left( \rho -\sqrt{2}\Vert \nabla f(\varvec{w}^{\star })\Vert _{2}-L\right) . \end{aligned}$$

Then in the same way as in the proof of Theorem 3, we have

$$\begin{aligned} f(\varvec{w}^{\star })+\rho (\Vert \varvec{w}^{\star }\Vert _{1}-|||\varvec{w}^{\star }|||_{K})-{f({\tilde{\varvec{w}}})}-\rho (\Vert {\tilde{\varvec{w}}}\Vert _{1}-|||{\tilde{\varvec{w}}}|||_{K})>0, \end{aligned}$$

which contradicts the optimality of \(\varvec{w}^\star \). \(\square \)

A.3 Proof of Theorem 4

For simplicity of notation, we use \(\varvec{w}^{\star }\) instead of \(\varvec{w}_\rho ^{\star }\) for an optimal solution of (6) with some \(\rho \). Assume that \(\Vert \varvec{w}^{\star }\Vert _{0}>K\) and construct a feasible solution \(\tilde{\varvec{w}}:=\varvec{w}^{\star }-w^{\star }_{j}\varvec{e}_{j}\), where j is the index of the \((K+1)\)-st largest element of \((|w^{\star }_{1}|,\ldots ,|w^{\star }_{n}|)^{\top }\). Then, from the Cauchy–Schwarz inequality, we have

$$\begin{aligned}&\frac{1}{2}(\varvec{w}^{\star })^{\top }\varvec{Q}\varvec{w}^{\star }+\varvec{q}^{\top }\varvec{w}^{\star }+\rho (\Vert \varvec{w}^{\star }\Vert _{1}-|||\varvec{w}^{\star }|||_{K}) \\&\qquad -\frac{1}{2}\varvec{\tilde{w}}^{\top }\varvec{Q}\varvec{\tilde{w}}-\varvec{q}^{\top }\varvec{\tilde{w}}-\rho (\Vert {\tilde{\varvec{w}}}\Vert _{1}-|||{\tilde{\varvec{w}}}|||_{K})\\&\quad = {w}^{\star }_{j}\varvec{e}_{j}^{\top }\varvec{Q}\varvec{w}^{\star }-\frac{1}{2}{w^{\star }_{j}}^{2}Q_{jj}+w^{\star }_{j}q_{j}+\rho |w^{\star }_{j}|\\&\quad \ge -|{w}^{\star }_{j}|\Vert \varvec{Q}\varvec{e}_{j}\Vert _{2}\Vert \varvec{w}^{\star }\Vert _{2}-\frac{1}{2}|w^{\star }_{j}|\Vert \varvec{w}^{\star }\Vert _{2}|Q_{jj}| \\&\qquad -|q_{j}||w^{\star }_{j}|+\rho |w^{\star }_{j}| ~~ \left( \because |w^{\star }_j|\le \Vert \varvec{w}^{\star }\Vert _2\right) \\&\quad \ge |{w}^{\star }_{j}|\left( \rho -C\Vert \varvec{Q}\varvec{e}_{j}\Vert _{2}-C|Q_{jj}|/2-|q_{j}|\right) >0, \end{aligned}$$

which contradicts the optimality of \(\varvec{w}^{\star }\). \(\square \)

A.4 Proof of Corollary 4

Let \(\varvec{w}^{\star }\) be an optimal solution of (6) and assume by contradiction that \(\Vert \varvec{w}^{\star }\Vert _{2}> 2 \Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\). We have

$$\begin{aligned} f(\varvec{w}^{\star })= & {} \frac{1}{2}(\varvec{w}^{\star })^{\top }\varvec{Q}\varvec{w}^{\star }+\varvec{q}^{\top }\varvec{w}^{\star }+\rho (\Vert \varvec{w}^{\star }\Vert _{1}-|||\varvec{w}^{\star }|||_{K}) \\\ge & {} \frac{1}{2}\lambda _{\min }(\varvec{Q})\Vert \varvec{w}^{\star }\Vert _{2}^{2}-\Vert \varvec{q}\Vert _{2}\Vert \varvec{w}^{\star }\Vert _{2}>0. \end{aligned}$$

On the other hand, since \(f(\varvec{0})=0\), the optimal value of (6) must be nonpositive. Hence it suffices to consider the case \(\Vert \varvec{w}^{\star }\Vert _{2}\le 2 \Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\). Applying \(C=2 \Vert \varvec{q}\Vert _{2}/{\lambda }_{\min }(\varvec{Q})\) to Theorem 4, we have the desired result. \(\square \)
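
For reference, substituting \(C=2\Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\) into the bound of Theorem 4 yields the corresponding exact penalty threshold (our restatement; cf. Corollary 4):

$$\begin{aligned} \rho > \max _{i}\left\{ |q_{i}|+\frac{2\Vert \varvec{q}\Vert _{2}}{\lambda _{\min }(\varvec{Q})}\left( \Vert \varvec{Q}\varvec{e}_{i}\Vert _{2}+\frac{|Q_{ii}|}{2}\right) \right\} . \end{aligned}$$

This is the quantity that footnote 8 refers to: the threshold given there for (27) is \(8\Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\) times as large.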

A.5 Proof of Theorem 5

For simplicity of notation, we use \(\varvec{W}^{\star }\) instead of \(\varvec{W}_\rho ^{\star }\). Assume by contradiction that \(\mathrm {rank}(\varvec{W}^{\star })>K\) and construct a feasible solution \(\varvec{\tilde{W}}:=\varvec{W}^{\star }-\varvec{W}^{\star }\varvec{y}_{K+1}\varvec{y}_{K+1}^{\top }\), where \(\varvec{y}_{K+1}\) is the \((K+1)\)-st leading eigenvector of \(\varvec{W}^{\star \top }\varvec{W}^{\star }\) with \(\Vert \varvec{y}_{K+1}\Vert _{2}=1\). Then, from the quadratic upper bound implied by the Lipschitz continuity of \(\nabla f\) (cf. Lemma 1), we have

$$\begin{aligned}&f(\varvec{W}^{\star })+\rho (\Vert \varvec{W}^{\star }\Vert _*-|||\varvec{W}^{\star }|||_{K})-f(\varvec{\tilde{W}})-\rho (\Vert \varvec{\tilde{W}}\Vert _*-|||\varvec{\tilde{W}}|||_{K})\\&\quad =f(\varvec{W}^{\star })-f(\varvec{W}^{\star }-\varvec{W}^{\star }\varvec{y}_{K+1}\varvec{y}_{K+1}^{\top })+\rho \sigma _{K+1}\\&\quad \ge \nabla f(\varvec{W}^{\star })\bullet (\varvec{W}^{\star }\varvec{y}_{K+1}\varvec{y}_{K+1}^{\top })-\frac{L}{2}\Vert \varvec{W}^{\star }\varvec{y}_{K+1}\varvec{y}_{K+1}^{\top }\Vert _{\mathrm {F}}^{2}+\rho \sigma _{K+1}\\&\quad \ge \rho \sigma _{K+1}-\Vert \nabla f(\varvec{W}^{\star })\Vert _{\mathrm {F}}\Vert \varvec{W}^{\star }\varvec{y}_{K+1}\varvec{y}_{K+1}^{\top }\Vert _{\mathrm {F}}-\frac{L}{2}\sigma _{K+1}^{2}\\&\quad \ge \rho \sigma _{K+1}-\sigma _{K+1}(\Vert \nabla f(\varvec{O})\Vert _{\mathrm {F}}+LC)-\frac{LC}{2}\sigma _{K+1}\\&\quad \ge \sigma _{K+1}\left( \rho -\Vert \nabla f(\varvec{O})\Vert _{\mathrm {F}}-\frac{3}{2}LC\right) >0, \end{aligned}$$

which contradicts the optimality of \(\varvec{W}^\star \). \(\square \)

A.6 Exact penalty parameter for the MIP-based formulation

Theorem 6

Let \((\varvec{w}^\star ,\varvec{u}^\star )\) be an arbitrary optimal solution of (27). Then \(\varvec{w}^\star \) is optimal to the corresponding cardinality-constrained problem if

$$\begin{aligned} \rho > \max _{i}\left\{ \frac{8\Vert \varvec{q}\Vert _{2}}{\lambda _{\min }(\varvec{Q})}\left( |q_{i}|+\frac{2\Vert \varvec{Q}\varvec{e}_{i}\Vert _{2}\Vert \varvec{q}\Vert _{2}+\Vert \varvec{q}\Vert _{2}Q_{ii}}{\lambda _{\min }(\varvec{Q})}\right) \right\} . \end{aligned}$$

Proof

In the same manner as in the proof of Corollary 4, \(\varvec{w}^{\star }\) is bounded by \(\Vert \varvec{w}^{\star }\Vert _{2}\le 2\Vert \varvec{q}\Vert _{2}/\lambda _{\min }(\varvec{Q})\). Assume by contradiction that \(\varvec{1}^{\top }\varvec{u}^{\star }-(\varvec{u}^{\star })^{\top }\varvec{u}^{\star }>0\).

  • If there exists j such that \(0<u^{\star }_{j}\le 1/2\), by constructing a feasible solution \((\tilde{\varvec{w}},\tilde{\varvec{u}})=(\varvec{w}^{\star }-w^{\star }_{j}\varvec{e}_{j},\varvec{u}^{\star }-u^{\star }_{j}\varvec{e}_{j})\), we have

    $$\begin{aligned}&\frac{1}{2}(\varvec{w}^{\star })^{\top }\varvec{Q}\varvec{w}^{\star }+\varvec{q}^{\top }\varvec{w}^{\star }+\rho (\varvec{1}^{\top }\varvec{u}^{\star }-(\varvec{u}^{\star })^{\top }\varvec{u}^\star ) -\frac{1}{2}\varvec{\tilde{w}}^{\top }\varvec{Q}\varvec{\tilde{w}}-\varvec{q}^{\top }\varvec{\tilde{w}}-\rho (\varvec{1}^{\top }\varvec{\tilde{u}}-\varvec{\tilde{u}}^{\top }\varvec{\tilde{u}})\\&\quad =w^{\star }_{j}\varvec{e}_{j}^{\top }\varvec{Q}\varvec{w}^{\star }-\frac{1}{2}{w^{\star }_{j}}^{2}Q_{jj}+w^{\star }_{j}q_{j}+\rho (u^{\star }_{j}-(u^{\star }_{j})^{2})\\&\quad \ge w^{\star }_{j}\varvec{e}_{j}^{\top }\varvec{Q}\varvec{w}^{\star }-\frac{1}{2}{w^{\star }_{j}}^{2}Q_{jj}+w^{\star }_{j}q_{j}+\frac{1}{2}\rho u^{\star }_{j}\quad (\because x-x^{2}\ge x/2 \ \text{ for } \ 0<x\le 1/2)\\&\quad \ge \frac{1}{2}\rho u^{\star }_{j}-|w^{\star }_{j}|\Vert \varvec{Q}\varvec{e}_{j}\Vert _{2}\Vert \varvec{w}^{\star } \Vert _{2}-\frac{1}{2}|w^{\star }_{j}|\Vert \varvec{w}^{\star }\Vert _{2}Q_{jj}-|w^{\star }_{j}||q_{j}|\\&\quad \ge \frac{\rho }{2M}|w^{\star }_{j}|-\frac{2|w^{\star }_{j}|\Vert \varvec{Q}\varvec{e}_{j}\Vert _{2}\Vert \varvec{q}\Vert _{2}}{\lambda _{\min }(\varvec{Q})}-\frac{|w^{\star }_{j}|\Vert \varvec{q}\Vert _{2}Q_{jj}}{\lambda _{\min }(\varvec{Q})}-|w^{\star }_{j}||q_{j}|\\&\quad \ge \frac{|w^{\star }_{j}|}{2M}\left[ \rho -\frac{8\Vert \varvec{q}\Vert _{2}}{\lambda _{\min }(\varvec{Q})}\left( |q_{j}|+\frac{2\Vert \varvec{Q}\varvec{e}_{j}\Vert _{2}\Vert \varvec{q}\Vert _{2}+\Vert \varvec{q}\Vert _{2}Q_{jj}}{\lambda _{\min }(\varvec{Q})}\right) \right] >0. \end{aligned}$$
  • Otherwise, if \(u^{\star }_{j}\notin \{0,1\}\), then \(1/2< u^{\star }_{j}<1\).

    • Case: \(\varvec{1}^{\top }\varvec{u}^{\star }<K\). We can take \(\varepsilon >0\) such that \(\varvec{1}^{\top }\varvec{u}^{\star }+\varepsilon \le K\) and \(u^{\star }_{j}+\varepsilon \le 1\). Then by putting \((\tilde{\varvec{w}},\tilde{\varvec{u}})=(\varvec{w}^{\star },\varvec{u}^{\star }+\varepsilon \varvec{e}_{j})\), we have

      $$\begin{aligned}&f(\varvec{w}^{\star },\varvec{u}^{\star })-f(\tilde{\varvec{w}},\tilde{\varvec{u}}) \\&\quad =\rho [\varvec{1}^{\top }\varvec{u}^{\star }-(\varvec{u}^{\star })^{\top }\varvec{u}^{\star }-\varvec{1}^{\top }(\varvec{u}^{\star }+\varepsilon \varvec{e}_{j})+(\varvec{u}^{\star }+\varepsilon \varvec{e}_{j})^{\top }(\varvec{u}^{\star }+\varepsilon \varvec{e}_{j})]\\&\quad =\rho \left( -\varepsilon +2\varepsilon u^{\star }_j+\varepsilon ^{2}\right) >0. \end{aligned}$$
    • Case: \(\varvec{1}^{\top }\varvec{u}^{\star }=K\). There exists \(i\ (\ne j)\) such that \(1/2< u^{\star }_{i}<1\). Assume \(u^{\star }_{i}\ge u^{\star }_{j}\) without loss of generality. If we choose \(\varepsilon >0\) such that \(u^{\star }_{i}+\varepsilon \le 1\) and \(u^{\star }_{j}-\varepsilon \ge 1/2\), a solution \((\tilde{\varvec{w}},\tilde{\varvec{u}})=(\varvec{w}^{\star },\varvec{u}^{\star }+\varepsilon \varvec{e}_{i}-\varepsilon \varvec{e}_{j})\) is feasible, since \(M(u^{\star }_{j}-\varepsilon )\ge M/2\ge |w_{j}^\star |\). Then we have

      $$\begin{aligned} f(\varvec{w}^{\star },\varvec{u}^{\star })-f(\tilde{\varvec{w}},\tilde{\varvec{u}})&=\rho (-(\varvec{u}^{\star })^{\top }\varvec{u}^{\star }+(\varvec{u}^{\star }+\varepsilon \varvec{e}_{i}-\varepsilon \varvec{e}_{j})^{\top }(\varvec{u}^{\star }+\varepsilon \varvec{e}_{i}-\varepsilon \varvec{e}_{j}))\\&=\rho (2\varepsilon u^{\star }_{i}-2\varepsilon u^{\star }_{j}+2\varepsilon ^{2})>0. \end{aligned}$$

\(\square \)


Cite this article

Gotoh, Jy., Takeda, A. & Tono, K. DC formulations and algorithms for sparse optimization problems. Math. Program. 169, 141–176 (2018). https://doi.org/10.1007/s10107-017-1181-0
