
Sparse learning via Boolean relaxations

  • Full Length Paper
  • Series B
  • Mathematical Programming

Abstract

We introduce novel relaxations for cardinality-constrained learning problems, including least-squares regression as a special but important case. Our approach is based on reformulating a cardinality-constrained problem exactly as a Boolean program, to which standard convex relaxations such as the Lasserre and Sherali-Adams hierarchies can be applied. We analyze the first-order relaxation in detail, deriving necessary and sufficient conditions for exactness in a unified manner. In the special case of least-squares regression, we show that these conditions are satisfied with high probability for random ensembles satisfying suitable incoherence conditions, similar to results on \(\ell _1\)-relaxations. In contrast to known methods, our relaxations yield lower bounds on the objective, and it can be verified whether or not the relaxation is exact. If it is not, we show that randomization based on the relaxed solution offers a principled way to generate provably good feasible solutions. This property enables us to obtain high quality estimates even if incoherence conditions are not met, as might be expected in real datasets. We numerically illustrate the performance of the relaxation-randomization strategy in both synthetic and real high-dimensional datasets, revealing substantial improvements relative to \(\ell _1\)-based methods and greedy selection heuristics.

Notes

  1. Taken from the Princeton University Gene Expression Project; for original source and further details please see the references therein.

  2. Taken from FDA-NCI Clinical Proteomics Program Databank; for original source and further details please see the references therein.

References

  1. Ahlswede, R., Winter, A.: Strong converse for identification via quantum channels. IEEE Trans. Inf. Theory 48(3), 569–579 (2002)

  2. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Proceedings of the 2nd international symposium on information theory, Tsahkadsor, Armenia, USSR (September 1971)

  3. Bickel, P.J., Ritov, Y., Tsybakov, A.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37(4), 1705–1732 (2009)

  4. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge, UK (2004)

  5. Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data. Springer Series in Statistics. Springer, Berlin (2011)

  6. Candes, E.J., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)

  7. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines (and Other Kernel Based Learning Methods). Cambridge University Press, Cambridge (2000)

  8. CVX Research, Inc.: CVX: Matlab software for disciplined convex programming, version 2.0 (August 2012)

  9. d’Aspremont, A., El Ghaoui, L.: Testing the nullspace property using semidefinite programming. Technical report, Princeton (2009)

  10. Davidson, K.R., Szarek, S.J.: Local operator theory, random matrices and Banach spaces. Handbook of Banach Spaces, vol. 1, pp. 317–336. Elsevier, Amsterdam (2001)

  11. Dekel, O., Singer, Y.: Support vector machines on a budget. Adv. Neural Inf. Process. Syst. 19, 345 (2007)

  12. Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)

  13. Donoho, D.L., Elad, M., Temlyakov, V.M.: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inf. Theory 52(1), 6–18 (2006)

  14. Fuchs, J.J.: Recovery of exact sparse representations in the presence of noise. ICASSP 2, 533–536 (2004)

  15. Grant, M., Boyd, S.: Graph implementations for nonsmooth convex programs. In: Blondel, V., Boyd, S., Kimura, H. (eds.) Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pp. 95–110. Springer, Berlin (2008)

  16. Lasserre, J.B.: An explicit exact SDP relaxation for nonlinear 0–1 programs. In: Aardal, K., Gerards, A.M.H. (eds.) Lecture Notes in Computer Science, vol. 2081, pp. 293–303 (2001)

  17. Laurent, M.: A comparison of the Sherali-Adams, Lovász-Schrijver and Lasserre relaxations for 0–1 programming. Math. Oper. Res. 28, 470–496 (2003)

  18. Ledoux, M.: The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI (2001)

  19. Lovász, L., Schrijver, A.: Cones of matrices and set-functions and 0–1 optimization. SIAM J. Optim. 1, 166–190 (1991)

  20. Markowitz, H.M.: Portfolio Selection. Wiley, New York (1959)

  21. McCullagh, P., Nelder, J.A.: Generalized Linear Models. Monographs on Statistics and Applied Probability 37. Chapman and Hall/CRC, New York (1989)

  22. Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection with the Lasso. Ann. Stat. 34, 1436–1462 (2006)

  23. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge, UK (1995)

  24. Negahban, S., Ravikumar, P., Wainwright, M.J., Yu, B.: Restricted strong convexity and generalized linear models. Technical report, UC Berkeley, Department of Statistics (August 2011)

  25. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL) (2005)

  26. Oliveira, R.I.: Sums of random Hermitian matrices and an inequality by Rudelson. Elec. Comm. Prob. 15, 203–212 (2010)

  27. Pilanci, M., El Ghaoui, L., Chandrasekaran, V.: Recovery of sparse probability measures via convex programming. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 2420–2428. Curran Associates, Inc. (2012)

  28. Schmidt, M., van den Berg, E., Friedlander, M., Murphy, K.: Optimizing costly functions with simple constraints: a limited-memory projected quasi-Newton algorithm. AISTATS 2009, 5 (2009)

  29. Sherali, H.D., Adams, W.P.: A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM J. Discrete Math. 3, 411–430 (1990)

  30. Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)

  31. Tropp, J.A.: Just relax: Convex programming methods for subset selection and sparse approximation. ICES Report 04–04, UT-Austin, February (2004)

  32. Wainwright, M.J.: Information-theoretic bounds on sparsity recovery in the high-dimensional and noisy setting. IEEE Trans. Inf. Theory 55, 5728–5741 (2009)

  33. Wainwright, M.J.: Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell _1\)-constrained quadratic programming (Lasso). IEEE Trans. Inf. Theory 55, 2183–2202 (2009)

  34. Wainwright, M.J.: Structured regularizers: statistical and computational issues. Annu. Rev. Stat. Appl. 1, 233–253 (2014)

  35. Wainwright, M.J., Jordan, M.I.: Treewidth-based conditions for exactness of the Sherali-Adams and Lasserre relaxations. Technical report, UC Berkeley, Department of Statistics, No. 671 (September 2004)

  36. Wasserman, L.: Bayesian model selection and model averaging. J. Math. Psychol. 44(1), 92–107 (2000)

  37. Zhang, Y., Wainwright, M.J., Jordan, M.I.: Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In COLT conference, Barcelona, Spain, (June 2014). Full length version at http://arxiv.org/abs/1402.1918

  38. Zhao, P., Yu, B.: On model selection consistency of Lasso. J. Mach. Learn. Res. 7, 2541–2567 (2006)

  39. Zou, H., Hastie, T.J.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67(2), 301–320 (2005)

Acknowledgments

Authors MP and MJW were partially supported by Office of Naval Research MURI grant N00014-11-1-0688, and National Science Foundation Grants CIF-31712-23800 and DMS-1107000. In addition, MP was supported by a Microsoft Research Fellowship.

Author information

Corresponding author

Correspondence to Laurent El Ghaoui.

Appendix: Proofs

In this appendix, we provide the proofs of Theorems 2 and 3.

1.1 Proof of Theorem 2

Recalling the definition (20) of the matrix \(M\), for each \(j \in \{1, \ldots , d\}\), define the rescaled random variable \(U_j := \frac{X_j^T M y}{\rho n}\). In terms of this notation, it suffices to find a scalar \(\lambda \) such that

$$\begin{aligned} \min _{j \in S} |U_j| > \lambda \quad \text{ and } \quad \max _{j \in S^c} |U_j| < \lambda . \end{aligned}$$
(32)

By definition, we have \(y = X_Sw^*_S+ \varepsilon \), whence

$$\begin{aligned} U_j&= \underbrace{\frac{X_j^T MX_Sw^*_S}{\rho n}}_{A_j} \quad + \quad \underbrace{\frac{X_j^T M\varepsilon }{\rho n}}_{B_j}. \end{aligned}$$

Based on this decomposition, we then make the following claims:

Lemma 1

There are numerical constants \(c_1, c_2\) such that

$$\begin{aligned} \mathbb {P}\big [ \max _{j = 1, \ldots , d} |B_j| \ge t \big ]&\le c_1 e^{-c_2 \frac{n\, t^2}{\gamma ^2} + \log d}. \end{aligned}$$
(33)

Lemma 2

There are numerical constants \(c_1, \ldots , c_4\) such that

$$\begin{aligned} \mathbb {P}\big [ \min _{j \in S} |A_j| < \frac{w_{\mathrm{min}}}{4} \big ]&\le c_1 e^{- c_2 n\frac{w_{\mathrm{min}}^2}{\Vert w^*_S\Vert _2^2}+\log (2k)} \quad \text{ and } \end{aligned}$$
(34a)
$$\begin{aligned} \mathbb {P}\big [ \max _{j \in S^c} |A_j| \ge \frac{w_{\mathrm{min}}}{16} \big ]&\le c_3 e^{- c_4 n\frac{w_{\mathrm{min}}^2}{\Vert w^*_S\Vert _2^2} + \log (d- k)}. \end{aligned}$$
(34b)

Using these two lemmas, we can now complete the proof. Recall that Theorem 2 assumes a lower bound of the form \(n> c_0 \frac{\gamma ^2 + \Vert w^*_S\Vert _2^2}{w_{\mathrm{min}}^2} \, \log d\), where \(c_0\) is a sufficiently large constant. Thus, setting \(t = \frac{w_{\mathrm{min}}}{16}\) in Lemma 1 ensures that \(\max \nolimits _{j = 1, \ldots , d} |B_j| \le \frac{w_{\mathrm{min}}}{16}\) with high probability. Combined with the bound (34a) from Lemma 2, we are guaranteed that

$$\begin{aligned} \min _{j \in S} |U_j|&\ge \frac{w_{\mathrm{min}}}{4} - \frac{w_{\mathrm{min}}}{16} \; = \; \frac{3 w_{\mathrm{min}}}{16} \qquad \text{with high probability}. \end{aligned}$$

Similarly, the bound (34b) guarantees that

$$\begin{aligned} \max _{j \in S^c} |U_j|&\le \frac{w_{\mathrm{min}}}{16} + \frac{w_{\mathrm{min}}}{16} = \frac{2 w_{\mathrm{min}}}{16} \qquad \text{also with high probability}. \end{aligned}$$

Thus, setting \(\lambda = \frac{5 w_{\mathrm{min}}}{32}\) ensures that the condition (32) holds.

The only remaining detail is to prove the two lemmas.

Proof of Lemma 1

Define the event \(\mathcal {E}_j = \{ \Vert X_j\Vert _2/\sqrt{n} \le 2 \}\), and observe that

$$\begin{aligned} \mathbb {P}\big [ |B_j| > t \big ]&\le \mathbb {P}[|B_j| > t \mid \mathcal {E}_j] + \mathbb {P}[\mathcal {E}_j^c]. \end{aligned}$$

Since the variable \(\Vert X_j\Vert _2^2\) follows a \(\chi ^2\)-distribution with \(n\) degrees of freedom, we have \(\mathbb {P}\big [ \mathcal {E}_j^c \big ] \le 2 e^{-c_2 n}\). Recalling the definition (20) of the matrix \(M\), note that \(\sigma _{\mathrm{max}}\big (M\big ) \le \rho ^{-1}\), whence conditioned on \(\mathcal {E}_j\), we have \(\Vert MX_j\Vert _2 \le \Vert X_j\Vert _2 \le 2\sqrt{n}\). Consequently, conditioned on \(\mathcal {E}_j\), the variable \(\frac{X_j^T M\varepsilon }{\rho }\) is a Gaussian random variable with variance at most \(4 \gamma ^2/\rho ^2\), and hence \(\mathbb {P}[|B_j| > t \mid \mathcal {E}_j] \le 2 e^{-\frac{ \rho ^2 t^2}{32 \gamma ^2}}\).

Finally, by the union bound, we have

$$\begin{aligned} \mathbb {P}\big [ \max _{j = 1, \ldots , d} |B_j| > t \big ]&\le d\, \mathbb {P}\big [ |B_j| > t \big ] \; \le \; d\Big \{ 2 e^{-\frac{\rho ^2 t^2}{32 \gamma ^2}} + 2 e^{-c_2 n} \Big \} \; \le \; c_1 e^{-c_2 \frac{\rho ^2 t^2}{\gamma ^2} + \log d}, \end{aligned}$$

as claimed. \(\square \)
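
As a quick numerical sanity check of the event \(\mathcal {E}_j\) used above, the following sketch (illustrative only; it assumes \(X_j\) has i.i.d. \(N(0,1)\) entries, consistent with the \(\chi ^2_n\) claim, and the values of \(n\) and the trial count are arbitrary) estimates \(\mathbb {P}[\Vert X_j\Vert _2/\sqrt{n} > 2]\) by simulation.

```python
import numpy as np

# Monte Carlo check that ||X_j||_2 / sqrt(n) rarely exceeds 2 when X_j ~ N(0, I_n).
rng = np.random.default_rng(0)
n, trials = 200, 20000
norms = np.linalg.norm(rng.standard_normal((trials, n)), axis=1) / np.sqrt(n)
print("empirical P[||X_j||_2 / sqrt(n) > 2]:", np.mean(norms > 2.0))  # essentially zero
```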

Proof of Lemma 2

We split the proof into two parts.

(1) Proof of the bound (34a):

Note that

$$\begin{aligned} \frac{1}{\rho }X_S^T MX_S&= X_S^T(\rho I_n+X_SX_S^T)^{-1}X_S. \end{aligned}$$

We now write \(UDV^T\) for the compact singular value decomposition of \(\frac{1}{\sqrt{n}}X_S\). We thus have

$$\begin{aligned} \frac{1}{\rho } X_S^T MX_S&= V \big ( \rho I_n + nD^2 \big )^{-1} D^2 V^T. \end{aligned}$$

We will prove that for a fixed vector \(z\), the following holds with high probability

$$\begin{aligned} \frac{\Vert \left( \frac{1}{\rho }X_S^T M X_S-I\right) z\Vert _\infty }{\Vert z\Vert _\infty } \le \epsilon . \end{aligned}$$
(35)

Applying the above bound to \(w_S^*\), which is a fixed vector, we obtain

$$\begin{aligned} \Vert \left( \frac{1}{\rho }X_S^T M X_S-I\right) w_S^*\Vert _\infty \le \epsilon \Vert w_S^*\Vert _\infty \end{aligned}$$
(36)

Then, by the triangle inequality, the above statement implies that

$$\begin{aligned} \min _{i \in S} \Big | \Big ( \frac{1}{\rho }X_S^T M X_S w_S^*\Big )_i \Big | > (1-\epsilon ) \min _{i\in S} |w_i^*|, \end{aligned}$$

and setting \(\epsilon =3/4\) yields the claim.

Next we let \(\frac{1}{\rho }X_S^T M X_S-I = V \tilde{D} V^T\), where we define \(\tilde{D} := \left( (\rho I_n+D^2)^{-1}D^2-I\right) \). By standard results on the operator norms of Gaussian random matrices (e.g., see Davidson and Szarek [10]), the minimum singular value

$$\begin{aligned} \sigma _{\min }\left( \frac{1}{\sqrt{n}}\mathrm{{X}}_S\right)&= \min _{i=1, \ldots , k} D_{ii} \end{aligned}$$

of the matrix \(\mathrm{{X}}_S/\sqrt{n}\) can be bounded as

$$\begin{aligned} \mathbb {P}\left[ \frac{1}{\sqrt{n}} \min _{i=1,\ldots ,k} |D_{ii}| \le 1 - \sqrt{\frac{k}{n}} - t \right]&\le 2 e^{-c_1 nt^2}, \end{aligned}$$
(37)

where \(c_1\) is a numerical constant (independent of \((n, k)\)).
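
The tail bound (37) is easy to probe numerically. The sketch below is illustrative only: it assumes \(X_S\) has i.i.d. standard Gaussian entries, and the values of \(n\), \(k\) and \(t\) are arbitrary choices, not taken from the paper.

```python
import numpy as np

# Empirical check of the bound (37) on sigma_min(X_S / sqrt(n)) for a Gaussian design.
rng = np.random.default_rng(0)
n, k, t, trials = 500, 20, 0.1, 200
below = 0
for _ in range(trials):
    XS = rng.standard_normal((n, k))                       # i.i.d. N(0,1) entries
    smin = np.linalg.svd(XS / np.sqrt(n), compute_uv=False).min()
    below += smin <= 1.0 - np.sqrt(k / n) - t
print("empirical P[sigma_min <= 1 - sqrt(k/n) - t]:", below / trials)  # should be tiny
```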

Now define \(Y_i := e_i^T V \tilde{D} V^T z = z_i v_i^T \tilde{D} v_i + v_i^T \tilde{D} \sum _{l\ne i} z_l v_l\), where \(v_i := V^T e_i\). Then note that

$$\begin{aligned} |Y_1|&\le \Vert \tilde{D}\Vert _2 |z_1| + v_1^T \tilde{D} \sum _{l\ne 1} z_l v_l = \frac{\rho }{\rho + \min _{i=1, \ldots ,k} |D_{ii}|^2} |z_1| + F(v_1), \end{aligned}$$

where we define \(F(v_1) := v_1^T \tilde{D} \sum _{l\ne 1} z_l v_l\). The vector \(v_1\) is uniformly distributed over the unit sphere \(\mathbb {S}^{k-1}\), and hence \(\mathbb {E}[F(v_1)] = 0\). Observe that \(F\) is a Lipschitz map satisfying

$$\begin{aligned} |F(v_1)-F(v_1')|&\le \Vert \tilde{D}\Vert _{2} \sqrt{\sum _{l\ne 1} z_l^2}\, \Vert v_1-v_1'\Vert _2 \\&\le \frac{\rho }{\rho + \min _{i} |D_{ii}|^2} \sqrt{k-1}\, \Vert z\Vert _\infty \Vert v_1-v_1'\Vert _2. \end{aligned}$$

Applying concentration of measure for Lipschitz functions on the sphere (e.g., see [18]) to the function \(F(v_1)\), we find that for all \(t>0\),

$$\begin{aligned} \mathbb {P}\big [ F(v_1) > t\Vert z\Vert _\infty \big ]&\le 2 e^{ -c_4(k-1) \frac{t^2}{\big (\frac{\rho }{\rho + \min _{i} |D_{ii}|^2} \big )^2 (k-1)}}. \end{aligned}$$
(38)
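
The concentration step behind (38) can be visualized with a generic Lipschitz (in fact, linear) function of a uniformly random point on the sphere. The sketch below is a standalone illustration of that phenomenon; the fixed vector \(a\) is an arbitrary stand-in and is not the specific \(F(v_1)\) above.

```python
import numpy as np

# For v uniform on the unit sphere in R^k and a fixed vector a, the L-Lipschitz function
# f(v) = a^T v (with L = ||a||_2) has mean 0 and concentrates at scale L / sqrt(k).
rng = np.random.default_rng(0)
k, trials = 200, 20000
a = rng.standard_normal(k)
V = rng.standard_normal((trials, k))
V /= np.linalg.norm(V, axis=1, keepdims=True)              # uniform points on the sphere
f = V @ a
print("std of f(v):", f.std(), "  L / sqrt(k):", np.linalg.norm(a) / np.sqrt(k))
```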

Conditioning on the high probability event \(\{\min _{i} |D_{ii}|^2 \ge \frac{n}{2}\}\) and then applying the tail bound (37) yields

$$\begin{aligned} \mathbb {P}\big [ F(v_1) > t\Vert z\Vert _\infty \big ]&\le 2 \exp {\left( -c_4 \frac{n^2t^2}{\rho ^2} \right) } + 2 e^{-c_2 \frac{nt^2}{\rho ^2}} \nonumber \\&\le 4 e^{-c_5 \frac{n^2 t^2}{\rho ^2}} \,. \end{aligned}$$
(39)

Combining the bounds (38) and (39) and taking a union bound over the \(2k\) coordinates, we obtain

$$\begin{aligned} \mathbb {P}\left[ \max _{j\in S} |Y_j| > t \Vert z\Vert _\infty \right]&\le 2k \cdot 3\exp {\left( -c_5 n^2 t^2/\rho ^2 \right) } \\&\le 2k \cdot 3 \exp {\left( -c_5 nt^2 \right) }\,, \end{aligned}$$

where the final line follows from our choice \(\rho = \sqrt{n}\). Finally, setting \(t=\epsilon \), we obtain the statement in (35) and hence complete the proof.

Proof of the bound (34b): A similar calculation yields

$$\begin{aligned} A_j = \frac{1}{\rho }X_{j}^T MX_{S}w_S^*&= X_{j}^T\big ( \rho I_{n} + X_SX_S^T \big )^{-1} X_S w_S^*\,, \end{aligned}$$

for each \(j \in S^c\).

Defining the event \(\mathcal {E}= \{ \sigma _{\mathrm{max}}\big (X_S\big ) \le 2 \sqrt{n} \}\), standard bounds in random matrix theory [10] imply that \(\mathbb {P}[\mathcal {E}^c] \le 2 e^{-c_2 n}\). Conditioned on \(\mathcal {E}\), we have

$$\begin{aligned} \Vert \big ( \rho I_{n} + X_SX_S^T \big )^{-1} X_S w^*_S\Vert _2 \le \frac{2}{\rho } \Vert w^*_S\Vert _2, \end{aligned}$$

so that the variable \(A_j\) is conditionally Gaussian with variance at most \(\frac{4}{\rho ^2} \Vert w^*_S\Vert _2^2\). Consequently, we have

$$\begin{aligned} \mathbb {P}[|A_j| \ge t]&\le \mathbb {P}[|A_j| \ge t \mid \mathcal {E}] + \mathbb {P}[\mathcal {E}^c] \; \le \; 2 e^{- \frac{\rho ^2 t^2}{32 \Vert w^*_S\Vert _2^2}} + 2 e^{-c_2 n} \le c_1 e^{-c_2 \frac{\rho ^2 t^2}{\Vert w^*_S\Vert _2^2}}. \end{aligned}$$

Setting \(t = \frac{w_{\mathrm{min}}}{8}\), \(\rho = \sqrt{n}\) and taking a union bound over all \(d- k\) indices in \(S^c\) yields the claim (34b). \(\square \)

1.2 Proof of Theorem 3

The vector \(\widetilde{u}\in \{0,1\}^d\) consists of independent Bernoulli trials, and we have \(\mathbb {E}[\sum _{j=1}^d\widetilde{u}_j] \le k\). Consequently, by the Chernoff bound for Bernoulli sums, we have

$$\begin{aligned} \mathbb {P}\Big [ \sum _{j=1}^d\widetilde{u}_j \ge (1 + \delta ) k\Big ]&\le c_1 e^{-c_2 k\delta ^2}, \end{aligned}$$

as claimed.
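
The following sketch illustrates this Chernoff step by direct simulation. The fractional vector \(\widehat{u}\) below is hypothetical (any vector in \([0,1]^d\) with \(\sum _j \widehat{u}_j \le k\) would do) rather than the output of the actual relaxation, and the dimensions are arbitrary.

```python
import numpy as np

# Round a fractional u_hat (sum <= k) to a Boolean u_tilde with independent Bernoulli(u_hat_j)
# draws, and estimate how often the rounded cardinality exceeds (1 + delta) * k.
rng = np.random.default_rng(0)
d, k, delta, trials = 1000, 30, 0.5, 5000
u_hat = np.minimum(rng.dirichlet(np.ones(d)) * k, 1.0)     # hypothetical fractional solution
exceed = 0
for _ in range(trials):
    u_tilde = rng.random(d) < u_hat                        # independent Bernoulli rounding
    exceed += u_tilde.sum() >= (1.0 + delta) * k
print("empirical P[sum u_tilde >= (1+delta)k]:", exceed / trials)  # small, as the bound predicts
```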

It remains to establish the high-probability bound on the optimal value. As shown previously, the Boolean problem admits the saddle point representation

$$\begin{aligned} P^*&= \min _{u \in \{0,1\}^d,~\sum _{i=1}^du_i \le k} \Big \{ \underbrace{\max _{\alpha \in \mathbb {R}^n} -\frac{1}{\rho } \alpha ^T \mathrm{{X}}D(u)\mathrm{{X}}^T\alpha - \Vert \alpha \Vert ^2_2 - 2 \alpha ^T y}_{G(u)} \Big \}. \end{aligned}$$
(40)
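
For any fixed \(u\), the inner maximum in (40) can be evaluated in closed form: writing \(Q(u) = I_n + \frac{1}{\rho }\mathrm{{X}}D(u)\mathrm{{X}}^T\), the maximizing \(\alpha \) is \(-Q(u)^{-1}y\), so that \(G(u) = y^T Q(u)^{-1} y\). The sketch below uses this observation to solve the interval relaxation \(u \in [0,1]^d\), \(\sum _{i=1}^d u_i \le k\), whose optimal value lower bounds \(P^*\). It is only a minimal illustration: the synthetic data, the choice \(\rho = \sqrt{n}\) for this instance, and the CVXPY formulation via the matrix_frac atom are our own assumptions, not necessarily the implementation used in the paper.

```python
import numpy as np
import cvxpy as cp

# Interval relaxation of the Boolean program behind (40): minimize
# G(u) = y^T (I + (1/rho) X D(u) X^T)^{-1} y over u in [0,1]^d with sum(u) <= k.
rng = np.random.default_rng(0)
n, d, k = 40, 80, 5
X = rng.standard_normal((n, d)) / np.sqrt(n)
w_true = np.zeros(d)
w_true[:k] = 1.0
y = X @ w_true + 0.05 * rng.standard_normal(n)
y /= np.linalg.norm(y)                                     # normalize so that ||y||_2 <= 1
rho = np.sqrt(n)

u = cp.Variable(d)
Q = np.eye(n) + (1.0 / rho) * X @ cp.diag(u) @ X.T          # affine in u
prob = cp.Problem(cp.Minimize(cp.matrix_frac(y, Q)),
                  [u >= 0, u <= 1, cp.sum(u) <= k])
prob.solve(solver=cp.SCS)
print("lower bound on P*:", prob.value)
print("indices with largest relaxed weights:", np.argsort(u.value)[-k:])
```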

Since the optimal value is non-negative, the optimal dual parameter \(\alpha \in \mathbb {R}^n\) must have its \(\ell _2\)-norm bounded as \(\Vert \alpha \Vert _2\le 2 \Vert y\Vert _2\le 2\). Using this fact, we have

$$\begin{aligned} G(\widehat{u}) - G(\widetilde{u}) =&\max _{\Vert \alpha \Vert _2 \le 2} \Big \{ -\frac{1}{\rho } \alpha ^T \mathrm{{X}}D(\widehat{u})\mathrm{{X}}^T\alpha - \Vert \alpha \Vert ^2_2 - 2 \alpha ^T y \Big \} \\&\quad - \max _{\Vert \alpha \Vert _2 \le 2} \Big \{ -\frac{1}{\rho } \alpha ^T \mathrm{{X}}D(\widetilde{u})\mathrm{{X}}^T\alpha - \Vert \alpha \Vert ^2_2 - 2 \alpha ^T y \Big \} \\ \le&\max _{\Vert \alpha \Vert _2 \le 2} \Big \{ -\frac{1}{\rho } \alpha ^T \mathrm{{X}}(D(\widehat{u})-D(\widetilde{u}))\mathrm{{X}}^T\alpha \Big \}\\ \le&\frac{2}{\rho } \sigma _{\mathrm{max}}\big (\mathrm{{X}}(D(\widehat{u})-D(\widetilde{u}))\mathrm{{X}}^T\big ), \end{aligned}$$

where \(\sigma _{\mathrm{max}}\big (\cdot \big )\) denotes the maximum eigenvalue of a symmetric matrix.

It remains to establish a high probability bound on this maximum eigenvalue. Recall that \(R\) is the subset of indices associated with fractional elements of \(\widehat{u}\), and moreover that \(\mathbb {E}[\widetilde{u}_j] = \widehat{u}_j\). Using these facts, we can write

$$\begin{aligned} \mathrm{{X}}(D(\widetilde{u}) - D(\widehat{u}))\mathrm{{X}}^T = \sum _{j \in R} \underbrace{\big ( \widetilde{u}_j - \mathbb {E}[\widetilde{u}_j] \big ) X_j X_j^T}_{A_j} \end{aligned}$$

where \(X_j \in \mathbb {R}^{n}\) denotes the \(j\)th column of \(\mathrm{{X}}\). Since \(\Vert X_j\Vert _2 \le 1\) by assumption and \(\widetilde{u}_j\) is Bernoulli, the matrix \(A_j\) has operator norm at most 1, and is zero mean. Consequently, by the Ahlswede-Winter matrix bound [1, 26], we have

$$\begin{aligned} \mathbb {P}\Big [ \sigma _{\mathrm{max}}\big (\sum _{j \in R} A_j\big ) \ge \sqrt{r} t \Big ]&\le 2 \min \{ n, r\} e^{-t^2/16}, \end{aligned}$$

where \(r= |R|\) is the number of fractional components. Setting \(t^2 = c \log \min \{n, r\}\) for a sufficiently large constant \(c\) yields the claim.
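
To get a feel for the scale of this bound, the sketch below simulates \(\sigma _{\mathrm{max}}\big (\sum _{j \in R} A_j\big )\) for columns normalized so that \(\Vert X_j\Vert _2 \le 1\) and a hypothetical fractional \(\widehat{u}\) on \(R\); the observed values sit well below the level \(\sqrt{r \log \min \{n, r\}}\), in line with the high-probability statement. The design, the fractional vector, and the dimensions are illustrative assumptions.

```python
import numpy as np

# Simulate sigma_max( sum_j (u_tilde_j - u_hat_j) X_j X_j^T ) for unit-norm columns X_j
# and compare with the sqrt(r * log(min(n, r))) scale suggested by the Ahlswede-Winter bound.
rng = np.random.default_rng(0)
n, r, trials = 100, 400, 200
X = rng.standard_normal((n, r))
X /= np.linalg.norm(X, axis=0)                             # enforce ||X_j||_2 <= 1
u_hat = rng.uniform(0.2, 0.8, size=r)                      # hypothetical fractional entries on R
vals = []
for _ in range(trials):
    u_tilde = (rng.random(r) < u_hat).astype(float)
    A = X @ np.diag(u_tilde - u_hat) @ X.T                 # sum of the zero-mean terms A_j
    vals.append(np.linalg.eigvalsh(A).max())
print("typical sigma_max:", float(np.mean(vals)))
print("sqrt(r log min(n, r)):", float(np.sqrt(r * np.log(min(n, r)))))
```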

Cite this article

Pilanci, M., Wainwright, M.J. & El Ghaoui, L. Sparse learning via Boolean relaxations. Math. Program. 151, 63–87 (2015). https://doi.org/10.1007/s10107-015-0894-1