
E-ENDPP: a safe feature selection rule for speeding up Elastic Net


Abstract

Lasso is a popular regression model that performs automatic variable selection and continuous shrinkage simultaneously. The Elastic Net is a corrective variant of the Lasso that selects groups of correlated variables, and it is particularly useful when the number of features p is much larger than the number of observations n. However, training the Elastic Net efficiently on high-dimensional data remains a challenge. In this paper, we therefore propose a new safe screening rule, called E-ENDPP, for the Elastic Net problem, which identifies inactive features prior to training. These inactive features (predictors) can then be removed to reduce the size of the problem and accelerate training. Since E-ENDPP is derived from the optimality conditions of the model, it is theoretically guaranteed to yield solutions identical to those of the original model. Simulation studies and real-data examples show that the proposed E-ENDPP substantially accelerates the training of the Elastic Net without affecting its accuracy.


References

1. Hastie T, Tibshirani R, Friedman J, Franklin J (2005) The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer 27(2):83–85
2. Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H (2017) Feature selection: a data perspective. ACM Computing Surveys 50(6):1–45
3. Bühlmann P, Kalisch M, Meier L (2014) High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application 1(1):255–278
4. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, New York
5. Bondell H, Reich B (2010) Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR. Biometrics 64(1):115–123
6. Xu Y, Zhong P, Wang L (2010) Support vector machine-based embedded approach feature selection algorithm. Journal of Information and Computational Science 7(5):1155–1163
7. Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58(1):267–288
8. Zhao P, Yu B (2006) On model selection consistency of lasso. J Mach Learn Res 7(12):2541–2563
9. Candès E (2006) Compressive sampling. In: Proceedings of the International Congress of Mathematicians
10. Chen S, Donoho D, Saunders M (2001) Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 58(1):33–61
11. Wright J, Ma Y, Mairal J, Sapiro G, Huang T, Yan S (2010) Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE 98(6):1031–1044
12. Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
13. Kim S, Koh K, Lustig M, Boyd S, Gorinevsky D (2008) An interior-point method for large-scale l1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing 1(4):606–617
14. Friedman J, Hastie T, Höfling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1(2):302–332
15. Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
16. Park M, Hastie T (2007) L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B 69(4):659–677
17. Donoho D, Tsaig Y (2008) Fast solution of l1-norm minimization problems when the solution may be sparse. IEEE Trans Inf Theory 54(11):4789–4812
18. El Ghaoui L, Viallon V, Rabbani T (2012) Safe feature elimination in sparse supervised learning. Pacific Journal of Optimization 8(4):667–698
19. Pan X, Yang Z, Xu Y, Wang L (2018) Safe screening rules for accelerating twin support vector machine classification. IEEE Transactions on Neural Networks and Learning Systems 29(5):1876–1887
20. Xiang Z, Ramadge P (2012) Fast lasso screening tests based on correlations. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 2137–2140
21. Xiang Z, Xu H, Ramadge P (2011) Learning sparse representations of high dimensional data on large scale dictionaries. In: Advances in Neural Information Processing Systems 24:900–908
22. Tibshirani R, Bien J, Friedman J, Hastie T, Simon N (2012) Strong rules for discarding predictors in lasso-type problems. J R Stat Soc Ser B 74(2):245–266
23. Wang J, Wonka P, Ye J (2015) Lasso screening rules via dual polytope projection. J Mach Learn Res 16(1):1063–1101
24. Bruckstein A, Donoho D, Elad M (2009) From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev 51(1):34–81
25. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67(2):301–320
26. Hastie T, Tibshirani R (2009) The elements of statistical learning. Technometrics 45(3):267–268
27. Hoerl A, Kennard R (1988) Ridge regression. In: Encyclopedia of Statistical Sciences, vol 8, pp 129–136. Wiley, New York
28. Breiman L (1996) Heuristics of instability in model selection. The Annals of Statistics 24
29. Bertsekas D (2003) Convex analysis and optimization. Athena Scientific
30. Bauschke H, Combettes P (2011) Convex analysis and monotone operator theory in Hilbert spaces. Springer, New York
31. Johnson T, Guestrin C (2015) BLITZ: a principled meta-algorithm for scaling sparse optimization. In: Proceedings of the International Conference on Machine Learning, pp 1171–1179
32. He X, Cai D, Niyogi P (2005) Laplacian score for feature selection. In: Advances in Neural Information Processing Systems 18:507–514


Acknowledgments

The authors gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation. This work was supported in part by the Beijing Natural Science Foundation (No. 4172035) and National Natural Science Foundation of China (No. 11671010).

Corresponding author

Correspondence to Yitian Xu.

Appendices

Appendix A: Proof of Lemma 1

First, we restate problem (3):

$$\begin{array}{@{}rcl@{}} \underset{\beta\in R^{p}}{min}\frac{1}{2}\|y-X\beta\|_{2}^{2}+\frac{\gamma}{2}\|\beta\|_{2}^{2}+\lambda\|\beta\|_{1}. \end{array} $$

For the above problem, let

$$\begin{array}{@{}rcl@{}} \bar{Y}=\left( \begin{array}{l} y \\ 0\end{array} \right) , \bar{X}=\left( \begin{array}{l} X \\ \sqrt{\gamma}I \end{array} \right). \end{array} $$

Then we obtain

$$\begin{array}{@{}rcl@{}} &&\frac{1}{2}\|y-X\beta\|_{2}^{2}+\frac{\gamma}{2}\|\beta\|_{2}^{2}+\lambda \|\beta\|_{1}\\ &&=\frac{1}{2}(y-X\beta)^{T}(y-X\beta)+\frac{\gamma}{2}\beta^{T}\beta+\lambda\|\beta\|_{1}\\ &&=\frac{1}{2}y^{T}y-y^{T}X\beta+\frac{1}{2}\beta^{T}(X^{T}X+\gamma I)\beta+\lambda \|\beta\|_{1}\\ &&=\frac{1}{2}(y^{T}~~0)\left( \begin{array}{l}y \\0 \end{array}\right)-(y^{T}~~0) \left( \begin{array}{l}X \\ \sqrt{\gamma}I \end{array} \right) \beta+\frac{1}{2}\beta^{T}(X^{T}~~\sqrt{\gamma}I)\left( \begin{array}{l}X \\ \sqrt{\gamma}I \end{array} \right) \beta+\lambda\|\beta\|_{1}\\ &&=\frac{1}{2}\bar{Y}^{T}\bar{Y}+\frac{1}{2}\beta^{T}\bar{X}^{T}\bar{X}\beta-\bar{Y}^{T}\bar{X}\beta+\lambda\|\beta\|_{1}\\ &&=\frac{1}{2}\|\bar{Y}-\bar{X}\beta\|_{2}^{2}+\lambda\|\beta\|_{1}, \end{array} $$
(24)

therefore, problem (3) can be rewritten as:

$$\begin{array}{@{}rcl@{}} \underset{\beta\in R^{p}}{min}~~\frac{1}{2}\|\bar{Y}-\bar{X}\beta\|_{2}^{2}+\lambda\|\beta\|_{1}. \end{array} $$

This completes the proof of Lemma 1.
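
As a minimal numerical sanity check of Lemma 1 (this sketch is not part of the paper; NumPy and all sizes and parameter values below are my own illustrative choices), the following code builds the augmented data \(\bar{Y}\), \(\bar{X}\) and confirms that the Elastic Net objective of problem (3) and the Lasso objective on the augmented data agree for an arbitrary \(\beta\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # illustrative sizes
gamma, lam = 0.5, 0.1              # illustrative values of gamma and lambda

X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)      # an arbitrary coefficient vector

# Elastic Net objective of problem (3): 1/2||y - X b||^2 + gamma/2||b||^2 + lam||b||_1
enet = (0.5 * np.sum((y - X @ beta) ** 2)
        + 0.5 * gamma * np.sum(beta ** 2)
        + lam * np.sum(np.abs(beta)))

# Augmented data of Lemma 1: Y_bar = [y; 0], X_bar = [X; sqrt(gamma) * I]
Y_bar = np.concatenate([y, np.zeros(p)])
X_bar = np.vstack([X, np.sqrt(gamma) * np.eye(p)])

# Lasso objective on the augmented data
lasso = 0.5 * np.sum((Y_bar - X_bar @ beta) ** 2) + lam * np.sum(np.abs(beta))

print(np.isclose(enet, lasso))     # expected: True for any beta
```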

Appendix B: The dual problem of the Elastic Net

For problem (2), following Boyd and Vandenberghe [4], we introduce a new variable \(w=y-X\beta \), so that problem (2) becomes

$$\begin{array}{@{}rcl@{}} \underset{w, \beta}{min}~~&&\frac{1}{2}\|w\|_{2}^{2}+\lambda\|\beta\|_{1},\\ \text{s.t.} ~~&& w=y-X\beta. \end{array} $$
(25)

By introducing the multipliers \(\eta \in R^{n}\), we get the Lagrangian function as follows:

$$\begin{array}{@{}rcl@{}} L(\beta,w,\eta)=\frac{1}{2}\|w\|_{2}^{2}+\lambda\|\beta\|_{1}+\eta^{T}(y-X\beta-w). \end{array} $$

The dual function \(g(\eta )\) is

$$\begin{array}{@{}rcl@{}} g(\eta)&=&\underset{\beta,w }{inf}L(\beta,w,\eta)=\eta^{T}y+\underset{w}{inf} (\frac{1}{2}\|w\|_{2}^{2} -\eta^{T}w) \\ &+&\underset{\beta}{inf} (-\eta^{T}X\beta+\lambda\|\beta\|_{1}). \end{array} $$
(26)

Note that the right-hand side of (26) consists of three terms. To obtain \(g(\eta )\), we need to solve the following two optimization problems:

$$\begin{array}{@{}rcl@{}} \underset{w}{inf} &&\frac{1}{2}\|w\|_{2}^{2} -\eta^{T}w, \end{array} $$
(27)
$$\begin{array}{@{}rcl@{}} \underset{\beta}{inf} &&-\eta^{T}X\beta+\lambda\|\beta\|_{1}. \end{array} $$
(28)

Let us first consider (27). Denote the objective function as

$$\begin{array}{@{}rcl@{}} f_{1}(w)=\frac{1}{2}\|w\|_{2}^{2} -\eta^{T}w, \end{array} $$

and set its gradient to zero:

$$\begin{array}{@{}rcl@{}} \frac{\partial f_{1}(w)}{\partial w}=w-\eta= 0\Rightarrow w=\eta, \end{array} $$

so

$$\begin{array}{@{}rcl@{}} \underset{w}{inf} f_{1}(w)=-\frac{1}{2}\eta^{T}\eta=-\frac{1}{2}\|\eta\|^{2}_{2}. \end{array} $$

Next, consider problem (28) and denote its objective function as

$$\begin{array}{@{}rcl@{}} f_{2}(\beta)=-\eta^{T}X\beta+\lambda\|\beta\|_{1}, \end{array} $$

\(f_{2}(\beta)\) is convex but not smooth, so we consider its subgradient:

$$\begin{array}{@{}rcl@{}} \frac{\partial f_{2}(\beta)}{\partial \beta}=-X^{T}\eta+\lambda\frac{\partial \|\beta\|_{1}}{\partial \beta}= 0, \end{array} $$

where

$$\begin{array}{@{}rcl@{}} \frac{\partial \|\beta\|_{1}}{\partial \beta_{i}}=\left\{\begin{array}{lll} &sign(\beta_{i}),&if~ \beta_{i}\neq0 \\ &[-1,1],&if~ \beta_{i}= 0 \end{array}\right.\quad i = 1,2,{\cdots} p. \end{array} $$

If \(|{X^{T}_{i}}\eta|\leq\lambda\) for every \(i = 1,2,{\cdots} p\), then \(f_{2}(\beta)\geq 0\) for all \(\beta\), with equality at \(\beta = 0\); otherwise \(f_{2}\) is unbounded below. Hence the optimal value of problem (28) is 0, subject to the constraints \(|{X^{T}_{i}}\eta|\leq\lambda,\quad i = 1,2,{\cdots} p\). Combining the results above, we obtain the dual problem:

$$\begin{array}{@{}rcl@{}} \underset{\eta}{max} &&g(\eta)=\eta^{T}y-\frac{1}{2} \|\eta\|_{2}^{2}, \\ \text{s.t.}~~&&|{X^{T}_{i}}\eta|\leq\lambda,\quad i = 1,2,{\cdots} p, \end{array} $$

which, after completing the square (note that \(\eta^{T}y-\frac{1}{2}\|\eta\|_{2}^{2}=\frac{1}{2}\|y\|_{2}^{2}-\frac{1}{2}\|\eta-y\|_{2}^{2}\)), is equivalent to the following optimization problem:

$$\begin{array}{@{}rcl@{}} \underset{\eta}{max} &&\frac{1}{2} \|y\|_{2}^{2}-\frac{1}{2} \|\eta-y\|_{2}^{2},\\ \text{s.t.}~~&& |{X^{T}_{i}}\eta|\leq\lambda,\quad i = 1,2,{\cdots} p. \end{array} $$
(29)

By a simple re-scaling of the dual variable \(\eta \), namely \(\theta =\frac {\eta }{\lambda }\), we obtain the dual problem of (2):

$$\begin{array}{@{}rcl@{}} \underset{\theta}{max} &&\frac{1}{2}\|y\|_{2}^{2}-\frac{\lambda^{2}}{2} \|\theta-\frac{y}{\lambda}\|_{2}^{2},\\ \text{s.t.}~~&&\mid {X_{i}^{T}}\theta\mid\leq1\quad i = 1,2,{\cdots} p, \end{array} $$
(30)

where \(\theta \) is the dual variable. For notational convenience, denote the optimal solution of problem (30) by \(\theta ^{*}(\gamma ,\lambda )\), and denote the optimal solution of problem (2) with parameters \(\gamma \) and \(\lambda \) by \(\beta ^{*}(\gamma ,\lambda )\). Then the KKT conditions are given by:

$$\begin{array}{@{}rcl@{}} y=X\beta^{*}(\gamma,\lambda)+\lambda\theta^{*}(\gamma,\lambda), \end{array} $$
(31)
$$\begin{array}{@{}rcl@{}} &&{X^{T}_{i}}\theta^{*}(\gamma,\lambda)\in \left\{ \begin{array}{lll} &sign([\beta^{*}(\gamma,\lambda)]_{i}),&if~ [\beta^{*}(\gamma,\lambda)]_{i}\neq0 \\ &[-1,1],&if~ [\beta^{*}(\gamma,\lambda)]_{i}= 0 \end{array} \right.\\&& i = 1,2,{\cdots} p. \end{array} $$
(32)
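
As a side remark (this observation is not spelled out in the appendix itself, but it is the property that the screening rule in the main text builds on), the contrapositive of (32) shows that any feature whose dual correlation is strictly below one must be inactive:

$$\begin{array}{@{}rcl@{}} |{X^{T}_{i}}\theta^{*}(\gamma,\lambda)|<1~\Longrightarrow~[\beta^{*}(\gamma,\lambda)]_{i}= 0,\quad i = 1,2,{\cdots} p. \end{array} $$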

Note that \(\bar{Y}=\left( \begin{array}{l} y \\ 0 \end{array} \right)\) and \(\bar{X}=\left( \begin{array}{l} X \\ \sqrt{\gamma}I \end{array} \right)\). Writing the dual variable as \(\theta =\left( \begin{array}{l} \theta _{1} \\ \theta _{2} \end{array} \right)\) with \(\theta_{1}\in R^{n}\) and \(\theta_{2}\in R^{p}\), the dual problem of (3) becomes

$$\begin{array}{@{}rcl@{}} \underset{\theta}{max} &&\frac{1}{2}\|y\|_{2}^{2}-\frac{\lambda^{2}}{2} (\|\theta_{1}-\frac{y}{\lambda}\|_{2}^{2}+\|\theta_{2}\|_{2}^{2}),\\ \text{s.t.}~~&&\mid \bar{X}_{i}^{T}\theta\mid=\mid {x_{i}^{T}}\theta_{1}+\sqrt{\gamma}(\theta_{2})_{i}\mid\leq1,~ i = 1,2,{\cdots} p, \end{array} $$
(33)

This completes the proof of Theorem 1.
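
To make the role of these optimality conditions concrete, the following sketch (not from the paper; the ISTA solver, tolerances, and all parameter values are my own illustrative assumptions) solves the augmented Lasso form of problem (3), recovers the dual optimum via the KKT condition (31) applied to \(\bar{X}\), \(\bar{Y}\), and checks that every coordinate with \(|\bar{X}_{i}^{T}\theta^{*}|<1\) is zero in \(\beta^{*}\), which is the property a safe rule such as E-ENDPP exploits.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 100                         # illustrative sizes
gamma, lam = 1.0, 2.0                  # illustrative values of gamma and lambda

X = rng.standard_normal((n, p))
y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(n)

# Augmented Lasso form of problem (3), as in Lemma 1
Y_bar = np.concatenate([y, np.zeros(p)])
X_bar = np.vstack([X, np.sqrt(gamma) * np.eye(p)])

# Solve min_b 1/2||Y_bar - X_bar b||^2 + lam||b||_1 by proximal gradient (ISTA)
L = np.linalg.norm(X_bar, 2) ** 2      # Lipschitz constant of the smooth part
beta = np.zeros(p)
for _ in range(20000):
    z = beta - X_bar.T @ (X_bar @ beta - Y_bar) / L
    beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding

# Dual optimum from the KKT condition (31) applied to the augmented problem
theta = (Y_bar - X_bar @ beta) / lam
corr = np.abs(X_bar.T @ theta)

# KKT condition (32): nonzero coordinates sit on the constraint boundary (|.| = 1),
# and any coordinate with |X_bar_i^T theta*| strictly below 1 must be zero.
active = np.abs(beta) > 1e-8
print(np.allclose(corr[active], 1.0, atol=1e-4))        # expected: True
print(np.all(np.abs(beta[corr < 1.0 - 1e-4]) < 1e-8))   # expected: True
```

Of course, a screening rule cannot use \(\theta^{*}\) directly, since it is unknown before training; roughly speaking, E-ENDPP instead bounds \(|\bar{X}_{i}^{T}\theta^{*}|\) over a region that provably contains the dual optimum, so the same test can be applied before solving the problem.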


About this article

Cite this article

Xu, Y., Tian, Y., Pan, X. et al. E-ENDPP: a safe feature selection rule for speeding up Elastic Net. Appl Intell 49, 592–604 (2019). https://doi.org/10.1007/s10489-018-1295-y
