
E-ENDPP: a safe feature selection rule for speeding up Elastic Net


Abstract

Lasso is a popular regression model that performs automatic variable selection and continuous shrinkage simultaneously. The Elastic Net is a corrective variant of the Lasso that selects groups of correlated variables, and it is particularly useful when the number of features p is much larger than the number of observations n. However, training the Elastic Net efficiently on high-dimensional data remains a challenge. In this paper, we therefore propose a new safe screening rule, called E-ENDPP, for the Elastic Net problem, which identifies inactive features prior to training. These inactive features (predictors) can then be removed to reduce the size of the problem and accelerate training. Since E-ENDPP is derived from the optimality conditions of the model, it is theoretically guaranteed to yield solutions identical to those of the original model. Simulation studies and real-data examples show that the proposed E-ENDPP substantially accelerates the training of the Elastic Net without affecting its accuracy.


References

1. Hastie T, Tibshirani R, Friedman J, Franklin J (2005) The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer 27(2):83–85
2. Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H (2017) Feature selection: a data perspective. ACM Computing Surveys 50(6):1–45
3. Bühlmann P, Kalisch M, Meier L (2014) High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application 1(1):255–278
4. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, New York
5. Bondell H, Reich B (2010) Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR. Biometrics 64(1):115–123
6. Xu Y, Zhong P, Wang L (2010) Support vector machine-based embedded approach feature selection algorithm. Journal of Information and Computational Science 7(5):1155–1163
7. Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58(1):267–288
8. Zhao P, Yu B (2006) On model selection consistency of lasso. J Mach Learn Res 7(12):2541–2563
9. Candès E (2006) Compressive sampling. In: Proceedings of the International Congress of Mathematicians
10. Chen S, Donoho D, Saunders M (2001) Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 58(1):33–61
11. Wright J, Ma Y, Mairal J, Sapiro G, Huang T, Yan S (2010) Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE 98(6):1031–1044
12. Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
13. Kim S, Koh K, Lustig M, Boyd S, Gorinevsky D (2008) An interior-point method for large-scale l1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing 1(4):606–617
14. Friedman J, Hastie T, Höfling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1(2):302–332
15. Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
16. Park M, Hastie T (2007) L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B 69(4):659–677
17. Donoho D, Tsaig Y (2008) Fast solution of l1-norm minimization problems when the solution may be sparse. IEEE Trans Inf Theory 54(11):4789–4812
18. El Ghaoui L, Viallon V, Rabbani T (2012) Safe feature elimination in sparse supervised learning. Pacific Journal of Optimization 8(4):667–698
19. Pan X, Yang Z, Xu Y, Wang L (2018) Safe screening rules for accelerating twin support vector machine classification. IEEE Transactions on Neural Networks and Learning Systems 29(5):1876–1887
20. Xiang Z, Ramadge P (2012) Fast lasso screening tests based on correlations. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 2137–2140
21. Xiang Z, Xu H, Ramadge P (2011) Learning sparse representations of high dimensional data on large scale dictionaries. In: Advances in Neural Information Processing Systems 24:900–908
22. Tibshirani R, Bien J, Friedman J, Hastie T, Simon N (2012) Strong rules for discarding predictors in lasso-type problems. J R Stat Soc Ser B 74(2):245–266
23. Wang J, Wonka P, Ye J (2015) Lasso screening rules via dual polytope projection. J Mach Learn Res 16(1):1063–1101
24. Bruckstein A, Donoho D, Elad M (2009) From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev 51(1):34–81
25. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67(2):301–320
26. Hastie T, Tibshirani R (2009) The elements of statistical learning. Technometrics 45(3):267–268
27. Hoerl A, Kennard R (1988) Ridge regression. In: Encyclopedia of Statistical Sciences, vol 8, pp 129–136. Wiley, New York
28. Breiman L (1996) Heuristics of instability in model selection. The Annals of Statistics 24
29. Bertsekas D (2003) Convex analysis and optimization. Athena Scientific
30. Bauschke H, Combettes P (2011) Convex analysis and monotone operator theory in Hilbert spaces. Springer, New York
31. Johnson T, Guestrin C (2015) BLITZ: a principled meta-algorithm for scaling sparse optimization. In: Proceedings of the International Conference on Machine Learning, pp 1171–1179
32. He X, Cai D, Niyogi P (2005) Laplacian score for feature selection. In: Advances in Neural Information Processing Systems 18:507–514


Acknowledgments

The authors gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation. This work was supported in part by the Beijing Natural Science Foundation (No. 4172035) and National Natural Science Foundation of China (No. 11671010).

Corresponding author

Correspondence to Yitian Xu.

Appendices

Appendix A: Proof of Lemma 1

First, we restate problem (3):

$$\begin{array}{@{}rcl@{}} \underset{\beta\in R^{p}}{min}\frac{1}{2}\|y-X\beta\|_{2}^{2}+\frac{\gamma}{2}\|\beta\|_{2}^{2}+\lambda\|\beta\|_{1}. \end{array} $$

For the above problem, let

$$\begin{array}{@{}rcl@{}} \bar{Y}=\left( \begin{array}{l} y \\ 0\end{array} \right) , \bar{X}=\left( \begin{array}{l} X \\ \sqrt{\gamma}I \end{array} \right). \end{array} $$

Then we obtain

$$\begin{array}{@{}rcl@{}} &&\frac{1}{2}\|y-X\beta\|_{2}^{2}+\frac{\gamma}{2}\|\beta\|_{2}^{2}+\lambda \|\beta\|_{1}\\ &&=\frac{1}{2}(y-X\beta)^{T}(y-X\beta)+\frac{\gamma}{2}\beta^{T}\beta+\lambda\|\beta\|_{1}\\ &&=\frac{1}{2}y^{T}y-y^{T}X\beta+\frac{1}{2}\beta^{T}(X^{T}X+\gamma I)\beta+\lambda \|\beta\|_{1}\\ &&=\frac{1}{2}(y^{T}~~0)\left( \begin{array}{l}y \\0 \end{array}\right)-(y^{T}~~0) \left( \begin{array}{l}X \\ \sqrt{\gamma}I \end{array} \right) \beta+\frac{1}{2}\beta^{T}(X^{T}~~\sqrt{\gamma}I)\left( \begin{array}{l}X \\ \sqrt{\gamma}I \end{array} \right) \beta+\lambda\|\beta\|_{1}\\ &&=\frac{1}{2}\bar{Y}^{T}\bar{Y}+\frac{1}{2}\beta^{T}\bar{X}^{T}\bar{X}\beta-\bar{Y}^{T}\bar{X}\beta+\lambda\|\beta\|_{1}\\ &&=\frac{1}{2}\|\bar{Y}-\bar{X}\beta\|_{2}^{2}+\lambda\|\beta\|_{1}, \end{array} $$
(24)

therefore, problem (3) can be rewritten as:

$$\begin{array}{@{}rcl@{}} \underset{\beta\in R^{p}}{min}~~\frac{1}{2}\|\bar{Y}-\bar{X}\beta\|_{2}^{2}+\lambda\|\beta\|_{1}. \end{array} $$

This completes the proof of Lemma 1.
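
As a minimal numerical sanity check of Lemma 1 (this sketch is not part of the paper; NumPy and all sizes and parameter values below are my own illustrative choices), the following code builds the augmented data \(\bar{Y}\), \(\bar{X}\) and confirms that the Elastic Net objective of problem (3) and the Lasso objective on the augmented data agree for an arbitrary \(\beta\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # illustrative sizes
gamma, lam = 0.5, 0.1              # illustrative values of gamma and lambda

X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)      # an arbitrary coefficient vector

# Elastic Net objective of problem (3): 1/2||y - X b||^2 + gamma/2||b||^2 + lam||b||_1
enet = (0.5 * np.sum((y - X @ beta) ** 2)
        + 0.5 * gamma * np.sum(beta ** 2)
        + lam * np.sum(np.abs(beta)))

# Augmented data of Lemma 1: Y_bar = [y; 0], X_bar = [X; sqrt(gamma) * I]
Y_bar = np.concatenate([y, np.zeros(p)])
X_bar = np.vstack([X, np.sqrt(gamma) * np.eye(p)])

# Lasso objective on the augmented data
lasso = 0.5 * np.sum((Y_bar - X_bar @ beta) ** 2) + lam * np.sum(np.abs(beta))

print(np.isclose(enet, lasso))     # expected: True for any beta
```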

Appendix B: The dual problem of the Elastic Net

For problem (2), following Boyd and Vandenberghe [4], we introduce a new variable \(w=y-X\beta \), so that problem (2) becomes

$$\begin{array}{@{}rcl@{}} \underset{w, \beta}{min}~~&&\frac{1}{2}\|w\|_{2}^{2}+\lambda\|\beta\|_{1},\\ \text{s.t.} ~~&& w=y-X\beta. \end{array} $$
(25)

By introducing the multipliers \(\eta \in R^{n}\), we get the Lagrangian function as follows:

$$\begin{array}{@{}rcl@{}} L(\beta,w,\eta)=\frac{1}{2}\|w\|_{2}^{2}+\lambda\|\beta\|_{1}+\eta^{T}(y-X\beta-w). \end{array} $$

The dual function \(g(\eta )\) is

$$\begin{array}{@{}rcl@{}} g(\eta)&=&\underset{\beta,w }{inf}L(\beta,w,\eta)=\eta^{T}y+\underset{w}{inf} (\frac{1}{2}\|w\|_{2}^{2} -\eta^{T}w) \\ &+&\underset{\beta}{inf} (-\eta^{T}X\beta+\lambda\|\beta\|_{1}). \end{array} $$
(26)

Note that the right-hand side of (26) consists of three terms. To obtain \(g(\eta )\), we need to solve the following two optimization problems:

$$\begin{array}{@{}rcl@{}} \underset{w}{inf} &&\frac{1}{2}\|w\|_{2}^{2} -\eta^{T}w, \end{array} $$
(27)
$$\begin{array}{@{}rcl@{}} \underset{\beta}{inf} &&-\eta^{T}X\beta+\lambda\|\beta\|_{1}. \end{array} $$
(28)

Let us first consider (27). Denote the objective function as

$$\begin{array}{@{}rcl@{}} f_{1}(w)=\frac{1}{2}\|w\|_{2}^{2} -\eta^{T}w, \end{array} $$

and set its gradient to zero:

$$\begin{array}{@{}rcl@{}} \frac{\partial f_{1}(w)}{\partial w}=w-\eta= 0\Rightarrow w=\eta, \end{array} $$

so

$$\begin{array}{@{}rcl@{}} \underset{w}{inf} f_{1}(w)=-\frac{1}{2}\eta^{T}\eta=-\frac{1}{2}\|\eta\|^{2}_{2}. \end{array} $$

Next, consider problem (28) and denote its objective function as

$$\begin{array}{@{}rcl@{}} f_{2}(\beta)=-\eta^{T}X\beta+\lambda\|\beta\|_{1}, \end{array} $$

\(f_{2}(\beta)\) is convex but not smooth, so we consider its subgradient:

$$\begin{array}{@{}rcl@{}} \frac{\partial f_{2}(\beta)}{\partial \beta}=-X^{T}\eta+\lambda\frac{\partial \|\beta\|_{1}}{\partial \beta}= 0, \end{array} $$

where

$$\begin{array}{@{}rcl@{}} \frac{\partial \|\beta\|_{1}}{\partial \beta_{i}}=\left\{\begin{array}{lll} &sign(\beta_{i}),&if~ \beta_{i}\neq0 \\ &[-1,1],&if~ \beta_{i}= 0 \end{array}\right.\quad i = 1,2,{\cdots} p. \end{array} $$

If \(|{X^{T}_{i}}\eta|\leq\lambda\) for every \(i = 1,2,{\cdots} p\), then \(f_{2}(\beta)\geq 0\) for all \(\beta\), with equality at \(\beta = 0\); otherwise \(f_{2}\) is unbounded below. Hence the optimal value of problem (28) is 0, subject to the constraints \(|{X^{T}_{i}}\eta|\leq\lambda,\quad i = 1,2,{\cdots} p\). Combining the results above, we obtain the dual problem:

$$\begin{array}{@{}rcl@{}} \underset{\eta}{max} &&g(\eta)=\eta^{T}y-\frac{1}{2} \|\eta\|_{2}^{2}, \\ \text{s.t.}~~&&|{X^{T}_{i}}\eta|\leq\lambda,\quad i = 1,2,{\cdots} p, \end{array} $$

which, after completing the square (note that \(\eta^{T}y-\frac{1}{2}\|\eta\|_{2}^{2}=\frac{1}{2}\|y\|_{2}^{2}-\frac{1}{2}\|\eta-y\|_{2}^{2}\)), is equivalent to the following optimization problem:

$$\begin{array}{@{}rcl@{}} \underset{\eta}{max} &&\frac{1}{2} \|y\|_{2}^{2}-\frac{1}{2} \|\eta-y\|_{2}^{2},\\ \text{s.t.}~~&& |{X^{T}_{i}}\eta|\leq\lambda,\quad i = 1,2,{\cdots} p. \end{array} $$
(29)

By a simple re-scaling of the dual variable \(\eta \), namely \(\theta =\frac {\eta }{\lambda }\), we obtain the dual problem of (2):

$$\begin{array}{@{}rcl@{}} \underset{\theta}{max} &&\frac{1}{2}\|y\|_{2}^{2}-\frac{\lambda^{2}}{2} \|\theta-\frac{y}{\lambda}\|_{2}^{2},\\ \text{s.t.}~~&&\mid {X_{i}^{T}}\theta\mid\leq1\quad i = 1,2,{\cdots} p, \end{array} $$
(30)

where \(\theta \) is the dual variable. For notational convenience, denote the optimal solution of problem (30) by \(\theta ^{*}(\gamma ,\lambda )\), and denote the optimal solution of problem (2) with parameters \(\gamma \) and \(\lambda \) by \(\beta ^{*}(\gamma ,\lambda )\). Then the KKT conditions are given by:

$$\begin{array}{@{}rcl@{}} y=X\beta^{*}(\gamma,\lambda)+\lambda\theta^{*}(\gamma,\lambda), \end{array} $$
(31)
$$\begin{array}{@{}rcl@{}} &&{X^{T}_{i}}\theta^{*}(\gamma,\lambda)\in \left\{ \begin{array}{lll} &sign([\beta^{*}(\gamma,\lambda)]_{i}),&if~ [\beta^{*}(\gamma,\lambda)]_{i}\neq0 \\ &[-1,1],&if~ [\beta^{*}(\gamma,\lambda)]_{i}= 0 \end{array} \right.\\&& i = 1,2,{\cdots} p. \end{array} $$
(32)
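
As a side remark (this observation is not spelled out in the appendix itself, but it is the property that the screening rule in the main text builds on), the contrapositive of (32) shows that any feature whose dual correlation is strictly below one must be inactive:

$$\begin{array}{@{}rcl@{}} |{X^{T}_{i}}\theta^{*}(\gamma,\lambda)|<1~\Longrightarrow~[\beta^{*}(\gamma,\lambda)]_{i}= 0,\quad i = 1,2,{\cdots} p. \end{array} $$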

Note that \(\bar{Y}=\left( \begin{array}{l} y \\ 0 \end{array} \right)\) and \(\bar{X}=\left( \begin{array}{l} X \\ \sqrt{\gamma}I \end{array} \right)\). Writing the dual variable as \(\theta =\left( \begin{array}{l} \theta _{1} \\ \theta _{2} \end{array} \right)\) with \(\theta_{1}\in R^{n}\) and \(\theta_{2}\in R^{p}\), the dual problem of (3) becomes

$$\begin{array}{@{}rcl@{}} \underset{\theta}{max} &&\frac{1}{2}\|y\|_{2}^{2}-\frac{\lambda^{2}}{2} (\|\theta_{1}-\frac{y}{\lambda}\|_{2}^{2}+\|\theta_{2}\|_{2}^{2}),\\ \text{s.t.}~~&&\mid \bar{X}_{i}^{T}\theta\mid=\mid {x_{i}^{T}}\theta_{1}+\sqrt{\gamma}(\theta_{2})_{i}\mid\leq1,~ i = 1,2,{\cdots} p, \end{array} $$
(33)

This completes the proof of Theorem 1.
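
To make the role of these optimality conditions concrete, the following sketch (not from the paper; the ISTA solver, tolerances, and all parameter values are my own illustrative assumptions) solves the augmented Lasso form of problem (3), recovers the dual optimum via the KKT condition (31) applied to \(\bar{X}\), \(\bar{Y}\), and checks that every coordinate with \(|\bar{X}_{i}^{T}\theta^{*}|<1\) is zero in \(\beta^{*}\), which is the property a safe rule such as E-ENDPP exploits.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 100                         # illustrative sizes
gamma, lam = 1.0, 2.0                  # illustrative values of gamma and lambda

X = rng.standard_normal((n, p))
y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(n)

# Augmented Lasso form of problem (3), as in Lemma 1
Y_bar = np.concatenate([y, np.zeros(p)])
X_bar = np.vstack([X, np.sqrt(gamma) * np.eye(p)])

# Solve min_b 1/2||Y_bar - X_bar b||^2 + lam||b||_1 by proximal gradient (ISTA)
L = np.linalg.norm(X_bar, 2) ** 2      # Lipschitz constant of the smooth part
beta = np.zeros(p)
for _ in range(20000):
    z = beta - X_bar.T @ (X_bar @ beta - Y_bar) / L
    beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding

# Dual optimum from the KKT condition (31) applied to the augmented problem
theta = (Y_bar - X_bar @ beta) / lam
corr = np.abs(X_bar.T @ theta)

# KKT condition (32): nonzero coordinates sit on the constraint boundary (|.| = 1),
# and any coordinate with |X_bar_i^T theta*| strictly below 1 must be zero.
active = np.abs(beta) > 1e-8
print(np.allclose(corr[active], 1.0, atol=1e-4))        # expected: True
print(np.all(np.abs(beta[corr < 1.0 - 1e-4]) < 1e-8))   # expected: True
```

Of course, a screening rule cannot use \(\theta^{*}\) directly, since it is unknown before training; roughly speaking, E-ENDPP instead bounds \(|\bar{X}_{i}^{T}\theta^{*}|\) over a region that provably contains the dual optimum, so the same test can be applied before solving the problem.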


About this article

Cite this article

Xu, Y., Tian, Y., Pan, X. et al. E-ENDPP: a safe feature selection rule for speeding up Elastic Net. Appl Intell 49, 592–604 (2019). https://doi.org/10.1007/s10489-018-1295-y
