Abstract
The Lasso is a popular regression model that performs automatic variable selection and continuous shrinkage simultaneously. The Elastic Net is a corrective extension of the Lasso that selects groups of correlated variables, and it is particularly useful when the number of features p is much larger than the number of observations n. However, the training efficiency of the Elastic Net on high-dimensional data remains a challenge. In this paper, we therefore propose a new safe screening rule, called E-ENDPP, for the Elastic Net problem, which identifies inactive features prior to training. These inactive features or predictors can then be removed to reduce the size of the problem and accelerate training. Since E-ENDPP is derived from the optimality conditions of the model, it is guaranteed in theory to give solutions identical to those of the original model. Simulation studies and real-data examples show that the proposed E-ENDPP can substantially accelerate the training of the Elastic Net without affecting its accuracy.
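As a rough illustration of the screen-then-train pattern that such a rule enables, consider the minimal sketch below. The function eendpp_screen is a hypothetical placeholder (the actual E-ENDPP rule is defined in the paper body, not reproduced here), and scikit-learn's ElasticNet stands in for any Elastic Net solver.

import numpy as np
from sklearn.linear_model import ElasticNet

def eendpp_screen(X, y, alpha, l1_ratio):
    # Hypothetical placeholder for the E-ENDPP screening step. A real
    # implementation would use a dual estimate (cf. Theorem 1) to certify
    # which coefficients must be zero at the optimum. Keeping every feature
    # here guarantees the result matches the unscreened solver.
    return np.ones(X.shape[1], dtype=bool)

def fit_with_screening(X, y, alpha=0.1, l1_ratio=0.5):
    keep = eendpp_screen(X, y, alpha, l1_ratio)      # identify candidate active features
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    model.fit(X[:, keep], y)                         # train on the reduced problem
    beta = np.zeros(X.shape[1])
    beta[keep] = model.coef_                         # screened-out features stay at zero
    return beta

# Example in the p >> n regime, where screening is most useful.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2000))
y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(50)
print(np.count_nonzero(fit_with_screening(X, y)))

Note that scikit-learn's parameterization (alpha, l1_ratio) differs from the paper's \((\gamma ,\lambda )\); the sketch only illustrates where the screening step sits in the pipeline.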


References
Hastie T, Tibshirani R, Friedman J, Franklin J (2005) The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer 27(2):83–85
Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H (2017) Feature selection: a data perspective. ACM Computing Surveys 50(6):1–45
Bühlmann P, Kalisch M, Meier L (2014) High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application 1(1):255–278
Boyd S, Vandenberghe L (2004) Convex optimization, Cambridge University Press, New York
Bondell H, Reich B (2010) Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR. Biometrics 64(1):115–123
Xu Y, Zhong P, Wang L (2010) Support vector machine-based embedded approach feature selection algorithm. Journal of Information and Computational Science 7(5):1155–1163
Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B 58(1):267–288
Zhao P, Yu B (2006) On model selection consistency of lasso. J Mach Learn Res 7(12):2541–2563
Candès E (2006) Compressive sampling. In: Proceedings of the International Congress of Mathematicians
Chen S, Donoho D, Saunders M (1998) Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20(1):33–61
Wright J, Ma Y, Mairal J, Sapiro G, Huang T, Yan S (2010) Sparse representation for computer vision and pattern recognition.. In: Proceedings of IEEE, 98(6): pp 1031–1044
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Kim S, Koh K, Lustig M, Boyd S, Gorinevsky D (2007) An interior-point method for large-scale l1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing 1(4):606–617
Friedman J, Hastie T, Höfling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1(2):302–332
Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
Park M, Hastie T (2007) L1-regularized path algorithm for generalized linear models. Journal of the Royal Statistical Society Series B 69(4):659–677
Donoho D, Tsaig Y (2008) Fast solution of l1-norm minimization problems when the solution may be sparse. IEEE Trans Inf Theory 54(11):4789–4812
El Ghaoui L, Viallon V, Rabbani T (2012) Safe feature elimination in sparse supervised learning. Pacific Journal of Optimization 8(4):667–698
Pan X, Yang Z, Xu Y, Wang L (2018) Safe screening rules for accelerating twin support vector machine classification. IEEE Transactions on Neural Networks and Learning Systems 29(5):1876–1887
Xiang Z, Ramadge P (2012) Fast lasso screening tests based on correlations. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 2137–2140
Xiang Z, Xu H, Ramadge P (2011) Learning sparse representations of high dimensional data on large scale dictionaries. International conference on neural information processing systems 24:900–908
Tibshirani R, Bien J, Friedman J, Hastie T, Simon N (2012) Strong rules for discarding predictors in lasso-type problems. J R Stat Soc 74(2):245–266
Wang J, Wonka P, Ye J (2015) Lasso screening rules via dual polytope projection. J Mach Learn Res 16(1):1063–1101
Bruckstein A, Donoho D, Elad M (2009) From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev 51:34–81
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67(2):301–320
Hastie T, Tibshirani R (2009) The elements of statistical learning. Technometrics 45(3):267–268
Hoerl A, Kennard R (1988) Ridge regression. In: Encyclopedia of statistical sciences, 8: 129–136. Wiley, New York
Breiman L (1996) Heuristics of instability and stabilization in model selection. The Annals of Statistics 24(6):2350–2383
Bertsekas D (2003) Convex analysis and optimization. Athena Scientific
Bauschke H, Combettes P (2011) Convex analysis and monotone operator theory in Hilbert spaces. Springer, New York
Johnson T, Guestrin C (2015) BLITZ: a principled meta-algorithm for scaling sparse optimization. In: International Conference on Machine Learning, pp 1171–1179
He X, Cai D, Niyogi P (2005) Laplacian score for feature selection. In: International conference on neural information processing systems 18:507–514
Acknowledgments
The authors gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation. This work was supported in part by the Beijing Natural Science Foundation (No. 4172035) and National Natural Science Foundation of China (No. 11671010).
Appendices
Appendix A: Proof of Lemma 1
First, we restate problem (3):
For the above problem, let
Then we obtain
Therefore, problem (3) can be rewritten as:
This completes the proof of Lemma 1.
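The display equations referenced above are not reproduced in this excerpt. As a hedged sketch of the standard argument (assuming, as the notation in Appendix B suggests, that problem (3) is the Elastic Net objective and that Lemma 1 rewrites it as a Lasso on augmented data), the key identity would be
\[
\bar {Y}=\left (\begin {array}{l} y \\ 0 \end {array} \right ),\qquad
\bar {X}=\left (\begin {array}{l} X \\ \sqrt {\gamma }I \end {array}\right ),
\]
\[
\frac {1}{2}\|y-X\beta \|_{2}^{2}+\frac {\gamma }{2}\|\beta \|_{2}^{2}+\lambda \|\beta \|_{1}
=\frac {1}{2}\|\bar {Y}-\bar {X}\beta \|_{2}^{2}+\lambda \|\beta \|_{1},
\]
so the Elastic Net in \((y,X)\) is a Lasso in \((\bar {Y},\bar {X})\).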
Appendix B: The dual problem of the Elastic Net
For problem (2), following Boyd and Vandenberghe [4], we introduce a new set of variables \(w=y-X\beta \), and problem (2) becomes
By introducing the multipliers \(\eta \in R^{n}\), we get the Lagrangian function as follows:
The dual function \(g(\eta )\) is
Note that the right-hand side of \(g(\eta )\) consists of three terms. To obtain \(g(\eta )\), we need to solve the following two optimization problems.
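The displayed Lagrangian and the two subproblems are not reproduced here. A hedged sketch, assuming the objective at this step has the Lasso form \(\min _{\beta }\frac {1}{2}\|y-X\beta \|_{2}^{2}+\lambda \|\beta \|_{1}\) (with \(y\), \(X\) possibly standing for the augmented \(\bar {Y}\), \(\bar {X}\)), is
\[
L(\beta ,w,\eta )=\frac {1}{2}\|w\|_{2}^{2}+\lambda \|\beta \|_{1}+\eta ^{T}(y-X\beta -w),
\]
\[
g(\eta )=\inf _{\beta ,w}L(\beta ,w,\eta )
=\eta ^{T}y
+\inf _{w}\left (\frac {1}{2}\|w\|_{2}^{2}-\eta ^{T}w\right )
+\inf _{\beta }\left (\lambda \|\beta \|_{1}-\eta ^{T}X\beta \right ),
\]
where the last two infima correspond to the subproblems labelled (27) and (28), respectively.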
Let us first consider (27). Denote the objective function as
then, let
so
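The displayed steps are not reproduced here; a hedged sketch of this minimization under the same assumptions is
\[
f_{1}(w)=\frac {1}{2}\|w\|_{2}^{2}-\eta ^{T}w,\qquad
\nabla f_{1}(w)=w-\eta =0\;\Longrightarrow \;w^{*}=\eta ,\qquad
\inf _{w}f_{1}(w)=-\frac {1}{2}\|\eta \|_{2}^{2}.
\]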
Next, let us consider problem (28). Denote the objective function as
\(f_{2}(\beta )\) is convex but not smooth, so we consider its subgradient
where
According to the necessary condition for \(f_{2}(\beta )\) to attain its optimum, we have \(|{X^{T}_{i}}\eta |\leq \lambda ,\quad i = 1,2,{\cdots }, p\). Therefore, the optimal value of problem (28) is 0. Combining the equations above, we obtain the dual problem:
which is equivalent to the following optimization problem:
By a simple rescaling of the dual variable \(\eta \), namely \(\theta =\frac {\eta }{\lambda }\), we obtain the following result. The dual problem of (2) is:
where \(\theta \) is the dual variable. For notational convenience, let the optimal solution of problem (30) be \(\theta ^{*}(\gamma ,\lambda )\), and let the optimal solution of problem (2) with parameters \(\gamma \) and \(\lambda \) be \(\beta ^{*}(\gamma ,\lambda )\). Then, the KKT conditions are:
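The displayed dual problem (30) and the KKT conditions are not reproduced in this excerpt. A hedged sketch, consistent with the constraint \(|{X^{T}_{i}}\eta |\leq \lambda \) derived above and with the rescaling \(\theta =\frac {\eta }{\lambda }\) (again with \(y\), \(X\) possibly standing for the augmented data), is the projection form familiar from dual polytope projection screening:
\[
\theta ^{*}(\gamma ,\lambda )=\arg \max _{\theta }\left \{\frac {1}{2}\|y\|_{2}^{2}-\frac {\lambda ^{2}}{2}\left \|\theta -\frac {y}{\lambda }\right \|_{2}^{2}\;:\;|{X^{T}_{i}}\theta |\leq 1,\;i=1,\ldots ,p\right \},
\]
with KKT conditions of the form
\[
y=X\beta ^{*}(\gamma ,\lambda )+\lambda \theta ^{*}(\gamma ,\lambda ),\qquad
{X^{T}_{i}}\theta ^{*}(\gamma ,\lambda )\in
\begin{cases}
\{\operatorname {sign}(\beta ^{*}_{i}(\gamma ,\lambda ))\}, & \beta ^{*}_{i}(\gamma ,\lambda )\neq 0,\\
{[-1,1]}, & \beta ^{*}_{i}(\gamma ,\lambda )=0.
\end{cases}
\]
In particular, \(|{X^{T}_{i}}\theta ^{*}(\gamma ,\lambda )|<1\) certifies \(\beta ^{*}_{i}(\gamma ,\lambda )=0\), which is the fact a safe screening rule exploits.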
Note that \(\bar {Y}=\left (\begin {array}{l} y \\ 0 \end {array} \right ) , \bar {X}=\left (\begin {array}{l} X \\ \sqrt {\gamma }I \end {array}\right )\). Let \(\theta =\left (\begin {array}{l} \theta _{1} \\ \theta _{2} \end {array} \right )\); then the dual problem of (3) is as follows:
where \(\theta =\left (\begin {array}{l} \theta _{1} \\ \theta _{2} \end {array} \right )\) is the dual variable. This proves Theorem 1.
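As a sketch of how this substitution would unfold under the same assumptions, with \(\bar {Y}=\left (\begin {array}{l} y \\ 0 \end {array} \right )\) and \(\bar {X}=\left (\begin {array}{l} X \\ \sqrt {\gamma }I \end {array}\right )\), the dual of (3) would take the form
\[
\sup _{\theta _{1},\theta _{2}}\left \{\frac {1}{2}\|y\|_{2}^{2}-\frac {\lambda ^{2}}{2}\left \|\theta _{1}-\frac {y}{\lambda }\right \|_{2}^{2}-\frac {\lambda ^{2}}{2}\|\theta _{2}\|_{2}^{2}\;:\;|{X^{T}_{i}}\theta _{1}+\sqrt {\gamma }\,\theta _{2,i}|\leq 1,\;i=1,\ldots ,p\right \},
\]
since \(\|\bar {Y}\|_{2}=\|y\|_{2}\) and the \(i\)-th column of \(\bar {X}\) is \(({X_{i}^{T}},\sqrt {\gamma }{e_{i}^{T}})^{T}\), where \(e_{i}\) denotes the \(i\)-th standard basis vector.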
Xu, Y., Tian, Y., Pan, X. et al. E-ENDPP: a safe feature selection rule for speeding up Elastic Net. Appl Intell 49, 592–604 (2019). https://doi.org/10.1007/s10489-018-1295-y