Abstract
We consider statistical procedures for feature selection defined by a family of regularization problems with convex piecewise linear loss functions and penalties of ℓ1 type. Many known statistical procedures (e.g. quantile regression and support vector machines with ℓ1-norm penalty) are subsumed under this category. Computationally, the regularization problems are linear programming (LP) problems indexed by a single parameter, known as ‘parametric cost LP’ or ‘parametric right-hand-side LP’ in optimization theory. Exploiting this connection with LP theory, we lay out general algorithms, namely the simplex algorithm and its variant, for generating regularized solution paths for the feature selection problems. The significance of such algorithms is that they allow a complete exploration of the model space along the paths and provide a broad view of persistent features in the data. The implications of the general path-finding algorithms are outlined for several statistical procedures, and they are illustrated with numerical examples.
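To make the LP connection concrete, the following is a minimal sketch of ℓ1-penalized quantile regression posed as a single LP, solved here with an off-the-shelf solver. The function name and the use of `scipy.optimize.linprog` are illustrative assumptions, not the authors' implementation; the article's path algorithms instead trace the solution over all penalty values at once.

```python
import numpy as np
from scipy.optimize import linprog

def l1_quantile_regression(X, y, tau=0.5, lam=0.0):
    """Solve min_b sum_i rho_tau(y_i - x_i'b) + lam * ||b||_1 as an LP.

    Decision variables (all >= 0): b+, b- (p each), r+, r- (n each),
    with b = b+ - b- and residual y - Xb = r+ - r-.
    The check loss is rho_tau(u) = tau * u+ + (1 - tau) * u-.
    """
    n, p = X.shape
    # Cost vector: lam on |b| parts, tau / (1 - tau) on residual parts.
    c = np.concatenate([lam * np.ones(2 * p),
                        tau * np.ones(n), (1.0 - tau) * np.ones(n)])
    # Equality constraint: X(b+ - b-) + r+ - r- = y.
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y,
                  bounds=[(0, None)] * (2 * p + 2 * n), method="highs")
    return res.x[:p] - res.x[p:2 * p]
```

With an intercept-only design, `tau=0.5`, and no penalty, the fit reduces to the sample median, which gives a quick sanity check of the formulation.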
Acknowledgements
The authors thank the Editor and anonymous referees for helpful comments and additional references.
Lee’s research was supported in part by National Security Agency grant H98230-10-1-0202 and National Science Foundation grant DMS-12-09194.
Appendix
Lemma 1
Suppose that \(\mathcal {B}^{l+1}:=\mathcal {B}^{l}\cup\{j^{l}\}\setminus\{i^{l}\}\), where \(i^{l}:=k^{l}_{{i^{l}_{*}}}\). Let be defined as in (13). Then .
Proof
First observe that
Without loss of generality, the \({i^{l}_{*}}\)th column vector \({{\bf A}}_{i^{l}}\) of \({{\bf A}}_{\mathcal {B}^{l}}\) is replaced with \({{\bf A}}_{j^{l}}\) to give \({{\bf A}}_{\mathcal {B}^{l+1}}\). For the \({{\bf A}}_{\mathcal {B}^{l+1}}\),
where . Thus, we have
Then it immediately follows that . Hence, . □
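The inverse update at the heart of Lemma 1 is the standard product-form (eta-matrix) update for a basis matrix whose \({i^{l}_{*}}\)th column is replaced by the entering column. A generic sketch (the function name is hypothetical, and this is an illustration of the technique rather than the article's code):

```python
import numpy as np

def update_basis_inverse(B_inv, entering_col, i_star):
    """Product-form (eta) update: if the basis matrix B has column i_star
    replaced by `entering_col`, return the inverse of the updated basis."""
    u = B_inv @ entering_col            # entering column in current basis coords
    E = np.eye(len(u))                  # eta matrix: identity except column i_star
    E[:, i_star] = -u / u[i_star]
    E[i_star, i_star] = 1.0 / u[i_star]
    return E @ B_inv                    # (B_new)^{-1} = E @ B^{-1}
```

This costs O(m^2) per pivot instead of the O(m^3) of refactorizing from scratch, which is what makes path-following with repeated one-column basis changes cheap.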
8.1 Proof of (15)
For l=0,…,J−1, consider the following difference
By the intermediate calculation in Lemma 1, we can show that the difference is \(\kappa_{\lambda_{l}}{{\bf A}}^{\top}({{\bf A}}_{\mathcal {B}^{l}}^{-1} )^{\top} {{\bf e}}_{{i^{l}_{*}}}\), where
Since \(\lambda_{l}:= -\check{{{c}}}^{l}_{j^{l}}/ {\check{{{a}}}^{l}_{j^{l}}}\), \(\kappa_{\lambda_{l}}=0\), which proves (15).
8.2 Proof of Theorem 3
Let \({\mathfrak {B}}^{l}:=\mathcal {B}^{l}\cup\{j^{l}\}\) for \(l=0,\dots,J-1\), and \({\mathfrak {B}}^{J}:=\mathcal {B}^{J}\cup\{N+1\}\), where \(\mathcal {B}^{l}\), \(\mathcal {B}^{J}\), and \(j^{l}\) are as defined in the simplex algorithm. We will show that, for any fixed \(s\in[s_{l},s_{l+1})\) (or \(s\geq s_{J}\)), \({\mathfrak {B}}^{l}\) (or \({\mathfrak {B}}^{J}\)) is an optimal basic index set for the LP problem in (10).
For simplicity, let j J:=N+1, c N+1:=0, , and a N+1:=1. The inverse of
is given by
for l=0,…,J.
First, we show that \({\mbox {$\mathbb {A}$}}_{{{\mathfrak {B}}^{l}}}\) is a feasible basic index set of (10) for \(s\in[s_{l},s_{l+1}]\), i.e.
Recalling that , \(z^{l}_{j^{l}}=0\), , , and \(d^{l}_{j^{l}}=1\), we have
From and
it can be shown that
Thus, (30) is a convex combination of and for \(s\in[s_{l},s_{l+1}]\), and hence it is non-negative. This proves the feasibility of \({\mbox {$\mathbb {A}$}}_{{\mathfrak {B}}^{l}}\) for \(s\in[s_{l},s_{l+1}]\) and \(l=0,\dots,J-1\). For \(s\geq s_{J}\), we have
Next, we prove that \({\mbox {$\mathbb {A}$}}_{{\mathfrak {B}}^{l}}\) is an optimal basic index set of (10) for \(s\in[s_{l},s_{l+1}]\) by showing . For \(i=1,\dots,N\), the \(i\)th element of is
Similarly, for \(s\geq s_{J}\),
Clearly, the optimality condition holds by the non-negativity of all the elements as defined in the simplex algorithm. This completes the proof.
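The optimality certificate used in this proof — non-negativity of the reduced costs \({{\bf c}}^{\top}-{{\bf c}}_{\mathcal {B}}^{\top}{{\bf A}}_{\mathcal {B}}^{-1}{{\bf A}}\) for a minimization LP — can be sketched generically as follows (an illustration of the criterion, not the article's code; the function name is hypothetical):

```python
import numpy as np

def reduced_costs(c, A, basis):
    """Reduced costs c - y A with simplex multipliers y = c_B B^{-1}.

    For a min-cost LP in standard form, a basic feasible solution is
    optimal when every reduced cost is non-negative; the entries at the
    basic columns are exactly zero by construction.
    """
    B_inv = np.linalg.inv(A[:, basis])
    y = c[basis] @ B_inv        # simplex multipliers (dual variables)
    return c - y @ A
```

In the path setting the same check is applied over an interval of the parameter \(s\) at once: since the reduced costs are affine in \(s\), non-negativity at both endpoints certifies optimality of one basis on the whole interval.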
8.3 Proof of Theorem 4
(i) By (28), we can update the pivot rows of the tableau as follows:
If \(u^{l}_{i}=0\), the \(i\)th pivot row for \(\mathcal {B}^{l+1}\) is the same as the \(i\)th pivot row for \(\mathcal {B}^{l}\). For \(i={i^{l}_{*}}\), the \(i\)th pivot row for \({\mathcal {B}^{l+1}}\) is \((1/{u}^{l}_{i^{l}_{*}})\) times the \({i^{l}_{*}}\)th pivot row for \(\mathcal {B}^{l}\). If \(i\neq{i^{l}_{*}}\) and \({u}^{l}_{i}<0\), which implies \(-(u^{l}_{i}/{u}^{l}_{{i^{l}_{*}}})>0\), the \(i\)th pivot row for \(\mathcal {B}^{l+1}\) is lexicographically positive, since the sum of any two lexicographically positive vectors is still lexicographically positive. According to the tableau update algorithm, \({u}^{l}_{{i^{l}_{*}}}>0\), where \({i^{l}_{*}}\) is the index of the lexicographically smallest pivot row among all the pivot rows for \(\mathcal {B}^{l}\) with \({u}^{l}_{i}>0\). For \(i\neq{i^{l}_{*}}\) and \({u}^{l}_{i}>0\), by the definition of \({i^{l}_{*}}\), \(\mbox{(the ${i^{l}_{*}}$th pivot row of ${\mathcal {B}^{l}}$)}/{u}^{l}_{i^{l}_{*}}\overset{L}{<} \mbox{(the $i$th pivot row of ${\mathcal {B}^{l}}$)}/{u}^{l}_{i}\). This implies
Therefore, all the updated pivot rows are lexicographically positive.
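The leaving-row choice invoked in part (i) is the lexicographic ratio test. A small sketch of the rule (the helper name is hypothetical; `rows` stands for the pivot rows of the current tableau and `u` for the entering column's coordinates):

```python
import numpy as np

def lex_ratio_test(rows, u):
    """Among rows i with u[i] > 0, return the index i* whose scaled row
    rows[i] / u[i] is lexicographically smallest. Choosing this leaving
    row keeps every pivot row lexicographically positive after the pivot,
    which is the classical anti-cycling guarantee."""
    candidates = [i for i in range(len(u)) if u[i] > 0]
    best = candidates[0]
    for i in candidates[1:]:
        diff = rows[i] / u[i] - rows[best] / u[best]
        nz = np.flatnonzero(diff)
        if nz.size and diff[nz[0]] < 0:   # rows[i]/u[i] is lex-smaller
            best = i
    return best
```

Rows with \(u^{l}_{i}\leq 0\) are simply excluded from the comparison, matching the case analysis in the proof.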
Remark 2
If \(z^{l}_{{i^{l}}}=0\), (31) implies that \(z^{l}_{k^{l}_{i}}=z^{l+1}_{k^{l}_{i}}\) for \(i\neq{i^{l}_{*}}\), \(i\in \mathcal {M}\), and \(z^{l+1}_{j^{l}}=0\). Hence . On the other hand, if \(z^{l}_{i^{l}}>0\), then \(z^{l+1}_{j^{l}} =(z^{l}_{i^{l}}/{u}^{l}_{j^{l}})>0\) while \(z^{l}_{j^{l}}=0\) since \(j^{l}\notin \mathcal {B}^{l}\). This implies . Therefore, if and only if \(z_{i^{l}}^{l}=0\).
(ii) When the basic index set \(\mathcal {B}^{l}\) is updated to \(\mathcal {B}^{l+1}\), \(\check{{{c}}}^{l}_{j^{l}}<0\). Since \(j^{l}\in \mathcal {B}^{l+1}\), \(\check{{{c}}}^{l+1}_{j^{l}}=0\). Then, \(({{\bf c}}_{j^{l}}-{{\bf c}}_{\mathcal {B}^{l+1}}^{\top} {{\bf A}}_{\mathcal {B}^{l+1}}^{-1}{{\bf A}}_{j^{l}}) -({{\bf c}}_{j^{l}}-{{\bf c}}_{\mathcal {B}^{l}}^{\top} {{\bf A}}_{\mathcal {B}^{l}}^{-1}{{\bf A}}_{j^{l}}) =(\check{{{c}}}^{l+1}_{j^{l}}-\check{{{c}}}^{l}_{j^{l}})>0\).
As in the proof of (15),
where . \({{\bf e}}_{{i^{l}_{*}}}^{\top} {{\bf A}}_{\mathcal {B}^{l}}^{-1}{{\bf A}}\) is the \({i^{l}_{*}}\)th pivot row for \(\mathcal {B}^{l}\), which is lexicographically positive. Since the \(j^{l}\)th entry of \({{\bf e}}_{{i^{l}_{*}}}^{\top} {{\bf A}}_{\mathcal {B}^{l}}^{-1}{{\bf A}}\) is strictly positive, that of \(({{\bf c}}^{\top}-{{\bf c}}_{\mathcal {B}^{l+1}}^{\top} {{\bf A}}_{\mathcal {B}^{l+1}}^{-1}{{\bf A}})- ({{\bf c}}^{\top}-{{\bf c}}_{\mathcal {B}^{l}}^{\top} {{\bf A}}_{\mathcal {B}^{l}}^{-1}{{\bf A}})\) must share the same sign as \(\kappa^{l}\). Thus, we have \(\kappa^{l}>0\). Then the updated cost row is given as
Clearly, the cost row for \(\mathcal {B}^{l+1}\) is lexicographically greater than that for \(\mathcal {B}^{l}\).
Cite this article
Yao, Y., Lee, Y. Another look at linear programming for feature selection via methods of regularization. Stat Comput 24, 885–905 (2014). https://doi.org/10.1007/s11222-013-9408-2