
Another look at linear programming for feature selection via methods of regularization


Abstract

We consider statistical procedures for feature selection defined by a family of regularization problems with convex piecewise linear loss functions and penalties of \(\ell_1\) nature. Many known statistical procedures (e.g. quantile regression and support vector machines with the \(\ell_1\)-norm penalty) are subsumed under this category. Computationally, the regularization problems are linear programming (LP) problems indexed by a single parameter, known as ‘parametric cost LP’ or ‘parametric right-hand-side LP’ in optimization theory. Exploiting this connection with LP theory, we lay out general algorithms, namely the simplex algorithm and its variant, for generating regularized solution paths for the feature selection problems. The significance of such algorithms is that they allow a complete exploration of the model space along the paths and provide a broad view of persistent features in the data. The implications of the general path-finding algorithms are outlined for several statistical procedures, and they are illustrated with numerical examples.
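As a concrete, if deliberately naive, illustration of the class of problems described above, the sketch below casts \(\ell_1\)-penalized quantile regression as a standard-form LP and simply re-solves it with a generic solver over a small grid of penalty values. It is not the parametric simplex path algorithm developed in the paper; the helper name l1_quantile_lp, the use of scipy.optimize.linprog, and the synthetic data are assumptions made only for illustration.

import numpy as np
from scipy.optimize import linprog

def l1_quantile_lp(X, y, tau, lam):
    """Solve min over (b0, beta) of sum_i rho_tau(y_i - b0 - x_i'beta) + lam*||beta||_1 as an LP."""
    n, p = X.shape
    # variables: [u (n), v (n), b0 (1, free), beta_plus (p), beta_minus (p)]
    c = np.concatenate([tau * np.ones(n), (1 - tau) * np.ones(n),
                        [0.0], lam * np.ones(p), lam * np.ones(p)])
    # residual split: u_i - v_i = y_i - b0 - x_i'(beta_plus - beta_minus)
    A_eq = np.hstack([np.eye(n), -np.eye(n), np.ones((n, 1)), X, -X])
    bounds = [(0, None)] * (2 * n) + [(None, None)] + [(0, None)] * (2 * p)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    z = res.x
    return z[2 * n], z[2 * n + 1:2 * n + 1 + p] - z[2 * n + 1 + p:]

# brute-force "path": re-solve on a grid; the parametric simplex approach instead
# moves only between the breakpoints at which the optimal basis changes
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(50)
for lam in (10.0, 1.0, 0.1):
    b0, beta = l1_quantile_lp(X, y, tau=0.5, lam=lam)
    print(lam, int(np.sum(np.abs(beta) > 1e-8)))   # number of selected features per penalty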



Acknowledgements

The authors thank the Editor and anonymous referees for helpful comments and additional references.

Author information

Correspondence to Yoonkyung Lee.

Additional information

Lee’s research was supported in part by National Security Agency grant H98230-10-1-0202 and National Science Foundation grant DMS-12-09194.

Appendix

Lemma 1

Suppose that \(\mathcal {B}^{l+1}:=\mathcal {B}^{l}\cup\{j^{l}\}\setminus\{i^{l}\}\), where \(i^{l}:=k^{l}_{{i^{l}_{*}}}\). Let be defined as in (13). Then .

Proof

First observe that

Without loss of generality, the \({i^{l}_{*}}\)th column vector \({{\bf A}}_{i^{l}}\) of \({{\bf A}}_{\mathcal {B}^{l}}\) is replaced with \({{\bf A}}_{j^{l}}\) to give \({{\bf A}}_{\mathcal {B}^{l+1}}\). For the \({{\bf A}}_{\mathcal {B}^{l+1}}\),

(27)

where . Thus, we have

$$ {{\bf A}}_{\mathcal {B}^{l+1}}^{-1}{{\bf A}}_{\mathcal {B}^l}= \left [ \begin{array}{c@{\quad}c@{\quad}c@{\quad}c@{\quad}c@{\quad}c} 1&&-u^l_1/{u}^l_{i^l_*}&&\\&\ddots&\vdots&&\\&&1/u^l_{i^l_*}&&\\&&\vdots&\ddots&\\&&-u^l_M/{u}^l_{i^l_*} &&1 \end{array} \right ]. $$
(28)

Then it immediately follows that . Hence, . □
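The update (28) is the familiar product-form (eta-matrix) update of the basis inverse used by the simplex method. The following numerical sanity check, a sketch on randomly generated data with illustrative names only, confirms that inverting the new basis directly agrees with multiplying the old inverse by the eta matrix built from \(u^{l}={{\bf A}}_{\mathcal {B}^{l}}^{-1}{{\bf A}}_{j^{l}}\).

import numpy as np

rng = np.random.default_rng(1)
M = 5
A_B = rng.standard_normal((M, M))        # current basis matrix A_{B^l}
A_j = rng.standard_normal(M)             # entering column A_{j^l}
i_star = 2                               # position of the leaving column i^l_*

u = np.linalg.solve(A_B, A_j)            # u^l = A_{B^l}^{-1} A_{j^l}
assert abs(u[i_star]) > 1e-12            # the pivot element must be nonzero

E = np.eye(M)                            # eta matrix: identity with its
E[:, i_star] = -u / u[i_star]            # i^l_*-th column replaced as in (28)
E[i_star, i_star] = 1.0 / u[i_star]

A_B_new = A_B.copy()
A_B_new[:, i_star] = A_j                 # new basis matrix A_{B^{l+1}}

print(np.allclose(np.linalg.inv(A_B_new) @ A_B, E))                 # matches (28)
print(np.allclose(np.linalg.inv(A_B_new), E @ np.linalg.inv(A_B)))  # product-form update of the inverse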

8.1 Proof of (15)

For l=0,…,J−1, consider the following difference

By the intermediate calculation in Lemma 1, we can show that the difference is \(\kappa_{\lambda_{l}}{{\bf A}}^{\top}({{\bf A}}_{\mathcal {B}^{l}}^{-1} )^{\top} {{\bf e}}_{{i^{l}_{*}}}\), where

$$\begin{aligned} \kappa_{\lambda_l} :=&({{c}}_{{\tiny k^l_{{i^l_*}}}} +\lambda_l {{a}}_{{\tiny k^l_{{i^l_*}}}}) -\frac{{{c}}_{j^l}+\lambda_l{{a}}_{j^l}}{u^l_{{i^l_*}}}\\&{}+\sum_{i\in {\tiny (\mathcal {B}^{l+1}\setminus\{j^l\} )}} \frac{({{c}}_i+\lambda_l{{a}}_i){u}^l_i}{u^l_{{i^l_*}}} \\=& \frac{({{\bf c}}_{\mathcal {B}^l}+\lambda_l{{\bf a}}_{\mathcal {B}^l})^\top {{\bf A}}_{\mathcal {B}^l}^{-1} {{\bf A}}_{j^l}-({{c}}_{j^l}+\lambda_l{{a}}_{j^l})}{u^l_{{i^l_*}}} \\=&-\frac{\check{{{c}}}^l_{j^l}+\lambda_l\check{{{a}}}^l_{j^l}}{u^l_{{i^l_*}}}. \end{aligned}$$

Since \(\lambda_{l}:= -\check{{{c}}}^{l}_{j^{l}}/ {\check{{{a}}}^{l}_{j^{l}}}\), \(\kappa_{\lambda_{l}}=0\), which proves (15).
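For completeness, the substitution written out explicitly:

$$\kappa_{\lambda_l} =-\frac{\check{{{c}}}^l_{j^l}+\lambda_l\check{{{a}}}^l_{j^l}}{u^l_{{i^l_*}}} =-\frac{\check{{{c}}}^l_{j^l}-\bigl(\check{{{c}}}^l_{j^l}/\check{{{a}}}^l_{j^l}\bigr)\check{{{a}}}^l_{j^l}}{u^l_{{i^l_*}}} =0. $$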

8.2 Proof of Theorem 3

Let \({\mathfrak {B}}^{l}:=\mathcal {B}^{l}\cup\{j^{l}\}\) for \(l=0,\ldots,J-1\), and \({\mathfrak {B}}^{J}:=\mathcal {B}^{J}\cup\{N+1\}\), where \(\mathcal {B}^{l}\), \(\mathcal {B}^{J}\), and \(j^{l}\) are as defined in the simplex algorithm. We will show that, for any fixed \(s\in[s_{l},s_{l+1})\) (or \(s\ge s_{J}\)), \({\mathfrak {B}}^{l}\) (or \({\mathfrak {B}}^{J}\)) is an optimal basic index set for the LP problem in (10).

For simplicity, let \(j^{J}:=N+1\), \({{c}}_{N+1}:=0\), , and \({{a}}_{N+1}:=1\). The inverse of

$$\begin{aligned} {\mbox {$\mathbb {A}$}}_{{\mathfrak {B}}^l} =&\left [ \begin{array}{c@{\quad}c} {{\bf A}}_{\mathcal {B}^l} & {{\bf A}}_{j^l}\\ {{\bf a}}_{\mathcal {B}^l}^\top& {{a}}_{j^l} \end{array} \right ] \end{aligned}$$

is given by

for l=0,…,J.

First, we show that \({\mbox {$\mathbb {A}$}}_{{{\mathfrak {B}}^{l}}}\) is a feasible basic index set of (10) for \(s\in[s_{l},s_{l+1}]\), i.e.

(29)

Recalling that , \(z^{l}_{j^{l}}=0\), , , and \(d^{l}_{j^{l}}=1\), we have

(30)

From and

it can be shown that

Thus, (30) is a convex combination of and for \(s\in[s_{l},s_{l+1}]\), and hence it is non-negative. This proves the feasibility of \({\mbox {$\mathbb {A}$}}_{{\mathfrak {B}}^{l}}\) for \(s\in[s_{l},s_{l+1}]\) and \(l=0,\ldots,J-1\). For \(s\ge s_{J}\), we have

Next, we prove that \({\mbox {$\mathbb {A}$}}_{{\mathfrak {B}}^{l}}\) is an optimal basic index set of (10) for \(s\in[s_{l},s_{l+1}]\) by showing . For \(i=1,\ldots,N\), the ith element of is

Similarly, for \(s\ge s_{J}\),

$$\begin{aligned} {{c}}_i-\left [ \begin{array}{c} {{{\bf c}}_{\mathcal {B}^J}}\\0 \end{array} \right ]^\top {\mbox {$\mathbb {A}$}}_{{\mathfrak {B}}^J}^{-1}\left [ \begin{array}{c} {{\bf A}}_i\\{{a}}_i \end{array} \right ] =&{{c}}_i- {{\bf c}}_{\mathcal {B}^J}^\top {{\bf A}}_{\mathcal {B}^J}^{-1}{{\bf A}}_i \\=&\left\{ \begin{array} {l@{\quad}l} \check{{{c}}}^J_i &\mbox{for }i=1,\ldots,N\\0&\mbox{for }i=N+1. \end{array} \right. \end{aligned}$$

Clearly, the optimality condition holds by the non-negativity of all the elements as defined in the simplex algorithm. This completes the proof.
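The two certificates verified in this proof — non-negativity of the basic solution and non-negativity of the reduced costs — can be checked numerically for any candidate basic index set of a standard-form LP \(\min {{\bf c}}^{\top}{{\bf x}}\) subject to \({{\bf A}}{{\bf x}}={{\bf b}}\), \({{\bf x}}\ge 0\). The sketch below is generic; the helper name is_optimal_basis and the toy data are assumptions for illustration.

import numpy as np

def is_optimal_basis(A, b, c, B, tol=1e-9):
    """Return (feasible, optimal) for the candidate basic index set B."""
    A_B = A[:, B]
    x_B = np.linalg.solve(A_B, b)          # basic solution A_B^{-1} b
    feasible = bool(np.all(x_B >= -tol))
    y = np.linalg.solve(A_B.T, c[B])       # simplex multipliers
    reduced = c - A.T @ y                  # reduced costs c - c_B' A_B^{-1} A
    optimal = bool(np.all(reduced >= -tol))
    return feasible, optimal

# tiny example:  min x1 + x2  subject to  x1 + x2 + x3 = 1,  x >= 0
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 1.0, 0.0])
print(is_optimal_basis(A, b, c, B=[2]))    # (True, True): the basis {3} gives x = (0, 0, 1)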

8.3 Proof of Theorem 4

(i) By (28), we can update the pivot rows of the tableau as follows:

(31)

If \(u^{l}_{i}=0\), the ith pivot row of \(\mathcal {B}^{l+1}\) is the same as the ith pivot row of \(\mathcal {B}^{l}\). For \(i={i^{l}_{*}}\), the ith pivot row of \({\mathcal {B}^{l+1}}\) is \((1/{u}^{l}_{i^{l}_{*}})\) times the \({i^{l}_{*}}\)th pivot row of \(\mathcal {B}^{l}\). If \(i\neq{i^{l}_{*}}\) and \({u}^{l}_{i}<0\), which implies \(-(u^{l}_{i}/{u}^{l}_{{i^{l}_{*}}})>0\), the ith pivot row of \(\mathcal {B}^{l+1}\) is lexicographically positive since the sum of any two lexicographically positive vectors is still lexicographically positive. According to the tableau update algorithm, we have \({u}^{l}_{{i^{l}_{*}}}>0\), where \({i^{l}_{*}}\) is the index number of the lexicographically smallest pivot row among all the pivot rows for \(\mathcal {B}^{l}\) with \({u}^{l}_{i}>0\). For \(i\neq{i^{l}_{*}}\) and \({u}^{l}_{i}>0\), by the definition of \({i^{l}_{*}}\), \(\mbox{(the ${i^{l}_{*}}$th pivot row of ${\mathcal {B}^{l}}$)}/{u}^{l}_{i^{l}_{*}}\overset{L}{<} \mbox{(the $i$th pivot row of ${\mathcal {B}^{l}}$)}/{u}^{l}_{i}\). This implies

Therefore, all the updated pivot rows are lexicographically positive.
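Part (i) is, in effect, the classical argument that lexicographic positivity of the pivot rows is preserved when the leaving row is chosen by the lexicographic ratio test. A minimal sketch of that rule, on assumed random data and with illustrative names (lex_positive, lex_pivot), is given below; the final check prints True.

import numpy as np

def lex_positive(v, tol=1e-12):
    """First nonzero entry of v is positive."""
    for x in v:
        if abs(x) > tol:
            return x > 0
    return False

def lex_pivot(rows, u):
    """rows: lexicographically positive pivot rows for B^l; u = A_B^{-1} A_j."""
    candidates = [i for i in range(len(u)) if u[i] > 0]
    # leaving row: lexicographically smallest scaled row among those with u_i > 0
    i_star = min(candidates, key=lambda i: tuple(rows[i] / u[i]))
    new_rows = rows - np.outer(u / u[i_star], rows[i_star])   # update for rows i != i_star
    new_rows[i_star] = rows[i_star] / u[i_star]               # update for row i_star
    return i_star, new_rows

rng = np.random.default_rng(2)
rows = np.abs(rng.standard_normal((4, 6))) + 0.1   # lexicographically positive rows
u = rng.standard_normal(4)
u[0] = abs(u[0]) + 0.1                             # ensure at least one u_i > 0
i_star, new_rows = lex_pivot(rows, u)
print(all(lex_positive(r) for r in new_rows))      # True: positivity is preserved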

Remark 2

If \(z^{l}_{{i^{l}}}=0\), (31) implies that \(z^{l}_{k^{l}_{i}}=z^{l+1}_{k^{l}_{i}}\) for \(i\neq{i^{l}_{*}}\), \(i\in \mathcal {M}\), and \(z^{l+1}_{j^{l}}=0\). Hence . On the other hand, if \(z^{l}_{i^{l}}>0\), \(z^{l+1}_{j^{l}} =(z^{l}_{i^{l}}/{u}^{l}_{j^{l}})>0\) while \(z^{l}_{j^{l}}=0\) since \(j^{l}\notin \mathcal {B}^{l}\). This implies . Therefore, if and only if \(z_{i^{l}}^{l}=0\).

(ii) When the basic index set \(\mathcal {B}^{l}\) is updated to \(\mathcal {B}^{l+1}\), \(\check{{{c}}}^{l}_{j^{l}}<0\). Since \(j^{l}\in \mathcal {B}^{l+1}\), \(\check{{{c}}}^{l+1}_{j^{l}}=0\). Then, \(({{\bf c}}_{j^{l}}-{{\bf c}}_{\mathcal {B}^{l+1}}^{\top} {{\bf A}}_{\mathcal {B}^{l+1}}^{-1}{{\bf A}}_{j^{l}}) -({{\bf c}}_{j^{l}}-{{\bf c}}_{\mathcal {B}^{l}}^{\top} {{\bf A}}_{\mathcal {B}^{l}}^{-1}{{\bf A}}_{j^{l}}) =(\check{{{c}}}^{l+1}_{j^{l}}-\check{{{c}}}^{l}_{j^{l}})>0\).

As in the proof of (15),

$$\bigl({{\bf c}}^\top-{{\bf c}}_{\mathcal {B}^{l+1}}^\top {{\bf A}}_{\mathcal {B}^{l+1}}^{-1}{{\bf A}}\bigr)- \bigl({{\bf c}}^\top- {{\bf c}}_{\mathcal {B}^l}^\top {{\bf A}}_{\mathcal {B}^l}^{-1}{{\bf A}}\bigr) = \kappa^l{{\bf e}}_{{i^l_*}}^\top {{\bf A}}_{\mathcal {B}^l}^{-1} {{\bf A}}, $$

where . \({{\bf e}}_{{i^{l}_{*}}}^{\top} {{\bf A}}_{\mathcal {B}^{l}}^{-1}{{\bf A}}\) is the \({i^{l}_{*}}\)th pivot row for \(\mathcal {B}^{l}\), which is lexicographically positive. Since the \(j^{l}\)th entry of \({{\bf e}}_{{i^{l}_{*}}}^{\top} {{\bf A}}_{\mathcal {B}^{l}}^{-1}{{\bf A}}\) is strictly positive, that of \(({{\bf c}}^{\top}-{{\bf c}}_{\mathcal {B}^{l+1}}^{\top} {{\bf A}}_{\mathcal {B}^{l+1}}^{-1}{{\bf A}})- ({{\bf c}}^{\top}-{{\bf c}}_{\mathcal {B}^{l}}^{\top} {{\bf A}}_{\mathcal {B}^{l}}^{-1}{{\bf A}})\) must have the same sign as \(\kappa^{l}\). Thus, we have \(\kappa^{l}>0\). Then the updated cost row is given as

Clearly, the cost row for \(\mathcal {B}^{l+1}\) is lexicographically greater than that for \(\mathcal {B}^{l}\).

Cite this article

Yao, Y., Lee, Y. Another look at linear programming for feature selection via methods of regularization. Stat Comput 24, 885–905 (2014). https://doi.org/10.1007/s11222-013-9408-2
