
A sparse logistic regression framework by difference of convex functions programming


Abstract

Feature selection for logistic regression (LR) remains a challenging problem. In this paper, we present a new feature selection method for logistic regression based on a combination of zero-norm and l2-norm regularization. Because the zero-norm is discontinuous, the resulting problem is difficult to solve directly; we therefore replace the zero-norm with a suitable nonconvex approximation and derive a robust difference of convex functions (DC) program. A DC optimization algorithm (DCA) is then used to solve the problem efficiently, and the resulting DCA converges linearly. Numerical experiments on benchmark datasets show that, compared with traditional methods, the proposed method reduces the number of input features while maintaining classification accuracy. Furthermore, as a practical application, the proposed method is used to classify licorice seeds directly from near-infrared spectroscopy data. Results over different spectral regions show that the proposed method achieves classification performance equivalent to traditional logistic regression while discarding more features. These results demonstrate the feasibility and effectiveness of the proposed method.
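To make the algorithmic idea concrete, the following is a minimal sketch (not the authors' implementation) of the DCA outer loop for this model. It assumes, for illustration only, the exponential zero-norm surrogate 1 − exp(−α|w_j|); its DC decomposition α|w_j| − (α|w_j| − 1 + exp(−α|w_j|)) gives the linearization vector v^k that also appears in the convex subproblem (42) of the appendix. All names, default parameters, and the generic subproblem solver below are illustrative.

# Sketch of the DCA outer loop for sparse logistic regression with an
# l2 penalty plus a nonconvex zero-norm surrogate (illustrative only).
import numpy as np
from scipy.optimize import minimize

def dca_sparse_logreg(X, y, lam=1.0, mu=1.0, alpha=5.0, iters=20):
    """X: (m, n) array of samples x_i, y: numpy labels in {-1, +1}."""
    m, n = X.shape
    b, w = 0.0, np.zeros(n)
    for _ in range(iters):
        # DCA linearization of the concave part of the surrogate at w^k
        v = alpha * np.sign(w) * (1.0 - np.exp(-alpha * np.abs(w)))
        def subproblem(p):            # convex subproblem, cf. (42)
            b_, w_ = p[0], p[1:]
            z = b_ + X @ w_
            loss = -z[y == 1].sum() + np.logaddexp(0.0, z).sum()
            return (loss + lam * (w_ @ w_)
                    + mu * alpha * np.abs(w_).sum() - mu * (v @ w_))
        p = minimize(subproblem, np.concatenate(([b], w)), method="Powell").x
        b, w = p[0], p[1:]
    return b, w

In the paper, the convex subproblem is instead solved by the primal-dual interior-point method described in the appendix; the derivative-free solver above is used only to keep the sketch short.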



Acknowledgments

This work is supported by the National Natural Science Foundation of China (11471010, 11271367).

Author information


Corresponding author

Correspondence to Liming Yang.

Appendix: The primal-dual interior-point method for solving convex problem (32)


Note that p(y = 1 | x) + p(y = −1 | x) = 1, and thus problem (32) can be written as:

$$\begin{array}{@{}rcl@{}} \min\limits_{b,\mathbf{w},\mathbf{t}}&&\left\{G(b,\mathbf{w},\mathbf{t})-\langle \mathbf{v}^{k},\mathbf{w}\rangle: (b,\mathbf{w},\mathbf{t})\in {\Omega}\right\}\\ =\min\limits_{b,\mathbf{w},\mathbf{t}}&&\left\{-\sum\limits_{y_{i}=1} \left(b+ \mathbf{w}^{T} \mathbf{x}_{i}\right)+\sum\limits_{i}\log\left( 1+e^{b+\mathbf{w}^{T} \mathbf{x}_{i}}\right)+\lambda \|\mathbf{w}\|_{2}^{2}\right.\\ &&\left.+\,\mu \sum\limits_{j=1}^{n} \alpha | w_{j}|-\mu\langle \mathbf{v}^{k},\mathbf{w}\rangle: (b,\mathbf{w},\mathbf{t})\in {\Omega}\right\}\\ =\min\limits_{b,\mathbf{w},\mathbf{t}}&&\left\{-\sum\limits_{y_{i}=1} \left( b+\mathbf{w}^{T} \mathbf{x}_{i}\right)+\sum\limits_{i}\log\left( 1+e^{b+\mathbf{w}^{T} \mathbf{x}_{i}}\right)+\lambda \|\mathbf{w}\|_{2}^{2}\right.\\ &&\left.+\,\mu\sum\limits_{j=1}^{n}t_{j}-\mu\langle \mathbf{v}^{k},\mathbf{w}\rangle : (b,\mathbf{w},\mathbf{t})\in {\Omega}\right\} \end{array} $$
(42)

where:

$${\Omega}=\left\{(b,\mathbf{w},\mathbf{t})\in R^{2n+1}:-\alpha w_{j}\leq t_{j},\ \alpha w_{j}\leq t_{j},\ j=1,\dots,n \right\} $$
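For completeness, the logistic-loss term in (42) follows from the standard logistic model (consistent with the loss shown above):

$$p(y=1\,|\,\mathbf{x}_{i})=\frac{e^{b+\mathbf{w}^{T}\mathbf{x}_{i}}}{1+e^{b+\mathbf{w}^{T}\mathbf{x}_{i}}},\qquad p(y=-1\,|\,\mathbf{x}_{i})=\frac{1}{1+e^{b+\mathbf{w}^{T}\mathbf{x}_{i}}}, $$

so the negative log-likelihood is

$$-\sum\limits_{i}\log p(y_{i}\,|\,\mathbf{x}_{i})=-\sum\limits_{y_{i}=1}\left(b+\mathbf{w}^{T}\mathbf{x}_{i}\right)+\sum\limits_{i}\log\left(1+e^{b+\mathbf{w}^{T}\mathbf{x}_{i}}\right), $$

which is exactly the loss appearing in (42). The auxiliary variables t_j and the constraints −αw_j ≤ t_j, αw_j ≤ t_j in Ω are an epigraph reformulation: at the optimum t_j = α|w_j|, so the nondifferentiable absolute values are removed from the objective.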

Let x = (b, w, t) with x ∈ R^{2n+1}, and define

$$\begin{array}{@{}rcl@{}} F(\mathbf{x})&=&-\sum\limits_{y_{i}=1} \left( b+\mathbf{w}^{T} \mathbf{x}_{i}\right)+\sum\limits_{i}\log\left( 1+e^{b+\mathbf{w}^{T}\mathbf{x}_{i}}\right)+\lambda \|\mathbf{w} \|_{2}^{2}\\ &&+\,\mu\sum\limits_{j=1}^{n}t_{j}-\mu\langle \mathbf{v}^{k},\mathbf{w}\rangle \end{array} $$
(43)
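For readers implementing the method, F in (43) is straightforward to evaluate numerically. A minimal sketch, assuming numpy arrays X (rows x_i), labels y in {−1, +1}, and a fixed linearization vector v^k (the names below are illustrative, not the authors' code):

import numpy as np

def F_value(b, w, t, X, y, lam, mu, v_k):
    # Objective F(b, w, t) of (43): logistic loss + l2 penalty
    # + mu * sum(t) - mu * <v^k, w>.
    z = b + X @ w
    loss = -z[y == 1].sum() + np.logaddexp(0.0, z).sum()   # stable log(1 + e^z)
    return loss + lam * (w @ w) + mu * t.sum() - mu * (v_k @ w)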

Then problem (42) is equivalent to

$$\begin{array}{@{}rcl@{}} \min\limits_{b,\mathbf{w},\mathbf{t}}&& F(b,\mathbf{w},\mathbf{t})\\ \textit{s.t.}&& -\alpha w_{j}\leq t_{j},\ \alpha w_{j}\leq t_{j},\ j=1,\dots,n \end{array} $$
(44)

Introducing a Lagrange multiplier vector s with components s_i (s_i ≥ 0), the Lagrangian function for problem (44) can be expressed as

$$\begin{array}{@{}rcl@{}} F(\mathbf{x})-\mathbf{s}^{T}A\mathbf{x},\quad \mathbf{s}\geq 0,\ \mathbf{s} \in R^{2n} \end{array} $$
(45)

where

$$\begin{array}{@{}rcl@{}} A=\left( \begin{array}{ccc} 0_{n\times1} &\alpha I_{n\times n}&I_{n\times n}\\ 0_{n\times 1} &-\alpha I_{n\times n}&I_{n\times n} \end{array} \right) \end{array} $$
(46)

where I_{n×n} is the n×n identity matrix and 0_{n×1} is the n×1 zero vector. The first-order necessary optimality conditions for problem (44) are

$$\begin{array}{@{}rcl@{}} \nabla F(\mathbf{x})-A^{T}\mathbf{s}=0 \\ \mathbf{s}^{T}A\mathbf{x}=0 ,\ \mathbf{s}\geq 0 \end{array} $$
(47)
$$\begin{array}{@{}rcl@{}} A\mathbf{x}\geq 0 \end{array} $$
(48)

where

$$\begin{array}{@{}rcl@{}} \nabla F(\mathbf{x})=\left( \begin{array}{c} -{\sum}_{y_{i}=1}1+{\sum}_{i}\dfrac{e^{b+\mathbf{w}^{T} \mathbf{x}_{i}}}{1+e^{b+\mathbf{w}^{T} \mathbf{x}_{i}}}\\ -{\sum}_{y_{i}=1}\mathbf{x}_{i}+{\sum}_{i}\dfrac {e^{b+\mathbf{w}^{T} \mathbf{x}_{i}}}{1+e^{b+\mathbf{w}^{T} \mathbf{x}_{i}}}\mathbf{x}_{i}+2\lambda \mathbf{w}-\mu \mathbf{v}^{k}\\ \mu \boldsymbol{\xi}_{n\times1} \end{array} \right) \end{array} $$
(49)

where ξ_{n×1} denotes the n×1 vector of all ones (the last block being the gradient of μ∑_{j=1}^{n} t_j with respect to t). Letting Ax = z, system (47) can be written as

$$\begin{array}{@{}rcl@{}} \nabla F(\mathbf{x})-A^{T}\mathbf{s}=0 \\ A\mathbf{x}=\mathbf{z},\mathbf{z}\geq 0 \end{array} $$
(50)
$$\begin{array}{@{}rcl@{}} \mathbf{s}^{T}\mathbf{z}=0 ,\mathbf{s}\geq 0 \end{array} $$
(51)
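The gradient (49) involves only the logistic sigmoid. A sketch of its evaluation under the same illustrative assumptions as before (components stacked in the order (b, w, t); names are illustrative, not the authors' code):

import numpy as np

def F_gradient(b, w, t, X, y, lam, mu, v_k):
    # Gradient of F from (49), stacked as (dF/db, dF/dw, dF/dt).
    z = b + X @ w
    sig = 1.0 / (1.0 + np.exp(-z))           # e^z / (1 + e^z)
    g_b = -(y == 1).sum() + sig.sum()
    g_w = -X[y == 1].sum(axis=0) + X.T @ sig + 2.0 * lam * w - mu * v_k
    g_t = mu * np.ones_like(t)
    return np.concatenate(([g_b], g_w, g_t))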

According to the primal-dual interior-point algorithm, we replace the complementarity condition s^T z = 0 in the system (50)–(51) by s_i z_i = σ, where σ > 0 is the barrier parameter, and then obtain

$$\begin{array}{@{}rcl@{}} \nabla F(\mathbf{x})-A^{T}\mathbf{s}=0 \\ A\mathbf{x}=\mathbf{z} ,\mathbf{z} \geq 0 \end{array} $$
(52)
$$\begin{array}{@{}rcl@{}} s_{i}z_{i}=\sigma,\ s_{i}\geq 0 \end{array} $$
(53)

where z_i is the i-th component of z. The system (52)–(53) is a perturbation of the first-order optimality conditions (47). For a fixed σ, Newton's method is used to solve this system, and σ is then decreased toward 0; in the limit we obtain an approximate solution of system (47). At each iteration, the Newton direction is obtained by solving:

$$\begin{array}{@{}rcl@{}} \nabla^{2}F(\mathbf{x}){\Delta} \mathbf{x}-A^{T}{\Delta} \mathbf{s}=-\nabla F(\mathbf{x})+A^{T}\mathbf{s} \\ A{\Delta} \mathbf{x}-{\Delta} \mathbf{z}=\mathbf{z}-A\mathbf{x},\ \mathbf{z}\geq 0 \end{array} $$
(54)
$$\begin{array}{@{}rcl@{}} s_{i}{\Delta} z_{i}+ z_{i}{\Delta} s_{i}=\sigma-s_{i}z_{i},\ s_{i} \geq 0 \end{array} $$
(55)
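As a rough illustration of one Newton step of the scheme above, the linear system (54)–(55) can be assembled and solved directly. The sketch below assumes A built from (46), callables for ∇F and ∇²F, and a barrier parameter σ, and omits the step-length and barrier-update rules; all names are illustrative, not the authors' code.

import numpy as np

def newton_direction(x, s, z, A, grad_F, hess_F, sigma):
    # Solve (54)-(55) for the direction (dx, ds, dz).
    # x: primal variables (b, w, t), length 2n+1; s, z: slack/dual, length 2n.
    N, M = x.size, s.size
    g, H = grad_F(x), hess_F(x)
    K = np.zeros((N + 2 * M, N + 2 * M))
    r = np.zeros(N + 2 * M)
    K[:N, :N] = H
    K[:N, N:N + M] = -A.T
    r[:N] = -g + A.T @ s                       # first block of (54)
    K[N:N + M, :N] = A
    K[N:N + M, N + M:] = -np.eye(M)
    r[N:N + M] = z - A @ x                     # second block of (54)
    K[N + M:, N:N + M] = np.diag(z)            # z_i * ds_i
    K[N + M:, N + M:] = np.diag(s)             # s_i * dz_i
    r[N + M:] = sigma - s * z                  # (55)
    d = np.linalg.solve(K, r)
    return d[:N], d[N:N + M], d[N + M:]

A damped step along (dx, ds, dz) is then taken so that s and z remain positive, and σ is decreased toward zero across iterations.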


Cite this article

Yang, L., Qian, Y. A sparse logistic regression framework by difference of convex functions programming. Appl Intell 45, 241–254 (2016). https://doi.org/10.1007/s10489-016-0758-2
