Subspace quadratic regularization method for group sparse multinomial logistic regression

Abstract

Sparse multinomial logistic regression has recently received widespread attention. It provides a useful tool for solving multi-classification problems in various fields, such as signal and image processing, machine learning and disease diagnosis. In this paper, we first study the group sparse multinomial logistic regression model and establish its optimality conditions. Based on the theoretical results for this model, we then propose an efficient algorithm, the subspace quadratic regularization algorithm, to compute a stationary point of the given problem. This algorithm enjoys excellent convergence properties, including global convergence and local quadratic convergence. Finally, our numerical results on standard benchmark data clearly demonstrate the superior performance of the proposed algorithm in terms of logistic loss value, sparsity recovery and computational time.

Notes

  1. http://featureselection.asu.edu/.

  2. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html.

References

  1. Bahmani, S., Raj, B., Boufounos, P.: Greedy sparsity-constrained optimization. J. Mach. Learn. Res. 14, 807–841 (2013)

  2. Beck, A., Hallak, N.: Optimization problems involving group sparsity terms. Math. Program. 178, 39–67 (2019)

  3. Blondel, M., Seki, K., Uehara, K.: Block coordinate descent algorithms for large-scale sparse multiclass classification. Mach. Learn. 93, 31–52 (2013)

  4. Böhning, D.: Multinomial logistic regression algorithm. Ann. Inst. Stat. Math. 44(1), 197–200 (1992)

  5. Breheny, P., Huang, J.: Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Stat. Comput. 25(2), 173–187 (2015)

  6. Byrne, E., Schniter, P.: Sparse multinomial logistic regression via approximate message passing. IEEE Trans. Signal Process. 64(21), 5485–5498 (2016)

  7. Cawley, G.C., Talbot, N.L., Girolami, M.: Sparse multinomial logistic regression via Bayesian L1 regularisation. Adv. Neural Inf. Process. Syst. pp. 209–216 (2007)

  8. Chen, X.J., Pan, L.L., Xiu, N.H.: Solution sets of three sparse optimization problems for multivariate regression. Technical Report, Department of Applied Mathematics, The Hong Kong Polytechnic University (2020)

  9. Freedman, D.A.: Statistical Models: Theory and Practice. Cambridge University Press, Cambridge (2009)

  10. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)

  11. Harrell, F.E., Jr.: Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer, Berlin (2015)

  12. Kanzow, C., Qi, H.D.: A QP-free constrained Newton-type method for variational inequality problems. Math. Program. 85(1), 81–106 (1999)

  13. Krishnapuram, B., Carin, L., Figueiredo, M., Hartemink, A.: Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Trans. Pattern Anal. Mach. Intell. 27(6), 957–968 (2005)

  14. Kwak, C., Clayton-Matthews, A.: Multinomial logistic regression. Nurs. Res. 51(6), 404–410 (2002)

  15. Lee, J.D., Sun, Y., Saunders, M.A.: Proximal Newton-type methods for minimizing composite functions. SIAM J. Optim. 24, 1420–1443 (2014)

  16. Li, F.F., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: 2004 Conference on Computer Vision and Pattern Recognition Workshop, IEEE (2004)

  17. Li, H., Lin, Z.: Accelerated proximal gradient methods for nonconvex programming. Adv. Neural Inf. Process. Syst. 379–387 (2015)

  18. Li, J., Bioucas-Dias, J.M., Plaza, A.: Semi-supervised hyperspectral image classification using a new (soft) sparse multinomial logistic regression model. In: 2011 3rd Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), pp. 1–4. IEEE (2011)

  19. McCullagh, P., Nelder, J.A.: Generalized Linear Models, 2nd edn. Chapman and Hall, New York (1989)

  20. Moré, J.J., Sorensen, D.C.: Computing a trust region step. SIAM J. Sci. Stat. Comput. 4(3), 553–572 (1983)

  21. Natarajan, B.K.: Sparse approximate solutions to linear systems. SIAM J. Comput. 24(2), 227–234 (1995)

  22. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (1999)

  23. Obozinski, G., Taskar, B., Jordan, M.: Joint covariate selection and joint subspace selection for multiple classification problems. Stat. Comput. 20(2), 231–252 (2010)

  24. Pang, T., Nie, F., Han, J., Li, X.: Efficient feature selection via \(l _{2,0}\)-norm constrained sparse regression. IEEE Trans. Knowl. Data Eng. 31(5), 880–893 (2019)

  25. Rockafellar, R.T., Wets, R.J.: Variational Analysis. Springer, New York (1998)

  26. Ryali, S., Supekar, K., Abrams, D.A., Menon, V.: Sparse logistic regression for whole-brain classification of fMRI data. NeuroImage 51(2), 752–764 (2010)

  27. Simon, N., Friedman, J., Hastie, T.: A blockwise descent algorithm for group-penalized multiresponse and multinomial regression. arXiv preprint arXiv:1311.6529 (2013)

  28. Sun, Y., Babu, P., Palomar, D.P.: Majorization-minimization algorithms in signal processing, communications, and machine learning. IEEE Trans. Signal Process. 65(3), 794–816 (2016)

  29. Tutz, G., Pößnecker, W., Uhlmann, L.: Variable selection in general multinomial logit models. Comput. Stat. Data Anal. 82, 207–222 (2015)

  30. Vincent, M., Hansen, N.R.: Sparse group lasso and high dimensional multinomial classification. Comput. Stat. Data Anal. 71, 771–786 (2014)

  31. Wang, R., Xiu, N.H., Zhou, S.L.:  An extended Newton-type algorithm for l2-regularized sparse logistic regression and its efficiency for classifying large-scale datasets. J. Comput. Appl. Math. 397, 113656 (2021)


Acknowledgements

We sincerely thank the associate editor and two referees for their detailed comments, which have helped to improve this paper. The research of Rui Wang and Naihua Xiu is partially supported by the National Natural Science Foundation of China (11971052) and the Beijing Natural Science Foundation (Z190002). The research of Kim-Chuan Toh is partially supported by the Ministry of Education of Singapore under ARF Grant Number R-146-000-257-112.

Author information

Corresponding author

Correspondence to Rui Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Proofs of Lemmas 1 and 2

Proof of Lemma 1:

Since the results of Lemma 1 (i)–(iii) follow by direct calculation, we omit their proofs and give a detailed derivation of part (iv).

Denote

$$\begin{aligned} C^{(k)}(W)={\mathrm {Diag}}\Big (\frac{\exp ({w^{(k)}}^{\top }x_i)}{({\mathbf {1}}^{\top }\exp (W^{\top }x_i)+1)^{2}}\Big |\ i=1,2,\ldots ,n\Big ). \end{aligned}$$
(27)

If no confusion arises, we use the simple notation \(A=A(W),\ B=B(W),\ C=C(W)\). Since \(A^{(k)}=C^{(k)}-\sum \limits _{\begin{array}{c} j\ne k\\ j=1 \end{array}}^{m-1}B^{(k,j)},\ k=1,2,\ldots ,(m-1)\), we have

$$\begin{aligned} X^{\top }A^{(k)}X=X^{\top }(C^{(k)}-\sum \limits _{\begin{array}{c} j\ne k\\ j=1 \end{array}}^{m-1}B^{(k,j)})X. \end{aligned}$$
(28)

Therefore, we can rewrite the Hessian matrix as

$$\begin{aligned} \nabla ^{2}\ell (W)=\left( \begin{array}{ccc} X^{\top }(C^{(1)}-\sum \limits _{j=2}^{m-1}B^{(1,j)})X & \cdots & X^{\top }B^{(1,m-1)}X \\ \vdots & \ddots & \vdots \\ X^{\top }B^{(m-1,1)}X & \cdots & X^{\top }(C^{(m-1)}-\sum \limits _{j=1}^{m-2}B^{(m-1,j)})X \\ \end{array} \right) . \end{aligned}$$

For any \(z=(z_1; z_2;\ldots ;z_{m-1}) \in {\mathbb {R}}^{p(m-1)}\) with \(z_i \in {\mathbb {R}}^{p},\ i=1,2,\ldots ,(m-1)\), we have

$$\begin{aligned} z^{\top }\nabla ^{2}\ell (W)z &= \sum \limits _{k=1}^{m-1}z_k^\top X^{\top }C^{(k)}Xz_k -\sum \limits _{k=1}^{m-2}\sum \limits _{j=k+1}^{m-1}(z_k-z_j)^\top X^{\top }B^{(k,j)}X(z_k-z_j)\\ &\ge 0 \quad (\text {since}\ C^{(k)}\succ {\mathbf {0}}\ \text {and}\ B^{(k,j)}\preceq {\mathbf {0}}), \end{aligned}$$
(29)

which means that \(\nabla ^{2}\ell (W)\succeq {\mathbf {0}}\).
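
As a sanity check on this block structure, the following minimal Python sketch (not part of the paper; the data X, W and all variable names are hypothetical) assembles \(\nabla ^{2}\ell (W)\) from the diagonal matrices \(C^{(k)}\) and \(B^{(k,j)}\) in (27)–(28) and confirms numerically that its smallest eigenvalue is nonnegative, consistent with (29). It takes \(B^{(k,j)}(W)=-{\mathrm {Diag}}\big (\exp (\mathbf{z }^{k})\exp (\mathbf{z }^{j})/({\mathbf {1}}^{\top }\exp (\mathbf{z })+1)^{2}\big )\), which matches the function g and the sign condition \(B^{(k,j)}\preceq {\mathbf {0}}\) used in the proof.

```python
import numpy as np

# Hypothetical sizes: n samples, p features, m classes (class m is the reference).
n, p, m = 50, 8, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))        # row i is x_i^T
W = rng.standard_normal((p, m - 1))    # column k is w^(k)

Z = X @ W                              # Z[i, k] = w^(k)^T x_i
denom = 1.0 + np.exp(Z).sum(axis=1)    # 1^T exp(W^T x_i) + 1

# Diagonal weight matrices from (27), stored as length-n vectors.
C = np.exp(Z) / denom[:, None] ** 2    # C[i, k] is the i-th diagonal entry of C^(k)(W)

def B(k, j):
    """Diagonal of B^(k,j)(W) = -Diag(exp(z^k) exp(z^j) / (1^T exp(z) + 1)^2)."""
    return -np.exp(Z[:, k]) * np.exp(Z[:, j]) / denom ** 2

# Assemble the (m-1)p x (m-1)p Hessian block by block, with
# A^(k) = C^(k) - sum_{j != k} B^(k,j) on the diagonal blocks, as in (28).
H = np.zeros(((m - 1) * p, (m - 1) * p))
for k in range(m - 1):
    Ak = C[:, k] - sum(B(k, j) for j in range(m - 1) if j != k)
    H[k * p:(k + 1) * p, k * p:(k + 1) * p] = X.T @ (Ak[:, None] * X)
    for j in range(m - 1):
        if j != k:
            H[k * p:(k + 1) * p, j * p:(j + 1) * p] = X.T @ (B(k, j)[:, None] * X)

print(np.linalg.eigvalsh(H).min())     # nonnegative up to rounding, consistent with (29)
```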

To verify the second inequality of (3), we recall that Böhning [4] has shown that the Hessian matrix is upper bounded by a positive definite matrix that does not depend on W, i.e.,

$$\begin{aligned} \nabla ^{2} \ell (W) \preceq \frac{1}{2}[I_{m-1}-{\mathbf {1}}{\mathbf {1}}^{\top }/m]\otimes X^{\top }X. \end{aligned}$$
(30)
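
Böhning's bound (30) can also be checked numerically. The short sketch below (again with hypothetical data) recomputes the Hessian through the per-sample form \(\sum _i\big ({\mathrm {Diag}}(p_i)-p_ip_i^{\top }\big )\otimes x_ix_i^{\top }\), a standard expression for the multinomial logistic Hessian that is assumed here to agree with the block form above, and verifies that the difference between the bound and the Hessian is positive semidefinite.

```python
import numpy as np

n, p, m = 50, 8, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))
W = rng.standard_normal((p, m - 1))

E = np.exp(X @ W)
P = E / (1.0 + E.sum(axis=1))[:, None]   # P[i, k] = exp(w^(k)^T x_i) / (1^T exp(W^T x_i) + 1)

# Hessian via the per-sample Kronecker form (assumed equivalent to the block form above).
H = sum(np.kron(np.diag(P[i]) - np.outer(P[i], P[i]), np.outer(X[i], X[i]))
        for i in range(n))

# Boehning's bound (30): (1/2) [I_{m-1} - 1 1^T / m] (Kronecker product) X^T X.
bound = 0.5 * np.kron(np.eye(m - 1) - np.ones((m - 1, m - 1)) / m, X.T @ X)

print(np.linalg.eigvalsh(bound - H).min())   # nonnegative up to rounding, i.e. H <= bound
```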

We next prove the Lipschitz continuity of the Hessian matrix. Denote

$$\begin{aligned} h(\mathbf{z })=\displaystyle \frac{\exp (\mathbf{z }^k)}{({\mathbf {1}}^{\top }\exp (\mathbf{z })+1)^{2}},\,\,\, g(\mathbf{z })=\displaystyle \frac{\exp (\mathbf{z }^k)\exp (\mathbf{z }^j)}{({\mathbf {1}}^{\top }\exp (\mathbf{z })+1)^{2}}, \end{aligned}$$

where \(\mathbf{z }=(\mathbf{z }^{1},\mathbf{z }^{2},\ldots ,\mathbf{z }^{m-1})^{\top }=W^{\top }x_i \in {\mathbb {R}}^{m-1}\) with \(\mathbf{z }^k={w^{(k)}}^{\top }x_i\) for \(i \in \{1,2,\ldots ,n\}\). We first consider the following two cases.

Case 1. If \(j\ne k\),

$$\begin{aligned} \Big |\frac{\partial h(\mathbf{z })}{\partial \mathbf{z }^j}\Big |=\Big |\frac{-2\exp (\mathbf{z }^k)\exp (\mathbf{z }^j)}{({\mathbf {1}}^{\top }\exp (\mathbf{z })+1)^{3}}\Big | \le 2. \end{aligned}$$

Case 2. If \(j= k\),

$$\begin{aligned} \Big |\frac{\partial h(\mathbf{z })}{\partial \mathbf{z }^j}\Big |=\Big |\frac{\exp (\mathbf{z }^k)}{({\mathbf {1}}^{\top }\exp (\mathbf{z })+1)^{2}}-\frac{2\exp ^2(\mathbf{z }^k)}{({\mathbf {1}}^{\top }\exp (\mathbf{z })+1)^{3}}\Big | \le 2. \end{aligned}$$

Hence, \(\Vert \nabla h(\mathbf{z } )\Vert \le 2\sqrt{m-1}.\) Then by the mean value theorem, there exists \(\theta \in (0,1)\) such that for any \(i \in \{1,2,\ldots ,n\}\),

$$\begin{aligned} h(W^{\top }x_i)-h(\widehat{W}^{\top }x_i)& = h(\mathbf{z })-h(\hat{\mathbf{z }})=\nabla h(\mathbf{z }+\theta (\hat{\mathbf{z }}-\mathbf{z }))^{\top }(\mathbf{z }-\hat{\mathbf{z }})\\& \le \Vert \nabla h(\mathbf{z }+\theta (\hat{\mathbf{z }}-\mathbf{z }) )\Vert \Vert \mathbf{z }-\hat{\mathbf{z }}\Vert \le 2\sqrt{m-1}\Vert \mathbf{z }-\hat{\mathbf{z }}\Vert \\ & \le 2\sqrt{m-1}\Vert W-\widehat{W}\Vert \Vert x_i\Vert \le \varpi \Vert W-\widehat{W}\Vert , \end{aligned}$$

where \(\varpi := 2\sqrt{m-1}\Vert X\Vert _\infty\). Similarly, we have

$$\begin{aligned} g(W^{\top }x_i)-g(\widehat{W}^{\top }x_i)\le \varpi \Vert W-\widehat{W}\Vert . \end{aligned}$$

Notice that for any \(k,j \in \{1,2,\ldots ,m-1\}\),

$$\begin{aligned} \Vert C^{(k)}(W)-C^{(k)}(\widehat{W})\Vert _2 &=\max \limits _{i \in \{1,2,\ldots ,n\}}\{h(W^{\top }x_i)-h(\widehat{W}^{\top }x_i)\} \le \varpi \Vert W-\widehat{W}\Vert ,\\ \Vert B^{(k,j)}(W)-B^{(k,j)}(\widehat{W})\Vert _2 &=\max \limits _{i \in \{1,2,\ldots ,n\}}\{g(W^{\top }x_i)-g(\widehat{W}^{\top }x_i)\} \le \varpi \Vert W-\widehat{W}\Vert . \end{aligned}$$
(31)

Therefore,

$$\begin{aligned}&\Vert \nabla ^{2}\ell (W)-\nabla ^{2}\ell (\widehat{W})\Vert \\&\quad \le \Big \Vert \sum \limits _{k=1}^{m-1} X^{\top }(A^{(k)}(W)-A^{(k)}(\widehat{W}))X\Big \Vert +2\Big \Vert \sum \limits _{k=1}^{m-2}\sum \limits _{j=k+1}^{m-1} X^{\top }(B^{(k,j)}(W)-B^{(k,j)}(\widehat{W}))X \Big \Vert \\&\quad \overset{(28)}{\le } \Big \Vert \sum \limits _{k=1}^{m-1} X^{\top }(C^{(k)}(W)-C^{(k)}(\widehat{W}))X\Big \Vert +4\Big \Vert \sum \limits _{k=1}^{m-2}\sum \limits _{j=k+1}^{m-1} X^{\top }(B^{(k,j)}(W)-B^{(k,j)}(\widehat{W}))X \Big \Vert \\&\quad \le (m-1)\max \limits _{k,j} \big \{\Vert X^{\top }(C^{(k)}(W)-C^{(k)}(\widehat{W}))X\Vert +2(m-2)\Vert X^{\top }(B^{(k,j)}(W)-B^{(k,j)}(\widehat{W}))X \Vert \big \}\\&\quad \le (m-1)\max \limits _{k,j} \big \{\Vert C^{(k)}(W)-C^{(k)}(\widehat{W})\Vert _2+2(m-2)\Vert B^{(k,j)}(W)-B^{(k,j)}(\widehat{W})\Vert _2\big \}\Vert X^{\top }X \Vert \\&\quad \overset{(31)}{\le }2\varpi (m-1)^2 \Vert X^{\top }X \Vert \Vert W-\widehat{W}\Vert , \end{aligned}$$

which completes the proof of the Lemma. \(\square\)

Proof of Lemma 2:

For any \(W,D\in {\mathbb {R}}^{p\times {(m-1)}}\) and \(\bar{t} \in [0,1]\), there exists \(\varXi\) on the line segment between \(W\) and \(W+\bar{t}D\) such that

$$\begin{aligned} \Vert \nabla \ell (W+\bar{t}D)-\nabla \ell (W)\Vert& = \Vert \bar{t}\nabla ^2 \ell (\varXi )\mathrm{vec}(D)\Vert \quad (\text {by mean-value theorem}) \nonumber \\&\overset{(30)}{\le }\bar{t}\Vert \frac{1}{2}[I_{m-1}-{\mathbf {1}}{\mathbf {1}}^{\top }/m]\otimes X^{\top }X\Vert _2\Vert D\Vert \nonumber \\&\le \frac{\bar{t}}{2}\Vert [I_{m-1}-{\mathbf {1}}{\mathbf {1}}^{\top }/m]\Vert _2\Vert X^{\top }X\Vert _2\Vert D\Vert \nonumber \\&\le \frac{\bar{t}}{2}\lambda _{\max }( X^{\top }X)\Vert D\Vert :=\bar{t}\lambda _{{\mathrm {x}}}\Vert D\Vert . \end{aligned}$$
(32)

Let t be a scalar parameter and \(g(t)=\ell (W+tD)\). The chain rule yields

$$\begin{aligned} \frac{\partial g}{\partial t}(t)=\langle \nabla \ell (W+tD),D \rangle . \end{aligned}$$
(33)

Then, we have

$$\begin{aligned} \ell (W+D)-\ell (W)&=g(1)-g(0)=\int _{0}^{1}\frac{\partial g}{\partial t}(t)dt \overset{(33)}{=} \int _{0}^{1}\langle \nabla \ell (W+tD),D \rangle dt\\&\le \int _{0}^{1}\langle \nabla \ell (W),D \rangle dt+\Big |\int _{0}^{1}\langle \nabla \ell (W+tD)-\nabla \ell (W),D \rangle dt\Big |\\&\le \langle \nabla \ell (W),D \rangle +\int _{0}^{1}\Vert \nabla \ell (W+tD)-\nabla \ell (W)\Vert \Vert D\Vert dt\\&\overset{(32)}{\le } \langle \nabla \ell (W),D \rangle + \Vert D\Vert \int _{0}^{1} \lambda _{{\mathrm {x}}}t\Vert D\Vert dt\\&=\langle \nabla \ell (W),D \rangle + \frac{\lambda _{{\mathrm {x}}}}{2}\Vert D\Vert ^{2}. \end{aligned}$$

Hence, we have completed the proof. \(\square\)
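
The inequality established in Lemma 2 is the kind of quadratic upper bound typically exploited by quadratic regularization methods. The sketch below (hypothetical data; it assumes \(\ell\) is the unaveraged negative multinomial log-likelihood with class m as the reference category, consistent with the gradient and Hessian expressions used above) evaluates both sides of \(\ell (W+D)\le \ell (W)+\langle \nabla \ell (W),D \rangle +\frac{\lambda _{{\mathrm {x}}}}{2}\Vert D\Vert ^{2}\) with \(\lambda _{{\mathrm {x}}}=\frac{1}{2}\lambda _{\max }(X^{\top }X)\).

```python
import numpy as np

n, p, m = 50, 8, 4
rng = np.random.default_rng(1)
X = rng.standard_normal((n, p))
y = rng.integers(0, m, size=n)            # labels in {0,...,m-1}; label m-1 plays the reference role

def loss_grad(W):
    """Negative multinomial log-likelihood (reference class m-1) and its gradient."""
    Z = X @ W                              # Z[i, k] = w^(k)^T x_i, for k = 0,...,m-2
    logdenom = np.log1p(np.exp(Z).sum(axis=1))
    lin = np.where(y < m - 1, Z[np.arange(n), np.minimum(y, m - 2)], 0.0)
    loss = np.sum(logdenom - lin)
    P = np.exp(Z) / (1.0 + np.exp(Z).sum(axis=1))[:, None]
    mask = y < m - 1
    Y = np.zeros((n, m - 1))
    Y[np.where(mask)[0], y[mask]] = 1.0    # one-hot encoding of the non-reference labels
    return loss, X.T @ (P - Y)

W = rng.standard_normal((p, m - 1))
D = rng.standard_normal((p, m - 1))
lam_x = 0.5 * np.linalg.eigvalsh(X.T @ X).max()   # lambda_x = (1/2) lambda_max(X^T X)

f_W, g_W = loss_grad(W)
f_WD, _ = loss_grad(W + D)
upper = f_W + np.sum(g_W * D) + 0.5 * lam_x * np.sum(D ** 2)
print(f_WD <= upper + 1e-10)               # the quadratic upper bound from Lemma 2 should hold
```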

Cite this article

Wang, R., Xiu, N. & Toh, KC. Subspace quadratic regularization method for group sparse multinomial logistic regression. Comput Optim Appl 79, 531–559 (2021). https://doi.org/10.1007/s10589-021-00287-2
