Abstract
Sparse multinomial logistic regression has recently received widespread attention. It provides a useful tool for solving multi-class classification problems in various fields, such as signal and image processing, machine learning, and disease diagnosis. In this paper, we first study the group sparse multinomial logistic regression model and establish its optimality conditions. Based on these theoretical results, we then propose an efficient algorithm, called the subspace quadratic regularization algorithm, for computing a stationary point of a given problem. This algorithm enjoys excellent convergence properties, including global convergence and local quadratic convergence. Finally, our numerical results on standard benchmark data clearly demonstrate the superior performance of the proposed algorithm in terms of logistic loss value, sparsity recovery, and computational time.
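For concreteness, the two quantities the abstract refers to, the multinomial logistic loss and the group sparsity of the coefficient matrix, can be sketched in NumPy as follows. This is a minimal illustration under the usual convention that class \(m\) serves as the reference class with score zero; the function names are ours, not from the paper.

```python
import numpy as np

def multinomial_logistic_loss(W, X, y, m):
    """Negative log-likelihood of the multinomial logit model.

    W : (p, m-1) coefficient matrix (class m-1 in 0-based labels is the
        reference class with linear score fixed at 0).
    X : (n, p) design matrix; y : (n,) labels in {0, ..., m-1}.
    """
    n = X.shape[0]
    Z = X @ W                                        # (n, m-1) linear scores
    scores = np.hstack([Z, np.zeros((n, 1))])        # append reference-class score 0
    shift = scores.max(axis=1, keepdims=True)        # log-sum-exp shift for stability
    log_norm = shift.ravel() + np.log(np.exp(scores - shift).sum(axis=1))
    return float(np.sum(log_norm - scores[np.arange(n), y]))

def group_sparsity(W, tol=1e-12):
    """l_{2,0} group sparsity: number of nonzero rows (feature groups) of W."""
    return int(np.sum(np.linalg.norm(W, axis=1) > tol))
```

At the zero matrix all classes are equiprobable, so the loss equals \(n\log m\); `group_sparsity` counts the features selected jointly across all classes, which is the quantity the group-sparse model constrains.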








References
Bahmani, S., Raj, B., Boufounos, P.: Greedy sparsity-constrained optimization. J. Mach. Learn. Res. 14, 807–841 (2013)
Beck, A., Hallak, N.: Optimization problems involving group sparsity terms. Math. Program. 178, 39–67 (2019)
Blondel, M., Seki, K., Uehara, K.: Block coordinate descent algorithms for large-scale sparse multiclass classification. Mach. Learn. 93, 31–52 (2013)
Böhning, D.: Multinomial logistic regression algorithm. Ann. Inst. Stat. Math. 44(1), 197–200 (1992)
Breheny, P., Huang, J.: Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Stat. Comput. 25(2), 173–187 (2015)
Byrne, E., Schniter, P.: Sparse multinomial logistic regression via approximate message passing. IEEE Trans. Signal Process. 64(21), 5485–5498 (2016)
Cawley, G.C., Talbot, N.L., Girolami, M.: Sparse multinomial logistic regression via Bayesian L1 regularisation. In: Adv. Neural Inf. Process. Syst., pp. 209–216 (2007)
Chen, X.J., Pan, L.L., Xiu, N.H.: Solution sets of three sparse optimization problems for multivariate regression. Technical Report, Department of Applied Mathematics, The Hong Kong Polytechnic University (2020)
Freedman, D.A.: Statistical Models: Theory and Practice. Cambridge University Press, Cambridge (2009)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)
Harrell, F.E., Jr.: Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer, Berlin (2015)
Kanzow, C., Qi, H.D.: A QP-free constrained Newton-type method for variational inequality problems. Math. Program. 85(1), 81–106 (1999)
Krishnapuram, B., Carin, L., Figueiredo, M., Hartemink, A.: Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Trans. Pattern Anal. Mach. Intell. 27(6), 957–968 (2005)
Kwak, C., Clayton-Matthews, A.: Multinomial logistic regression. Nurs. Res. 51(6), 404–410 (2002)
Lee, J.D., Sun, Y., Saunders, M.A.: Proximal Newton-type methods for minimizing composite functions. SIAM J. Optim. 24, 1420–1443 (2014)
Li, F.F., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: 2004 Conference on Computer Vision and Pattern Recognition Workshop, IEEE (2004)
Li, H., Lin, Z.: Accelerated proximal gradient methods for nonconvex programming. In: Adv. Neural Inf. Process. Syst., pp. 379–387 (2015)
Li, J., Bioucas-Dias, J.M., Plaza, A.: Semi-supervised hyperspectral image classification using a new (soft) sparse multinomial logistic regression model. In: 2011 3rd Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), pp. 1–4. IEEE (2011)
McCullagh, P., Nelder, J.A.: Generalized Linear Models, 2nd edn. Chapman and Hall, New York (1989)
Moré, J.J., Sorensen, D.C.: Computing a trust region step. SIAM J. Sci. Stat. Comput. 4(3), 553–572 (1983)
Natarajan, B.K.: Sparse approximate solutions to linear systems. SIAM J. Comput. 24(2), 227–234 (1995)
Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006)
Obozinski, G., Taskar, B., Jordan, M.: Joint covariate selection and joint subspace selection for multiple classification problems. Stat. Comput. 20(2), 231–252 (2010)
Pang, T., Nie, F., Han, J., Li, X.: Efficient feature selection via \(l_{2,0}\)-norm constrained sparse regression. IEEE Trans. Knowl. Data Eng. 31(5), 880–893 (2019)
Rockafellar, R.T., Wets, R.J.: Variational Analysis. Springer, New York (1998)
Ryali, S., Supekar, K., Abrams, D.A., Menon, V.: Sparse logistic regression for whole-brain classification of fMRI data. NeuroImage 51(2), 752–764 (2010)
Simon, N., Friedman, J., Hastie, T.: A blockwise descent algorithm for group-penalized multiresponse and multinomial regression. arXiv preprint arXiv:1311.6529 (2013)
Sun, Y., Babu, P., Palomar, D.P.: Majorization-minimization algorithms in signal processing, communications, and machine learning. IEEE Trans. Signal Process. 65(3), 794–816 (2016)
Tutz, G., Pößnecker, W., Uhlmann, L.: Variable selection in general multinomial logit models. Comput. Stat. Data Anal. 82, 207–222 (2015)
Vincent, M., Hansen, N.R.: Sparse group lasso and high dimensional multinomial classification. Comput. Stat. Data Anal. 71, 771–786 (2014)
Wang, R., Xiu, N.H., Zhou, S.L.: An extended Newton-type algorithm for l2-regularized sparse logistic regression and its efficiency for classifying large-scale datasets. J. Comput. Appl. Math. 397, 113656 (2021)
Acknowledgements
We sincerely thank the associate editor and two referees for their detailed comments, which have helped to improve this paper. The research of Rui Wang and Naihua Xiu is partially supported by the National Natural Science Foundation of China (11971052) and the Beijing Natural Science Foundation (Z190002); the research of Kim-Chuan Toh is partially supported by the Ministry of Education of Singapore under ARF Grant Number R-146-000-257-112.
Appendix A: Proofs of Lemmas 1 and 2
Proof of Lemma 1:
Since parts (i)–(iii) of Lemma 1 follow by direct calculation, we omit their proofs and give a detailed derivation of part (iv).
Denote
If no confusion arises, we use the simple notation \(A=A(W),\ B=B(W),\ C=C(W)\). Since \(A^{(k)}=C^{(k)}-\sum _{j=1,\, j\ne k}^{m-1}B^{(k,j)}\) for \(k=1,2,\ldots ,m-1\), we have
Therefore, we can rewrite the Hessian matrix as
For any \(z=(z_1; z_2;\ldots ;z_{m-1}) \in {\mathbb {R}}^{p(m-1)}\) with \(z_i \in {\mathbb {R}}^{p},\ i=1,2,\ldots ,(m-1)\), we have
which means that \(\nabla ^{2}\ell (W)\succeq {\mathbf {0}}\).
To verify the second inequality of (3), we invoke the result of Böhning [4], who showed that the Hessian matrix is bounded above by a positive definite matrix that does not depend on \(W\), i.e.,
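For reference, a standard statement of Böhning's bound (our reconstruction of the classical result from [4] in the notation of this paper, with \(m\) classes and design matrix \(X\); the paper's own display is not reproduced here) reads

```latex
\nabla^{2}\ell(W)\ \preceq\
\frac{1}{2}\Bigl(I_{m-1}-\tfrac{1}{m}\,\mathbf{1}_{m-1}\mathbf{1}_{m-1}^{\top}\Bigr)
\otimes X^{\top}X ,
```

where \(\mathbf{1}_{m-1}\) is the all-ones vector and \(\otimes\) denotes the Kronecker product; the right-hand side is a fixed positive definite matrix, as required.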
We next prove the Lipschitz continuity of the Hessian matrix. Denote
where \(\mathbf{z}=(\mathbf{z}^{1},\mathbf{z}^{2},\ldots ,\mathbf{z}^{m-1})^{\top }=W^{\top }x_i \in {\mathbb {R}}^{m-1}\) with \(\mathbf{z}^k={w^{(k)}}^{\top }x_i\) for \(i \in \{1,2,\ldots ,n\}\). We first consider the following two cases.
Case 1. If \(j\ne k\),
Case 2. If \(j= k\),
Hence, \(\Vert \nabla h(\mathbf{z } )\Vert \le 2\sqrt{m-1}.\) Then by the mean value theorem, there exists \(\theta \in (0,1)\) such that for any \(i \in \{1,2,\ldots ,n\}\),
where \(\varpi := 2\sqrt{m-1}\Vert X\Vert _\infty\). Similarly, we have
Notice that for any \(k,j \in \{1,2,\ldots ,m-1\}\),
Therefore,
which completes the proof of the lemma. \(\square\)
Proof of Lemma 2:
For any \(W,D\in {\mathbb {R}}^{p\times {(m-1)}}\), there exists \(\bar{t} \in [0,1]\) such that, with \(\varXi = W+\bar{t}(D-W)\),
Let t be a scalar parameter and \(g(t)=\ell (W+tD)\). The chain rule yields
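As a sketch, the chain-rule computation for \(g(t)=\ell (W+tD)\) takes the standard form (our reconstruction from elementary calculus, not the paper's display):

```latex
g'(t)=\bigl\langle \nabla \ell (W+tD),\, D\bigr\rangle ,\qquad
g''(t)=\bigl\langle D,\, \nabla ^{2}\ell (W+tD)\, D\bigr\rangle ,
```

so that the second-order bound on \(\ell\) follows from the bound on \(\nabla ^{2}\ell\) established in Lemma 1.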
Then, we have
This completes the proof. \(\square\)
Cite this article
Wang, R., Xiu, N. & Toh, KC. Subspace quadratic regularization method for group sparse multinomial logistic regression. Comput Optim Appl 79, 531–559 (2021). https://doi.org/10.1007/s10589-021-00287-2