Abstract
We propose a fast Newton algorithm for \(\ell _0\)-regularized high-dimensional generalized linear models based on support detection and root finding, which we call GSDAR. GSDAR is derived from the KKT conditions for \(\ell _0\)-penalized maximum likelihood estimators and iteratively generates a sequence of solutions of the KKT system. We show that GSDAR can be equivalently formulated as a generalized Newton algorithm. Under a restricted invertibility condition on the likelihood function and a sparsity condition on the regression coefficients, we establish an explicit upper bound on the estimation error of the solution sequence generated by GSDAR in the supremum norm and show that it achieves the optimal order in finitely many iterations with high probability. Moreover, we show that the oracle estimator can be recovered with high probability if the target signal is above the detectable level. These results concern the solution sequence generated by the GSDAR algorithm directly, rather than a theoretically defined global solution. We conduct simulation studies and a real data analysis to illustrate the effectiveness of the proposed method.
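To make the "support detection and root finding" scheme concrete, the sketch below shows one plausible instantiation for the logistic case. It is an illustration, not the authors' code: the function names, the top-\(T\) support-detection rule, and the Newton-based restricted solver are assumptions for exposition; the precise iteration is Algorithm 1 in the main text.

```python
import numpy as np

def sigmoid(t):
    # Numerically clipped logistic function.
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30, 30)))

def nll_grad(X, y, beta):
    # Gradient of the averaged logistic negative log-likelihood.
    return X.T @ (sigmoid(X @ beta) - y) / X.shape[0]

def restricted_mle(X, y, active, n_newton=30, ridge=1e-8):
    # "Root finding": solve the score equation restricted to the active set
    # by plain Newton steps (a small ridge keeps the Hessian invertible).
    XA = X[:, active]
    b = np.zeros(active.size)
    for _ in range(n_newton):
        p_hat = sigmoid(XA @ b)
        g = XA.T @ (p_hat - y) / X.shape[0]
        H = (XA * (p_hat * (1 - p_hat))[:, None]).T @ XA / X.shape[0]
        b -= np.linalg.solve(H + ridge * np.eye(active.size), g)
    return b

def gsdar_logistic(X, y, T, max_iter=50):
    # Alternate support detection (the T largest entries of |beta - gradient|)
    # with root finding (restricted MLE on the detected support).
    n, p = X.shape
    beta = np.zeros(p)
    active_old = None
    for _ in range(max_iter):
        d = -nll_grad(X, y, beta)
        active = np.sort(np.argsort(-np.abs(beta + d))[:T])
        if active_old is not None and np.array_equal(active, active_old):
            break  # detected support has stabilized
        beta = np.zeros(p)
        beta[active] = restricted_mle(X, y, active)
        active_old = active
    return beta
```

For instance, gsdar_logistic(X, y, T=10) returns a coefficient vector supported on at most ten detected variables; the loop stops once the detected support no longer changes, mirroring the finite-iteration guarantee described above.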

References
Abramovich F, Benjamini Y, Donoho DL, Johnstone IM et al (2006) Adapting to unknown sparsity by controlling the false discovery rate. Ann Stat 34(2):584–653
Birgé L, Massart P (2001) Gaussian model selection. J Eur Math Soc 3(3):203–268
Bogdan M, Ghosh JK, Doerge R (2004) Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci. Genetics 167(2):989–999
Bogdan M, Ghosh JK, Żak-Szatkowska M (2008) Selecting explanatory variables with the modified version of the Bayesian information criterion. Qual Reliab Eng Int 24(6):627–641
Breheny P, Huang J (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann Appl Stat 5(1):232
Breheny P, Huang J (2015) Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Stat Comput 25(2):173–187
Chen J, Chen Z (2008) Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95(3):759–771
Chen J, Chen Z (2012) Extended BIC for small-n-large-p sparse GLM. Stat Sin 22:555–574
Chen X, Nashed Z, Qi L (2000) Smoothing methods and semismooth methods for nondifferentiable operator equations. SIAM J Numer Anal 38(4):1200–1216
Chen X, Ge D, Wang Z, Ye Y (2014) Complexity of unconstrained \(l_2\)-\(l_p\) minimization. Math Program 143(1–2):371–383
Dolejsi E, Bodenstorfer B, Frommlet F (2014) Analyzing genome-wide association studies with an FDR controlling modification of the Bayesian information criterion. PLoS One 9(7):e103322
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc: Ser B (Stat Methodol) 70(5):849–911
Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1
Frommlet F, Nuel G (2016) An adaptive ridge procedure for \(L_0\) regularization. PLoS One 11(2):e0148620
Frommlet F, Ruhaltinger F, Twaróg P, Bogdan M (2012) Modified versions of Bayesian information criterion for genome-wide association studies. Comput Stat Data Anal 56(5):1038–1051
Frommlet F, Bogdan M, Ramsey D (2016) Phenotypes and genotypes. Springer, Berlin
Ge J, Li X, Jiang H, Liu H, Zhang T, Wang M, Zhao T (2019) Picasso: A sparse learning library for high dimensional data analysis in R and Python. J Mach Learn Res 20(44):1–5
Huang J, Jiao Y, Liu Y, Lu X (2018) A constructive approach to \(\ell _0\) penalized regression. J Mach Learn Res 19(1):403–439
Huang J, Jiao Y, Jin B, Liu J, Lu X, Yang C (2021) A unified primal dual active set algorithm for nonconvex sparse recovery. Stat Sci (to appear). arXiv:1310.1147
Jiao Y, Jin B, Lu X (2017) Group sparse recovery via the \(\ell ^0(\ell ^2)\) penalty: theory and algorithm. IEEE Trans Signal Process 65(4):998–1012
Li X, Yang L, Ge J, Haupt J, Zhang T, Zhao T (2017) On quadratic convergence of DC proximal Newton algorithm in nonconvex sparse learning. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30, Curran Associates, Inc
Loh P-L, Wainwright MJ (2015) Regularized m-estimators with nonconvexity: statistical and algorithmic theory for local optima. J Mach Learn Res 16:559–616
Louizos C, Welling M, Kingma DP (2018) Learning sparse neural networks through \(L_0\) regularization. In: International conference on learning representations, pp 1–13. URL https://openreview.net/forum?id=H1Y8hhg0b
Luo S, Chen Z (2014) Sequential Lasso cum EBIC for feature selection with ultra-high dimensional feature space. J Am Stat Assoc 109(507):1229–1240
Ma R, Miao J, Niu L, Zhang P (2019) Transformed \(\ell _1\) regularization for learning sparse deep neural networks. URL https://arxiv.org/abs/1901.01021
McCullagh P (2019) Generalized linear models. Routledge, Oxfordshire
Meier L, Van De Geer S, Bühlmann P (2008) The group lasso for logistic regression. J R Stat Soc: Ser B (Stat Methodol) 70(1):53–71
Natarajan BK (1995) Sparse approximate solutions to linear systems. SIAM J Comput 24(2):227–234
Nelder JA, Wedderburn RW (1972) Generalized linear models. J R Stat Soc: Ser A (General) 135(3):370–384
Nesterov Y (2013) Gradient methods for minimizing composite functions. Math Program 140(1):125–161
Park MY, Hastie T (2007) L1-regularization path algorithm for generalized linear models. J R Stat Soc: Ser B (Stat Methodol) 69(4):659–677
Qi L (1993) Convergence analysis of some algorithms for solving nonsmooth equations. Math Oper Res 18(1):227–244
Qi L, Sun J (1993) A nonsmooth version of Newton’s method. Math Program 58(1–3):353–367
Rockafellar RT, Wets RJ-B (2009) Variational analysis. Springer Science & Business Media, Berlin
Scardapane S, Comminiello D, Hussain A, Uncini A (2017) Group sparse regularization for deep neural networks. Neurocomputing 241:81–89
Schwarz G et al (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Shen J, Li P (2017) On the iteration complexity of support recovery via hard thresholding pursuit. In: Proceedings of the 34th international conference on machine learning-volume 70, pp 3115–3124
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc: Ser B (Methodological) 58(1):267–288
Van de Geer SA et al (2008) High-dimensional generalized linear models and the lasso. Ann Stat 36(2):614–645
Wang L, Kim Y, Li R (2013) Calibrating non-convex penalized regression in ultra-high dimension. Ann Stat 41(5):2505
Wang R, Xiu N, Zhou S (2019) Fast Newton method for sparse logistic regression. arXiv preprint arXiv:1901.02768
Wang Z, Liu H, Zhang T (2014) Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Ann Stat 42(6):2164
Ye F, Zhang C-H (2010) Rate minimaxity of the Lasso and Dantzig selector for the \(\ell _q\) loss in \(\ell _r\) balls. J Mach Learn Res 11(Dec):3519–3540
Yuan X-T, Li P, Zhang T (2017) Gradient hard thresholding pursuit. J Mach Learn Res 18:166
Żak-Szatkowska M, Bogdan M (2011) Modified versions of the Bayesian information criterion for sparse generalized linear models. Comput Stat Data Anal 55(11):2908–2924
Zhang C-H (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38(2):894–942
Zhang C-H, Zhang T et al (2012) A general theory of concave regularization for high-dimensional sparse estimation problems. Stat Sci 27(4):576–593
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc: Ser B (Stat Methodol) 67(2):301–320
Acknowledgements
We wish to thank two anonymous reviewers for their constructive and helpful comments that led to significant improvements in the paper. The work of J. Huang is supported in part by the U.S. National Science Foundation grant DMS-1916199. The work of Y. Jiao is supported in part by the National Science Foundation of China grant 11871474 and by the research fund of KLATASDSMOE of China. The research of J. Liu is supported by Duke-NUS Graduate Medical School WBS: R-913-200-098-263 and MOE2016-T2-2-029 from the Ministry of Education, Singapore. The work of Y. Liu is supported in part by the National Science Foundation of China grant 11971362. The work of X. Lu is supported by National Science Foundation of China Grants 11471253 and 91630313.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
In the appendix, we prove Lemma 1, Proposition 1 and Theorems 1 and 2.
1.1 Proof of Lemma 1
Proof
Let \(L_\lambda (\varvec{\beta })={\mathcal {L}}(\varvec{\beta })+\lambda \Vert \varvec{\beta }\Vert _0\). Assume \(\widehat{\varvec{\beta }}\) is a global minimizer of \(L_\lambda (\varvec{\beta })\). Then by Theorem 10.1 in Rockafellar and Wets (2009), we have
where \(\partial \Vert \widehat{\varvec{\beta }}\Vert _{0}\) denotes the limiting subdifferential (see Definition 8.3 in Rockafellar and Wets (2009)) of \( \Vert \cdot \Vert _{0}\) at \(\widehat{\varvec{\beta }}\). Let \(\widehat{\mathbf{d}}= -\nabla {\mathcal {L}}(\widehat{\varvec{\beta }})\) and define \(G(\varvec{\beta }) = \frac{1}{2}\Vert \varvec{\beta }- (\widehat{\varvec{\beta }}+\widehat{\mathbf{d}})\Vert ^2 + \lambda \Vert \varvec{\beta }\Vert _0\). Recall from Definition 8.3 in Rockafellar and Wets (2009) that the limiting subdifferential \(\partial \Vert \widehat{\varvec{\beta }}\Vert _{0}\) satisfies \(\Vert \varvec{\beta }\Vert _0\ge \Vert \widehat{\varvec{\beta }}\Vert _0+\langle \partial \Vert \widehat{\varvec{\beta }}\Vert _{0}, \varvec{\beta }-\widehat{\varvec{\beta }}\rangle +o(\Vert \varvec{\beta }-\widehat{\varvec{\beta }}\Vert )\) for any \(\varvec{\beta }\in {\mathbb {R}}^p\). Thus (7) is equivalent to
Moreover, \({\widetilde{\varvec{\beta }}}\) being a minimizer of \(G(\varvec{\beta })\) is equivalent to \(\mathbf{0} \in \partial G({\widetilde{\varvec{\beta }}})\). Clearly, \(\widehat{\varvec{\beta }}\) satisfies \(\mathbf{0} \in \partial G(\widehat{\varvec{\beta }})\), so \(\widehat{\varvec{\beta }}\) is a KKT point of \(G(\varvec{\beta })\). Then \(\widehat{\varvec{\beta }}=H_{\lambda }(\widehat{\varvec{\beta }}+\widehat{\mathbf{d}})\) follows from the fact that the KKT points of \(G\) coincide with its coordinate-wise minimizers (Huang et al. 2021). Conversely, suppose \(\widehat{\varvec{\beta }}\) and \(\widehat{\mathbf{d}}\) satisfy (2). We show that \(\widehat{\varvec{\beta }}\) is then a local minimizer of \(L_{\lambda }(\varvec{\beta })\). To this end, take a perturbation \(\mathbf{h}\) small enough that \(\Vert \mathbf{h}\Vert _{\infty }<\sqrt{2\lambda }\); we show \(L_{\lambda }(\widehat{\varvec{\beta }}+\mathbf{h})\ge L_{\lambda }(\widehat{\varvec{\beta }})\) in two cases.
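For reference in the two cases below, note that \(H_{\lambda }\) acts componentwise with threshold \(\sqrt{2\lambda }\); written out (a sketch consistent with the inequality \(|{\widehat{\beta }}_{i}|\ge \sqrt{2\lambda }\) used below, the formal definition of \(H_{\lambda }\) being given in the main text),
$$ [H_{\lambda }(\mathbf{u})]_{i}= \begin{cases} u_{i}, & |u_{i}|\ge \sqrt{2\lambda },\\ 0, & |u_{i}|< \sqrt{2\lambda }, \end{cases} \qquad i=1,\ldots ,p. $$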
First, we denote
By the definition of \(H_{\lambda }(\cdot )\) and (2), we conclude that \(|{\widehat{\beta }}_{i}|\ge \sqrt{2\lambda }\) for \(i \in {\widehat{A}}\) and \(\widehat{\varvec{\beta }}_{{\widehat{I}}}=0\). Thus \(\text{supp}(\widehat{\varvec{\beta }})={\widehat{A}}\). Moreover, we also have \(\widehat{\mathbf{d}}_{{\widehat{A}}}=[-\nabla {\mathcal {L}}(\widehat{\varvec{\beta }})]_{{\widehat{A}}}=0\), which is equivalent to \(\widehat{\varvec{\beta }}_{{\widehat{A}}}\in \underset{\varvec{\beta }_{{\widehat{A}}}}{\text{argmin}}~\widetilde{{\mathcal {L}}}(\varvec{\beta }_{{\widehat{A}}})\).
Case 1: \(\mathbf{h}_{{\widehat{I}}}\ne 0\).
Because \(|{\widehat{\beta }}_{i}|\ge \sqrt{2\lambda }\) for \(i\in {{\widehat{A}}}\) and \(\Vert \mathbf{h}\Vert _{\infty }<\sqrt{2\lambda }\), we have
Therefore, we get
Let \(m(\mathbf{h})=\sum _{i=1}^{n}[c(\mathbf{x}_{i}^{T}(\widehat{\varvec{\beta }}+\mathbf{h}))-c(\mathbf{x}_{i}^{T}\widehat{\varvec{\beta }})]-\mathbf{y}^{T}\mathbf{X}\mathbf{h}\), which is a continuous function of \(\mathbf{h}\) with \(m(\mathbf{0})=0\). Hence, for \(\mathbf{h}\) small enough with \(\Vert \mathbf{h}\Vert _{\infty }<\sqrt{2\lambda }\), we have \(m(\mathbf{h})+\lambda >0\), so the last inequality holds.
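In compact form, the Case 1 comparison reads (a sketch, using that \(m(\mathbf{h})\) is exactly the change in the likelihood term and that \(\mathbf{h}_{{\widehat{I}}}\ne 0\) adds at least one nonzero coordinate outside \(\text{supp}(\widehat{\varvec{\beta }})={\widehat{A}}\)):
$$ L_{\lambda }(\widehat{\varvec{\beta }}+\mathbf{h})-L_{\lambda }(\widehat{\varvec{\beta }}) = m(\mathbf{h})+\lambda \bigl (\Vert \widehat{\varvec{\beta }}+\mathbf{h}\Vert _{0}-\Vert \widehat{\varvec{\beta }}\Vert _{0}\bigr ) \ge m(\mathbf{h})+\lambda >0. $$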
Case 2: \(\mathbf{h}_{{\widehat{I}}}=0\).
Since \(|{\widehat{\beta }}_{i}|\ge \sqrt{2\lambda }\) for \(i\in {{\widehat{A}}}\) and \(\Vert \mathbf{h}_{{\widehat{A}}}\Vert _{\infty }<\sqrt{2\lambda }\), we have
and
Since \(\widehat{\varvec{\beta }}_{{\widehat{A}}}\in \underset{\varvec{\beta }_{{\widehat{A}}}}{\text{argmin}}~\widetilde{{\mathcal {L}}}(\varvec{\beta }_{{\widehat{A}}})\), the last inequality holds. In summary, \(\widehat{\varvec{\beta }}\) is a local minimizer of \(L_{\lambda }(\varvec{\beta })\). \(\square \)
1.2 Proof of Proposition 1
Proof
Denote \(D^{k}=-\left( {\mathcal {H}}^{k}\right) ^{-1} F\left( \mathbf{w}^{k}\right) \). Then
can be recast as
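In increment form, the generalized Newton step underlying this recasting reads (a sketch, consistent with the definition of \(D^{k}\) above):
$$ {\mathcal {H}}^{k} D^{k} = -F(\mathbf{w}^{k}), \qquad \mathbf{w}^{k+1} = \mathbf{w}^{k} + D^{k}. $$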
Partition \(\mathbf{w}^{k}\), \(D^{k}\) and \(F\left( \mathbf{w}^{k}\right) \) according to \(A^{k}\) and \(I^{k}\) such that
Substituting (10), (11) and \({\mathcal {H}}^{k}\) into (8), we have
It follows from (9) that
Substituting (15) into (12)–(14), we get (4) of Algorithm 1. This completes the proof. \(\square \)
1.3 Preparatory lemmas
The proofs of Theorems 1 and 2 are built on the following lemmas.
Lemma 2
Assume (C1) holds and \(\Vert \varvec{\beta }^{*}\Vert _{0}=K\le T\). Denote \(B^k = A^{k}\backslash A^{k-1}\). Then,
where \(\zeta =\frac{|B^k|}{|B^k|+|A^*\backslash A^{k-1}|}\).
Proof
This lemma clearly holds if \(A^{k}=A^{k-1}\) or \({\mathcal {L}}(\varvec{\beta }^k)\le {\mathcal {L}}(\varvec{\beta }^*)\), so we may assume \(A^{k}\ne A^{k-1}\) and \({\mathcal {L}}(\varvec{\beta }^k)>{\mathcal {L}}(\varvec{\beta }^*)\). Condition (C1) implies
Hence,
From the definitions of \(A^{k}\) and \(A^*\), \(B^k\) contains the \(|B^k|\) largest elements (in absolute value) of \(\nabla {\mathcal {L}}(\varvec{\beta }^k)\), and \(\text{supp}(\nabla {\mathcal {L}}(\varvec{\beta }^k))\bigcap \text{supp}(\varvec{\beta }^*)=A^{*}\backslash A^{k-1}\). Thus, we have
Therefore,
In summary,
\(\square \)
Lemma 3
Assume (C1) holds with \(0<U<\frac{1}{T}\), and \(K\le T\) in Algorithm 1. Then before Algorithm 1 terminates,
where \(\xi =1-\frac{2L(1-TU)}{T(1+K)}\in (0,1)\).
Proof
Let \(\Delta ^{k}=\varvec{\beta }^k-\nabla {\mathcal {L}}(\varvec{\beta }^{k})\). Condition (C1) implies
On the one hand, by the definition of \(\varvec{\beta }^{k+1}\) and \(\nabla {\mathcal {L}}(\varvec{\beta }^{k+1})\), we have
Further, we also have
and
where \(a \bigvee b=\max \{a,b\}\). By the definition of \(A^k\), \(A^{k+1}\) and \(\varvec{\beta }^{k+1}\), we know that
By the definition of \(A^{k+1}\), we can conclude that
Since \(-\nabla _{A^{k+1}\backslash A^{k}}{\mathcal {L}}(\varvec{\beta }^{k+1})=\Delta _{A^{k+1}\backslash A^{k}}^{k+1}\) and \(U<\frac{1}{T}\), we deduce that
By the definition of \(\varvec{\beta }^{k+1}\), we have
Moreover, \(\frac{|A^*\backslash A^{k-1}|}{|B^k|}\le K\). By Lemma 2, we have
Therefore, we have
where \(\xi =1-\frac{2L(1-TU)}{T(1+K)}\in (0,1)\). \(\square \)
Lemma 4
Assume \({\mathcal {L}}\) satisfies (C1) and
for all \(k\ge 0\). Then,
Proof
If \(\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }< \frac{2\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }}{L}\), then (16) holds, so we only need to consider the case \(\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }\ge \frac{2\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }}{L}\). On the one hand, since \({\mathcal {L}}\) satisfies (C1),
Since \(\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }\ge \frac{2\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }}{L}\), we have
Further, we can get
which is a univariate quadratic inequality in \(\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }\). Solving it, we get
On the other hand, since \({\mathcal {L}}\) satisfies (C1), we have
Then, we can get
Hence, by (17), we have
\(\square \)
Lemma 5
(Proof of Corollary 2 in Loh and Wainwright (2015)). Assume the \(x_{ij}\)'s are sub-Gaussian and \(n\gtrsim \log (p)\). Then there exist universal constants \((c_1,c_2,c_3)\) with \(0<c_i<\infty \), \(i=1,2,3\), such that
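The conclusion invoked below is the standard sub-Gaussian deviation bound for the score at \(\varvec{\beta }^{*}\), which, up to the constants \(c_1,c_2,c_3\), takes the form (stated here as a sketch; see Loh and Wainwright (2015) for the precise statement):
$$ {\mathbb {P}}\Bigl (\Vert \nabla {\mathcal {L}}(\varvec{\beta }^{*})\Vert _{\infty }\ge c_{1}\sqrt{\tfrac{\log p}{n}}\Bigr )\le c_{2}\exp \bigl (-c_{3}\log p\bigr ). $$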
1.4 Proof of Theorem 1
Proof
By Lemma 3, we have
where
So the conditions of Lemma 4 are satisfied. Taking \(\varvec{\beta }^0 = \mathbf{0} \), we can get
By Lemma 5, with the universal constants \((c_1,c_2,c_3)\) given there, we have, with probability at least \(1-c_2\exp (-c_3\log (p))\),
Some algebra shows that
by taking \(k \ge {\mathcal {O}}(\log _{\frac{1}{\xi }} \frac{n}{\log (p)} )\) in (18). This completes the proof. \(\square \)
1.5 Proof of Theorem 2
Proof
Combining (18) with assumption (C2), some algebra shows that
if \(k>\log _{\frac{1}{\xi }} 9 (T+K)(1+\frac{U}{L})r^2.\) This implies that \(A^* \subseteq A^k\). \(\square \)
Cite this article
Huang, J., Jiao, Y., Kang, L. et al. GSDAR: a fast Newton algorithm for \(\ell _0\) regularized generalized linear models with statistical guarantee. Comput Stat 37, 507–533 (2022). https://doi.org/10.1007/s00180-021-01098-z