We propose a fast Newton algorithm for \(\ell _0\) regularized high-dimensional generalized linear models based on support detection and root finding. We refer to the proposed method as GSDAR. GSDAR is developed based on the KKT conditions for \(\ell _0\)-penalized maximum likelihood estimators and generates a sequence of solutions of the KKT system iteratively. We show that GSDAR can be equivalently formulated as a generalized Newton algorithm. Under a restricted invertibility condition on the likelihood function and a sparsity condition on the regression coefficient, we establish an explicit upper bound on the estimation errors of the solution sequence generated by GSDAR in supremum norm and show that it achieves the optimal order in finite iterations with high probability. Moreover, we show that the oracle estimator can be recovered with high probability if the target signal is above the detectable level. These results directly concern the solution sequence generated from the GSDAR algorithm, instead of a theoretically defined global solution. We conduct simulations and real data analysis to illustrate the effectiveness of the proposed method.

We wish to thank two anonymous reviewers for their constructive and helpful comments that led to significant improvements in the paper. The work of J. Huang is supported in part by the U.S. National Science Foundation grant DMS-1916199. The work of Y. Jiao is supported in part by the National Science Foundation of China grant 11871474 and by the research fund of KLATASDSMOE of China. The research of J. Liu is supported by Duke-NUS Graduate Medical School WBS: R-913-200-098-263 and MOE2016-T2-2-029 from the Ministry of Education, Singapore. The work of Y. Liu is supported in part by the National Science Foundation of China grant 11971362. The work of X. Lu is supported by National Science Foundation of China grants 11471253 and 91630313.
In the appendix, we prove Lemma 1, Proposition 1 and Theorems 1 and 2.
1.1 Proof of Lemma 1
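Let \(L_\lambda (\varvec{\beta })={\mathcal {L}}(\varvec{\beta })+\lambda \Vert \varvec{\beta }\Vert _0\). Assume \(\widehat{\varvec{\beta }}\) is a global minimizer of \(L_\lambda (\varvec{\beta })\). Then by Theorem 10.1 in Rockafellar and Wets (2009), we have
\[\mathbf{0}\in \nabla {\mathcal {L}}(\widehat{\varvec{\beta }})+\lambda \partial \Vert \widehat{\varvec{\beta }}\Vert _{0}, \tag{7}\]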
where \(\partial \Vert \widehat{\varvec{\beta }}\Vert _{0}\) denotes the limiting subdifferential (see Definition 8.3 in Rockafellar and Wets (2009)) of \( \Vert \cdot \Vert _{0}\) at \(\widehat{\varvec{\beta }}\). Let \(\widehat{\mathbf{d}}= -\nabla {\mathcal {L}}(\widehat{\varvec{\beta }})\) and define \(G(\varvec{\beta }) = \frac{1}{2}\Vert \varvec{\beta }- (\widehat{\varvec{\beta }}+\widehat{\mathbf{d}})\Vert ^2 + \lambda \Vert \varvec{\beta }\Vert _0\). Recall that, by the definition of the limiting subdifferential (Definition 8.3 in Rockafellar and Wets (2009)), \(\partial \Vert \widehat{\varvec{\beta }}\Vert _{0}\) satisfies \(\Vert \varvec{\beta }\Vert _0\ge \Vert \widehat{\varvec{\beta }}\Vert _0+\langle \partial \Vert \widehat{\varvec{\beta }}\Vert _{0}, \varvec{\beta }-\widehat{\varvec{\beta }}\rangle +o(\Vert \varvec{\beta }-\widehat{\varvec{\beta }}\Vert )\) for any \(\varvec{\beta }\in {\mathbb {R}}^p\). Since \(\widehat{\mathbf{d}}= -\nabla {\mathcal {L}}(\widehat{\varvec{\beta }})\), (7) is equivalent to
\[\mathbf{0}\in \widehat{\varvec{\beta }}-(\widehat{\varvec{\beta }}+\widehat{\mathbf{d}})+\lambda \partial \Vert \widehat{\varvec{\beta }}\Vert _{0}.\]
Moreover, \({\widetilde{\varvec{\beta }}}\) being the minimizer of \(G(\varvec{\beta })\) is equivalent to \(\mathbf{0} \in \partial G({\widetilde{\varvec{\beta }}})\). Clearly, \(\widehat{\varvec{\beta }}\) satisfies \(\mathbf{0} \in \partial G(\widehat{\varvec{\beta }})\). Thus we deduce that \(\widehat{\varvec{\beta }}\) is a KKT point of \( G(\varvec{\beta })\). Then \(\widehat{\varvec{\beta }}=H_{\lambda }(\widehat{\varvec{\beta }}+\widehat{\mathbf{d}})\) follows from the result that the KKT points of \(G\) coincide with its coordinate-wise minimizers (Huang et al. 2021). Conversely, suppose \(\widehat{\varvec{\beta }}\) and \(\widehat{\mathbf{d}}\) satisfy (2); we show that \(\widehat{\varvec{\beta }}\) is a local minimizer of \(L_{\lambda }(\varvec{\beta })\). Take a perturbation \(\text{ h }\) that is small enough, in particular with \(\Vert \text{ h }\Vert _{\infty }<\sqrt{2\lambda }\). We show that \(L_{\lambda }(\widehat{\varvec{\beta }}+\text{ h})\ge L_{\lambda }(\widehat{\varvec{\beta }})\) by considering two cases.
First, we denote
By the definition of \(H_{\lambda }(\cdot )\) and (2), we conclude that \(|{\widehat{\beta }}_{i}|\ge \sqrt{2\lambda }\) for \(i \in {\widehat{A}}\) and \(\widehat{\varvec{\beta }}_{{\widehat{I}}}=0\), so that \(\text{ supp }(\widehat{\varvec{\beta }})={\widehat{A}}\). Moreover, we also have \(\widehat{\mathbf{d}}_{{\widehat{A}}}=[-\nabla {\mathcal {L}}(\widehat{\varvec{\beta }})]_{{\widehat{A}}}=0\), which is equivalent to \(\widehat{\varvec{\beta }}_{{\widehat{A}}}\in \underset{\varvec{\beta }_{{\widehat{A}}}}{\text{ argmin }}~\widetilde{{\mathcal {L}}}(\varvec{\beta }_{{\widehat{A}}})\).
Case 1: \(\text{ h}_{{\widehat{I}}}\ne 0\).
Because \(|{\widehat{\beta }}_{i}|\ge \sqrt{2\lambda }\) for \(i\in {{\widehat{A}}}\) and \(\Vert \text{ h }\Vert _{\infty }<\sqrt{2\lambda }\), we have
Therefore, we get
Let \(m(\text{ h})=\sum _{i=1}^{n}[c(\text{ x}_{i}^{T}(\widehat{\varvec{\beta }}+\text{ h}))-c(\text{ x}_{i}^{T}\widehat{\varvec{\beta }})]-\text{ y}^{T}\mathbf{X}\text{ h }\). Then \(m(\text{ h})\) is a continuous function of \(\text{ h }\) with \(m(\mathbf{0})=0\), so \(|m(\text{ h})|<\lambda \), and in particular \(m(\text{ h})+\lambda >0\), whenever \(\text{ h }\) is small enough. Thus the last inequality holds.
Case 2: \(\text{ h}_{{\widehat{I}}}=0\).
Since \(|{\widehat{\beta }}_{i}|\ge \sqrt{2\lambda }\) for \(i\in {{\widehat{A}}}\) and \(\Vert \text{ h}_{{\widehat{A}}}\Vert _{\infty }<\sqrt{2\lambda }\), we have
Since \(\widehat{\varvec{\beta }}_{{\widehat{A}}}\in \underset{\varvec{\beta }_{{\widehat{A}}}}{\text{ argmin }}~\widetilde{{\mathcal {L}}}(\varvec{\beta }_{{\widehat{A}}})\), the last inequality holds. In summary, \(\widehat{\varvec{\beta }}\) is a local minimizer of \(L_{\lambda }(\varvec{\beta })\). \(\square \)
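The fixed-point relation \(\widehat{\varvec{\beta }}=H_{\lambda }(\widehat{\varvec{\beta }}+\widehat{\mathbf{d}})\) used in this proof can be checked numerically. The following is a minimal Python sketch, assuming that \(H_{\lambda }\) is the elementwise hard-thresholding operator with threshold \(\sqrt{2\lambda }\) and that entries exactly at the threshold are kept (the tie rule is an assumption). The function names `hard_threshold` and `is_fixed_point` and the toy vectors are hypothetical and serve only as illustration.

```python
import numpy as np


def hard_threshold(z, lam):
    """Elementwise hard thresholding: keep z_i if |z_i| >= sqrt(2*lam), else 0.

    Keeping ties at the threshold is an assumption of this sketch.
    """
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) >= np.sqrt(2.0 * lam), z, 0.0)


def is_fixed_point(beta, d, lam, tol=1e-10):
    """Check the relation beta = H_lam(beta + d) up to a numerical tolerance."""
    beta = np.asarray(beta, dtype=float)
    d = np.asarray(d, dtype=float)
    return bool(np.allclose(hard_threshold(beta + d, lam), beta, atol=tol))


lam = 0.5  # threshold sqrt(2 * lam) = 1.0

# Toy pair: beta is supported on {0, 2} with |beta_i| >= 1, and d vanishes there.
beta_hat = np.array([1.5, 0.0, -2.0, 0.0])
d_ok = np.array([0.0, 0.3, 0.0, -0.9])   # |d_i| < 1 off the support
d_bad = np.array([0.0, 1.2, 0.0, 0.0])   # |d_1| >= 1 violates the relation

check_ok = is_fixed_point(beta_hat, d_ok, lam)
check_bad = is_fixed_point(beta_hat, d_bad, lam)
```

In the first pair the support of `beta_hat` matches the set where `|beta_hat + d_ok|` reaches the threshold, which mirrors the structure \(\text{ supp }(\widehat{\varvec{\beta }})={\widehat{A}}\) and \(\widehat{\mathbf{d}}_{{\widehat{A}}}=0\) used in the proof. The second pair breaks the relation because an off-support entry of `d_bad` exceeds the threshold.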
1.2 Proof of Proposition 1
Denote \(D^{k}=-\left( {\mathcal {H}}^{k}\right) ^{-1} F\left( \text{ w}^{k}\right) \). Then
can be recast as
Partition \(\text{ w}^{k}, D^{k}\) and \(F\left( \text{ w}^{k}\right) \) according to \(A^{k}\) and \(I^{k}\) such that
Substituting (10), (11) and \({\mathcal {H}}^{k}\) into (8), we have
It follows from (9) that
Substituting (15) into (12)–(14), we get (4) of Algorithm 1. This completes the proof. \(\square \)
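To make the alternation between support detection and root finding concrete, here is a minimal Python sketch for the Gaussian special case only. It assumes the loss \({\mathcal {L}}(\varvec{\beta })=\frac{1}{2n}\Vert \text{ y}-\mathbf{X}\varvec{\beta }\Vert ^2\), for which the root-finding step on the active set reduces to least squares; for a general GLM that step would instead require Newton-type iterations. The function name `sdar_gaussian`, the normalization by \(n\), the stopping rule and the synthetic data are assumptions of this sketch and not a transcription of Algorithm 1.

```python
import numpy as np


def sdar_gaussian(X, y, T, max_iter=20):
    """Support detection and root finding for L(beta) = ||y - X beta||^2 / (2n).

    Illustrative sketch for the Gaussian case; T is the assumed sparsity level.
    """
    n, p = X.shape
    beta = np.zeros(p)
    d = X.T @ (y - X @ beta) / n              # d = -grad L(beta)
    active_prev = None
    for _ in range(max_iter):
        # Support detection: indices of the T largest entries of |beta + d|.
        active = np.sort(np.argsort(-np.abs(beta + d))[:T])
        if active_prev is not None and np.array_equal(active, active_prev):
            break                             # active set has stabilized
        # Root finding: solve grad L = 0 restricted to the active set.
        beta = np.zeros(p)
        beta[active] = np.linalg.lstsq(X[:, active], y, rcond=None)[0]
        d = X.T @ (y - X @ beta) / n
        d[active] = 0.0                       # gradient vanishes on the active set
        active_prev = active
    return beta, active_prev


# Synthetic noiseless example with three nonzero coefficients.
rng = np.random.default_rng(0)
n, p = 1000, 20
beta_true = np.zeros(p)
beta_true[[2, 7, 11]] = [3.0, -2.0, 2.5]
X = rng.standard_normal((n, p))
y = X @ beta_true
beta_hat, support = sdar_gaussian(X, y, T=3)
```

On this well-conditioned noiseless design the detected active set is expected to stabilize at the true support within a couple of iterations, after which the least-squares step reproduces the true coefficients up to rounding error.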
1.3 Preparatory lemmas
The proofs of Theorems 1 and 2 are built on the following lemmas.
Lemma 2
Assume (C1) holds and \(\Vert \varvec{\beta }^{*}\Vert _{0}=K\le T\). Denote \(B^k = A^{k}\backslash A^{k-1}\). Then,
where \(\zeta =\frac{|B^k|}{|B^k|+|A^*\backslash A^{k-1}|}\).
Obviously, this lemma holds if \(A^{k}=A^{k-1}\) or \({\mathcal {L}}(\varvec{\beta }^k)\le {\mathcal {L}}(\varvec{\beta }^*)\). So it suffices to prove the lemma assuming \(A^{k}\ne A^{k-1}\) and \({\mathcal {L}}(\varvec{\beta }^k)>{\mathcal {L}}(\varvec{\beta }^*)\). Condition (C1) implies
By the definitions of \(A^{k}\) and \(A^*\), \(B^k\) contains the \(|B^k|\) largest elements (in absolute value) of \(\nabla {\mathcal {L}}(\varvec{\beta }^k)\), and \(\text{ supp }(\nabla {\mathcal {L}}(\varvec{\beta }^k))\cap \text{ supp }(\varvec{\beta }^*)=A^{*}\backslash A^{k-1}\). Thus, we have
In summary,
\(\square \)
Lemma 3
Assume (C1) holds with \(0<U<\frac{1}{T}\), and \(K\le T\) in Algorithm 1. Then before Algorithm 1 terminates,
where \(\xi =1-\frac{2L(1-TU)}{T(1+K)}\in (0,1)\).
Let \(\Delta ^{k}=\varvec{\beta }^k-\nabla {\mathcal {L}}(\varvec{\beta }^{k})\). Condition (C1) implies
On the one hand, by the definition of \(\varvec{\beta }^{k+1}\) and \(\nabla {\mathcal {L}}(\varvec{\beta }^{k+1})\), we have
Further, we also have
where \(a \bigvee b=\max \{a,b\}\). By the definition of \(A^k\), \(A^{k+1}\) and \(\varvec{\beta }^{k+1}\), we know that
By the definition of \(A^{k+1}\), we can conclude that
Since \(-\nabla _{A^{k+1}\backslash A^{k}}{\mathcal {L}}(\varvec{\beta }^{k+1})=\Delta _{A^{k+1}\backslash A^{k}}^{k+1}\) and \(U<\frac{1}{T}\), we can deduce that
By the definition of \(\varvec{\beta }^{k+1}\), we have
Moreover, \(\frac{|A^*\backslash A^{k-1}|}{|B^k|}\le K\). By Lemma 2, we have
Therefore, we have
where \(\xi =1-\frac{2L(1-TU)}{T(1+K)}\in (0,1)\). \(\square \)
Lemma 4
Assume \({\mathcal {L}}\) satisfies (C1) and
for all \(k\ge 0\). Then,
If \(\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }< \frac{2\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }}{L}\), then (16) holds, so we only consider the case \(\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }\ge \frac{2\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }}{L}\). On the one hand, since \({\mathcal {L}}\) satisfies (C1),
Since \(\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }\ge \frac{2\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }}{L}\), we have
Further, we can get
which is a univariate quadratic inequality in \(\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }\). Thus, by simple computation, we can get
On the other hand, since \({\mathcal {L}}\) satisfies (C1),
Then, we can get
Hence, by (17), we have
\(\square \)
Lemma 5
(Proof of Corollary 2 in Loh and Wainwright (2015)). Assume the \(x_{ij}\)'s are sub-Gaussian and \(n \gtrsim \log (p)\). Then there exist universal constants \((c_1,c_2,c_3)\) with \(0<c_i<\infty \), \(i=1,2,3\), such that
1.4 Proof of Theorem 1
By Lemma 3, we have
So the conditions of Lemma 4 are satisfied. Taking \(\varvec{\beta }^0 = \mathbf{0} \), we can get
By Lemma 5, there exist universal constants \((c_1,c_2,c_3)\), as defined in Lemma 5, such that with probability at least \(1-c_2\exp (-c_3\log (p))\), we have
Some algebra shows that
by taking \(k \ge {\mathcal {O}}(\log _{\frac{1}{\xi }} \frac{n}{\log (p)} )\) in (18). This completes the proof. \(\square \)
1.5 Proof of Theorem 2
Combining (18) with assumption (C2), some algebra shows that
if \(k>\log _{\frac{1}{\xi }} 9 (T+K)(1+\frac{U}{L})r^2\). This implies that \(A^* \subseteq A^k\). \(\square \)
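For a rough sense of scale, the contraction factor \(\xi =1-\frac{2L(1-TU)}{T(1+K)}\) from Lemma 3 and the iteration threshold above can be evaluated numerically. The Python sketch below assumes that the logarithm applies to the whole product \(9(T+K)(1+\frac{U}{L})r^2\) and takes \(L\), \(U\), \(T\), \(K\) and \(r\) as given inputs from the theorem statement. The function names and the numerical values are hypothetical and chosen only for illustration.

```python
import math


def contraction_factor(L, U, T, K):
    """Return xi = 1 - 2L(1 - TU) / (T(1 + K)), requiring 0 < U < 1/T."""
    if not (0.0 < U < 1.0 / T):
        raise ValueError("Lemma 3 assumes 0 < U < 1/T")
    xi = 1.0 - 2.0 * L * (1.0 - T * U) / (T * (1.0 + K))
    if not (0.0 < xi < 1.0):
        raise ValueError("xi must lie in (0, 1)")
    return xi


def support_recovery_threshold(L, U, T, K, r):
    """Return log_{1/xi}( 9 (T + K) (1 + U/L) r^2 ), the bound on k in the proof."""
    xi = contraction_factor(L, U, T, K)
    arg = 9.0 * (T + K) * (1.0 + U / L) * r ** 2
    return math.log(arg) / math.log(1.0 / xi)


# Illustrative values only; they are not taken from the paper.
xi = contraction_factor(L=0.5, U=0.05, T=10, K=5)
k_min = support_recovery_threshold(L=0.5, U=0.05, T=10, K=5, r=2.0)
```

Because \(\xi \) approaches 1 as \(T(1+K)\) grows, the threshold grows roughly in proportion to \(T(1+K)\), while it depends only logarithmically on \(r\).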
