
Communication-efficient estimation for distributed subset selection

  • Original Paper · Statistics and Computing

Abstract

Owing to the large scale of both sample size and dimension, modern data are usually stored in distributed systems, which poses unprecedented challenges for computation and statistical inference. Best subset selection is widely regarded as a benchmark method for handling high-dimensional data, yet efficient algorithms for best subset selection in distributed systems remain largely unexplored. To this end, we propose a new communication-efficient method for best subset selection in a distributed system. The proposed method restricts the information communicated among local machines to a moderate active set, which yields both efficient computation and a low communication cost across the network. Moreover, we propose a new generalized information criterion for tuning the sparsity level on the central machine. Under mild conditions, we establish the estimation and variable selection consistency of the proposed estimator. We demonstrate the superiority of the proposed method through several numerical studies and a real data application in adolescent health.


References

  • Battey, H., Fan, J., Liu, H., Lu, J., Zhu, Z.: Distributed testing and estimation under sparse high dimensional models. Ann. Stat. 46(3), 1352–1382 (2018)

  • Bertsimas, D., King, A., Mazumder, R.: Best subset selection via a modern optimization lens. Ann. Stat. 44(2), 813–852 (2016)

  • Burrows, M.: The chubby lock service for loosely-coupled distributed systems. In: 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2006)

  • Cai, T., Liu, W., Luo, X.: A constrained \(\ell _1\) minimization approach to sparse precision matrix estimation. J. Am. Stat. Assoc. 106(494), 594–607 (2011)

  • Candes, E.J., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)

  • DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: Amazon’s highly available key-value store. ACM SIGOPS Oper. Syst. Rev. 41(6), 205–220 (2007)

  • Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–499 (2004)

  • Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)

  • Fan, Y., Lv, J.: Asymptotic properties for combined \(\ell _1\) and concave regularization. Biometrika 101(1), 57–70 (2014)

  • Fan, Y., Lv, J.: Innovated scalable efficient estimation in ultra-large Gaussian graphical models. Ann. Stat. 44(5), 2098–2126 (2016)

  • Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)

  • Huang, J., Jiao, Y., Liu, Y., Lu, X.: A constructive approach to \(l_0\) penalized regression. J. Mach. Learn. Res. 19(10), 1–37 (2018)

  • Jordan, M.I., Lee, J., Yang, Y.: Communication-efficient distributed statistical inference. J. Am. Stat. Assoc. 114, 668–681 (2018)

  • Lee, J.D., Liu, Q., Sun, Y., Taylor, J.E.: Communication-efficient sparse regression. J. Mach. Learn. Res. 18(1), 115–144 (2017)

  • Mullan, K.H.: The National Longitudinal Study of Adolescent to Adult Health (Add Health), Waves I & II, 1994–1996; Wave III, 2001–2002; Wave IV, 2007–2009 [machine-readable data file and documentation]. Carolina Population Center, University of North Carolina at Chapel Hill, Chapel Hill, NC (2009)

  • Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 58(1), 267–288 (1996)

  • Toshniwal, A., Taneja, S., Shukla, A., Ramasamy, K., Patel, J.M., Kulkarni, S., Jackson, J., Gade, K., Fu, M., Donham, J., et al.: Storm@Twitter. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 147–156 (2014)

  • Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. In: Compressed Sensing (2012)

  • Wang, J., Kolar, M., Srebro, N., Zhang, T.: Efficient distributed learning with sparsity. In: Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 3636–3645 (2017)

  • Wen, C., Zhang, A., Quan, S., Wang, X.: BeSS: an R package for best subset selection in linear, logistic and Cox proportional hazards models. J. Stat. Softw. 94(4), 1–24 (2020)

  • Zhang, C.-H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)

  • Zhang, C.-H., Huang, J.: The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann. Stat. 36(4), 1567–1594 (2008)

  • Zhang, Y., Duchi, J.C., Wainwright, M.J.: Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res. 14(1), 3321–3363 (2013)

  • Zheng, Z., Fan, Y., Lv, J.: High dimensional thresholded regression and shrinkage effect. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 76(3), 627–649 (2014)

  • Zheng, Z., Bahadori, M.T., Liu, Y., Lv, J.: Scalable interpretable multi-response regression via seed. J. Mach. Learn. Res. 20(107), 1–34 (2019)

  • Zhu, J., Wen, C., Zhu, J., Zhang, H., Wang, X.: A polynomial algorithm for best-subset selection problem. Proc. Natl. Acad. Sci. 117(52), 33117–33123 (2020)

  • Zhu, X., Li, F., Wang, H.: Least-square approximation for a distributed system. J. Comput. Graph. Stat. 30(4), 1004–1018 (2021)

  • Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)

Acknowledgements

Wen’s research is partially supported by National Science Foundation of China (Grant 12171449), Fundamental Research Funds for the Central Universities (Grant WK3470000027), and USTC Research Funds of the Double First-Class Initiative (Grant YD2040002019). Dong’s research is supported by China Postdoctoral Science Foundation (Grant 2023M733402).

Author information

Authors and Affiliations

Authors

Contributions

Yan Chen, Ruipeng Dong and Canhong Wen wrote the main manuscript text. Yan Chen prepared the simulation results and data analysis.

Corresponding author

Correspondence to Ruipeng Dong.

Ethics declarations

Competing interests

The authors have no relevant financial or non-financial interests to disclose, and no competing interests or affiliations relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proofs of main results

To ease the presentation, we first define some notation. On the k-th distributed machine, \({\mathcal {A}}_{{k}}^l\) and \({\mathcal {I}}_{{k}}^l\) denote the active set and the inactive set in the l-th iteration of the ABESS algorithm. The true active set is \({\mathcal {A}}^{*}\) and the true inactive set is \({\mathcal {I}}^{*}\). Let

$$\begin{aligned} \begin{array}{l} {\mathcal {A}}_{{k},1}^l={\mathcal {A}}_{{k}}^l \cap {\mathcal {A}}^{*},{\mathcal {A}}_{{k},2}^l ={\mathcal {A}}^l_{k}\cap {\mathcal {I}}^{*}, \\ {\mathcal {I}}_{{k},1}^l ={\mathcal {I}}_{{k}}^l \cap {\mathcal {A}}^{*},{\mathcal {I}}_{{k},2}^l ={\mathcal {I}}_{{k}}^l \cap {\mathcal {I}}^{*}. \end{array} \end{aligned}$$

In the l-th iteration, \({\mathcal {A}}_{{k},1}^l\) collects the correctly selected active variables, \({\mathcal {A}}^l_{{k},2}\) the incorrectly selected active variables, \({\mathcal {I}}^l_{{k},1}\) the missed active variables, and \({\mathcal {I}}^l_{{k},2}\) the correctly excluded inactive variables. We denote the true coefficient vector by \(\varvec{\beta }^*\), write \(\bar{\varvec{\beta }}^*=\varvec{\beta }^*_{{\mathcal {A}}}\), and denote the l-th iterate on machine k by \(\varvec{\beta }^l_{k}\).
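The partition above is simply the confusion-matrix decomposition of the variables selected on machine k. As a toy illustration in Python (the indices below are arbitrary and purely for exposition):

```python
# Toy illustration of the four index sets defined above (arbitrary indices).
A_k_l = {1, 4, 6}            # active set on machine k at iteration l
A_star = {1, 2, 6}           # true active set
all_vars = set(range(8))     # all variable indices
I_k_l = all_vars - A_k_l     # inactive set on machine k
I_star = all_vars - A_star   # true inactive set

A_k1 = A_k_l & A_star        # correctly selected active variables (true positives)
A_k2 = A_k_l & I_star        # incorrectly selected variables (false positives)
I_k1 = I_k_l & A_star        # missed active variables (false negatives)
I_k2 = I_k_l & I_star        # correctly excluded variables (true negatives)
```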

We first present one additional lemma; its proof is provided in Appendix A.4.

Lemma 2

Suppose \((\hat{\varvec{\beta }}_{k},\hat{{\mathcal {A}}}_{k})\) is the output of the k-th worker of ABESS with \(s\geqslant s^*\), for \({{k}}=1,2,\cdots ,K\), and assume Conditions 1, 2, 4, and 5 hold. Denote \(\gamma _{{s}}(n,p)=O(p\exp (-\frac{C_snb^*}{s}))\), where \(C_s\) is a constant that depends on s. Then, with probability \(1-O(p^{-c_0})-\gamma _{{s}}(n,p)\), we have

$$\begin{aligned} \begin{aligned} \Vert \hat{\varvec{\beta }}_{{k}}-\varvec{\beta }^{*}\Vert _{2}&\leqslant O\left( \frac{\sqrt{s\log p}}{\sqrt{n}}\right) . \end{aligned} \end{aligned}$$
(11)

A.1 Proof of Lemma 1

Proof

On each distributed machine, by Lemma 1 of Zhu et al. (2020), we can bound the probability of missing part of the true active set:

$$\begin{aligned} \textrm{P}(\hat{{\mathcal {A}}}_{{k}} \nsupseteq {\mathcal {A}}^{*}) < \gamma _{s}(n, p). \end{aligned}$$

Now consider the probability of missing elements on the central machine:

$$\begin{aligned} \begin{aligned} \textrm{P}({\mathcal {A}} \nsupseteq {\mathcal {A}}^{*})&= \textrm{P}\left( \bigcup \limits _{{{k}}=1}^{K}\hat{{\mathcal {A}}}_{{k}} \nsupseteq {\mathcal {A}}^{*}\right) . \end{aligned} \end{aligned}$$

Since the machines hold independent samples, the events of missing part of the true active set are independent across machines. It follows that

$$\begin{aligned} \begin{aligned}&\textrm{P}({\mathcal {A}} \nsupseteq {\mathcal {A}}^{*})\\&\leqslant \textrm{P}(\hat{{\mathcal {A}}}_1 \nsupseteq {\mathcal {A}}^{*})\textrm{P}(\hat{{\mathcal {A}}}_2 \nsupseteq {\mathcal {A}}^{*})\cdots \textrm{P}(\hat{{\mathcal {A}}}_K \nsupseteq {\mathcal {A}}^{*})\\&<\gamma _{s}(n, p)^K\\&=O\left( p^K\,\exp {\left[ -\frac{C_sKnb^*}{{s}}\right] }\right) \\&=O\left( p^K\,\exp {\left[ -\frac{C_sNb^*}{{s}}\right] }\right) , \end{aligned} \end{aligned}$$

where \(C_{s}\) is a constant depending on s; the last equality follows from \(N=nK\). Define \(\gamma _{{s},K}(N,p)=O\left( p^K\exp \left( -\frac{C_{s}Nb^*}{{s}}\right) \right) \). Then, with probability at least \(1-\gamma _{{s},K}(N, p)\), we have

$$\begin{aligned} {{\mathcal {A}}}=\bigcup \limits _{{{k}}=1}^{K}{\hat{{\mathcal {A}}}_{{k}}}\supseteq {\mathcal {A}}^{*}. \end{aligned}$$

\(\square \)
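As an illustration of the central-machine step established in Lemma 1, a minimal Python sketch of pooling the workers' active sets is given below; the function name and the list-of-arrays interface are hypothetical, not the authors' implementation.

```python
import numpy as np

def central_union_of_active_sets(local_active_sets):
    """Pool the active sets returned by the K workers into one candidate set,
    i.e. form the union appearing in Lemma 1 (hypothetical interface)."""
    pooled = set()
    for A_k in local_active_sets:
        pooled |= set(int(j) for j in A_k)
    return np.array(sorted(pooled))

# Example: three workers each report a small active set.
A_hat = central_union_of_active_sets(
    [np.array([0, 3, 7]), np.array([3, 7, 9]), np.array([0, 7, 12])]
)
# A_hat == array([0, 3, 7, 9, 12]); by Lemma 1 this union contains the true
# active set with high probability.
```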

A.2 Proof of Theorem 1

Proof

The conditions of Theorem 1 imply those of Lemma 1. Therefore, with probability \(1- \gamma _{{s},K}(N,p)\),

$$\begin{aligned} {{\mathcal {A}}}=\bigcup \limits _{{{k}}=1}^{K}{\hat{{\mathcal {A}}}_{{k}}}\supseteq {\mathcal {A}}^{*}. \end{aligned}$$

By the construction of the DC-DBESS algorithm, it holds that

$$\begin{aligned} \hat{\varvec{\beta }}_{d}=\frac{1}{K}\sum \limits _{{{k}}=1}^K \hat{\varvec{\beta }}_{{k}}+\frac{1}{N}\sum \limits _{{{k}}=1}^K\varvec{M}_{{k}} \hat{ \varvec{X}}_{{k}}^T(\varvec{y}_{{k}}-\hat{\varvec{X}}_{{k}} \hat{\varvec{\beta }}_{{k}}). \end{aligned}$$
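Before analyzing this aggregation, here is a minimal numerical sketch of the displayed formula, assuming for illustration that \(\varvec{M}_{{k}}\) is taken as the inverse local Gram matrix \((\hat{\varvec{X}}_{{k}}^T\hat{\varvec{X}}_{{k}}/n)^{-1}\); this choice of \(\varvec{M}_{{k}}\) and the function interface are assumptions made only for the sketch.

```python
import numpy as np

def debiased_average(beta_locals, X_locals, y_locals):
    """One-shot aggregation of local estimators restricted to the pooled
    active set:
        beta_d = (1/K) * sum_k beta_k
                 + (1/N) * sum_k M_k X_k^T (y_k - X_k beta_k),
    with M_k = (X_k^T X_k / n)^{-1} used here purely for illustration."""
    K = len(beta_locals)
    n = X_locals[0].shape[0]            # local sample size (equal across machines)
    N = n * K                           # total sample size
    beta_bar = sum(beta_locals) / K     # plain average of local estimates
    correction = np.zeros_like(beta_bar)
    for beta_k, X_k, y_k in zip(beta_locals, X_locals, y_locals):
        M_k = np.linalg.inv(X_k.T @ X_k / n)
        correction += M_k @ X_k.T @ (y_k - X_k @ beta_k)
    return beta_bar + correction / N
```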

The aggregation combines the local estimates with local gradient (debiasing) information. We now analyze the convergence rate of Algorithm 1. Write

$$\begin{aligned} \begin{array}{l} \hat{\varvec{\beta }}_{d}-\bar{\varvec{\beta }}^*=\frac{1}{K}\sum \limits _{{{k}}=1}^K(\hat{\varvec{\beta }}_{{k}}-\bar{\varvec{\beta }}^*) +\frac{1}{N}\sum \limits _{{{k}}=1}^K\varvec{M}_{{k}} \hat{\varvec{X}}_{{k}}^T( \varvec{y}_{{k}}-\hat{\varvec{X}}_{{k}}\hat{\varvec{\beta }}_{{k}}). \end{array} \end{aligned}$$

Under the conditions of Lemma 1, with probability \(1-\gamma _{{s},K}(N,p)\) we obtain \({{\mathcal {A}}}=\bigcup \limits _{{{k}}=1}^{K}{\hat{{\mathcal {A}}}_{{k}}}\supseteq {\mathcal {A}}^{*}\), and hence \( \varvec{y}_{{k}}=\hat{ \varvec{X}}_{{k}}\bar{\varvec{\beta }}^*+\varvec{\epsilon }_{{k}}\). Thus,

$$\begin{aligned} \begin{aligned} \hat{\varvec{\beta }}_{d}-\bar{\varvec{\beta }}^*&=\frac{1}{K}\sum \limits _{{{k}}=1}^K(\hat{\varvec{\beta }}_{{k}} -\bar{\varvec{\beta }}^*)+\frac{1}{N}\sum \limits _{{{k}}=1}^K\varvec{M}_{{k}}\hat{\varvec{X}}_{{k}}^T\hat{\varvec{X}}_{{k}}(\bar{\varvec{\beta }}^*-\hat{\varvec{\beta }}_{{k}})\\&+\frac{1}{N}\sum \limits _{{{k}}=1}^K\varvec{M}_{{k}} \hat{\varvec{X}}_{{k}}^T\varvec{\epsilon }_{{k}}. \end{aligned} \end{aligned}$$

We note \(\hat{\varvec{\Sigma }}_{{k}}=\hat{\varvec{X}}_{{{k}},{\mathcal {A}}}^T\hat{\varvec{X}}_{{{k}},{\mathcal {A}}}/n\) and define the bias \(\hat{\Delta }_{{k}}=(\varvec{M}_{{k}}\hat{\varvec{\Sigma }}_{{k}}-\varvec{I})(\bar{\varvec{\beta }}^*-\hat{\varvec{\beta }}_{{k}})\). Then, it follows that

$$\begin{aligned} \begin{array}{l} \hat{\varvec{\beta }}_{d}-\bar{\varvec{\beta }}^*=\frac{1}{K}\sum \limits _{{{k}}=1}^K\hat{\Delta }_{{k}}+\frac{1}{N}\sum \limits _{{{k}}=1}^K\varvec{M}_{{{k}}} \hat{\varvec{X}}_{{{k}}}^T \varvec{\epsilon }_{{k}}. \end{array} \end{aligned}$$

Denote \(\hat{\varvec{D}}_{{{k}}}=\varvec{M}_{{{k}}} \hat{\varvec{X}}_{{{k}}}^T\in {\mathbb {R}}^{m\times n}\), where m is the number of rows. Stacking \(\hat{\varvec{D}}=(\hat{\varvec{D}}_1,\hat{\varvec{D}}_2,\cdots ,\hat{\varvec{D}}_K)\in {\mathbb {R}}^{m\times N}\) and \(\varvec{\epsilon }=(\varvec{\epsilon }_1^T,\varvec{\epsilon }_2^T,\cdots ,\varvec{\epsilon }_K^T)^T\in {\mathbb {R}}^{N}\), we can write

$$\begin{aligned} \frac{1}{N}\sum \limits _{{{{k}}}=1}^K\varvec{M}_{{{k}}} \hat{\varvec{X}}_{{{k}}}^T\varvec{\epsilon }_{{{k}}}=\frac{1}{N}\hat{\varvec{D}}\varvec{\epsilon }. \end{aligned}$$

By Condition 1 and Proposition 5.16 of Vershynin (2012), we have

$$\begin{aligned} \left\| \frac{1}{N}\sum \limits _{{{{k}}}=1}^K \varvec{M}_{{{k}}} \hat{\varvec{X}}_{{{k}}}^T \varvec{\epsilon }_{{{k}}}\right\| _{\infty }=\Vert \frac{1}{N}\hat{\varvec{D}} \varvec{\epsilon }\Vert _{\infty } =O\left( \sqrt{\frac{\log m}{N}}\right) . \end{aligned}$$
(12)

Next, we consider the order of the bias \(\hat{\Delta }_{{k}}\).

$$\begin{aligned}\begin{aligned} \Vert \hat{\Delta }_{{k}}\Vert _{\infty }&=\Vert (\varvec{M}_{{{k}}} \hat{\varvec{\Sigma }}_{{{k}}}-\varvec{I})(\hat{\varvec{\beta }}_{{{k}}}-\varvec{\beta }^{*})\Vert _{\infty }\\&\leqslant \max _{j \in [m]}\Vert {\varvec{M}_{{{k}},j}}\hat{\varvec{\Sigma }}_{{{k}},j}-e_j\Vert _{\infty } \Vert \hat{\varvec{\beta }}_{{{k}}}-\varvec{\beta }^{*}\Vert _{1}. \end{aligned} \end{aligned}$$

Using the techniques in Cai et al. (2011) and Fan and Lv (2016), we have

$$\begin{aligned} \max _{j \in [m]}\Vert {\varvec{M}_{{{k}},j}}\hat{\varvec{\Sigma }}_{{{k}},j}-e_j\Vert _{\infty } \leqslant C\left( \sqrt{\frac{\log m}{n}}\right) . \end{aligned}$$

By the equivalence of the norms and Lemma 2, it follows that

$$\begin{aligned} \Vert \hat{\varvec{\beta }}_{{{k}}}-\varvec{\beta }^{*}\Vert _{1}\leqslant \sqrt{2s}\Vert \hat{\varvec{\beta }}_{{{k}}}-\varvec{\beta }^{*}\Vert _{2} \leqslant C\sqrt{\frac{{s^2\log p}}{{n}}}. \end{aligned}$$

Therefore,

$$\begin{aligned} \begin{aligned} \Vert \hat{\Delta }_{{k}}\Vert _{\infty }&\leqslant C\left( \frac{\sqrt{s^2\log p \log m}}{n}\right) . \end{aligned} \end{aligned}$$
(13)

Thus, combining (12) and (13), we obtain the convergence rate of the algorithm. With probability \(1-O(Kp^{-c_0})-K\gamma _{{s}}(n,p)\), we have

$$\begin{aligned}\begin{aligned}&\Vert \hat{\varvec{\beta }}_{d}-\bar{\varvec{\beta }}^{*}\Vert _{\infty } \\&=\left\| \frac{1}{K} \sum _{{{k}}=1}^{K} \hat{\Delta }_{{{k}}}+\frac{1}{N} \sum _{{{k}}=1}^{K} \varvec{M}_{{{k}}} \hat{\varvec{X}}_{{{k}}}^{T} \varvec{\epsilon }_{{{k}}}\right\| _{\infty } \\&\leqslant \frac{1}{K} \sum _{{{k}}=1}^{K}\left\| \hat{\Delta }_{{{k}}}\right\| _{\infty }+\frac{1}{N} \sum _{{{k}}=1}^{K}\left\| \varvec{M}_{{{k}}} \hat{\varvec{X}}_{{{k}}}^{T} \varvec{\epsilon }_{{{k}}}\right\| _{\infty } \\&\leqslant C\left( K \frac{{s}\sqrt{\log p\log m}}{N}+\sqrt{\frac{\log m}{N}}\right) . \end{aligned} \end{aligned}$$

By \(\lambda =c_1 \sqrt{\log m/N}\) for some sufficiently large constant \(c_1\), we have

$$\begin{aligned}\begin{aligned} \Vert \hat{\varvec{\beta }}-\varvec{\beta }^{*}\Vert _{\infty }&\leqslant {\Vert \hat{\varvec{\beta }}_{d}-\bar{\varvec{\beta }}^{*}\Vert _{\infty }+\Vert \hat{\varvec{\beta }}_{d}-\hat{\varvec{\beta }}_{{\mathcal {A}}}\Vert _{\infty }} \\ {}&\leqslant C\left( K \frac{{s}\sqrt{\log p\log m}}{N}+\sqrt{\frac{\log m}{N}}\right) . \end{aligned} \end{aligned}$$

\(\square \)

A.3 Proof of Theorem 2

Proof

We omit some details of the proof and only outline its main steps; the full argument follows Theorem 4 of Zhu et al. (2020). As in the proof of Theorem 1, the conditions of Theorem 2 imply those of Lemma 1. With probability \(1-\gamma _{{s},K}(N,p)\), we can obtain

$$\begin{aligned} {{\mathcal {A}}}=\bigcup \limits _{{{k}}=1}^{K}{\hat{{\mathcal {A}}}_{{k}}}\supseteq {\mathcal {A}}^{*}. \end{aligned}$$

Let \(\hat{s}\) denote the cardinality of \(\hat{{\mathcal {A}}}\). For any \(\hat{s}\geqslant s^*\), combined with the minimal signal strength \(b^*\geqslant c_2(\frac{\log m}{N})^{\frac{1}{2}}\) for some sufficiently large constant \(c_2\), we have

$$\begin{aligned} \begin{aligned} \textrm{P}(\hat{{\mathcal {A}}} \supseteq {\mathcal {A}}^{*})\geqslant 1-\gamma . \end{aligned} \end{aligned}$$

Denote \({\mathcal {L}}_{\hat{{\mathcal {A}}}}=\frac{1}{2N}\sum _{{{k}}=1}^K\Vert \varvec{{y}}_{{k}} -\varvec{\hat{X}}_{{k}}\varvec{\hat{\beta }}\Vert ^2\) and \({\mathcal {L}}_{{{\mathcal {A}}}^*}=\frac{1}{2N}\sum _{{{k}}=1}^K\Vert \varvec{{y}}_{{k}} -\varvec{\hat{X}}_{{k}}\varvec{{\beta }}^*\Vert ^2\). For a sufficiently large N, we have

$$\begin{aligned} \begin{aligned}&\text {DGIC}(\vert {\mathcal {A}}^*\vert )- \text {DGIC}(\vert \hat{{\mathcal {A}}}\vert )\\&=N\log {\left( {{\mathcal {L}}_{{{\mathcal {A}}}^*}}/{{\mathcal {L}}_{\hat{{\mathcal {A}}}}}\right) }-(\hat{s}-s^*)K\log (m)\log \log N \\ {}&\leqslant O\left( (\hat{s}-s^*)\log (2m)\right) -(\hat{s}-s^*)K\log (m)\log \log N \\ {}&<0. \end{aligned} \end{aligned}$$

On the other hand, if \(\hat{s}< s^*\), then by Conditions 2 and 3, for a sufficiently large N, we have

$$\begin{aligned} \begin{aligned}&\text {DGIC}(\vert \hat{{\mathcal {A}}}\vert )-\text {DGIC}(\vert {\mathcal {A}}^*\vert )\\&=N\log {\left( {{\mathcal {L}}_{\hat{{\mathcal {A}}}}}/{{\mathcal {L}}_{{{\mathcal {A}}}^*}}\right) }+(\hat{s}-s^*)K\log (m)\log \log N \\ {}&\geqslant N O\left( \min \{1,(s^*-\hat{s})b^*\}\right) -(s^*-\hat{s})K\log (m)\log \log N \\ {}&>0. \end{aligned} \end{aligned}$$

Thus, with probability \(1-O(p^{-c_0})-\gamma _{{s},K}(N,p)\), we deduce that

$$\begin{aligned} {{\hat{{\mathcal {A}}}}}={\mathcal {A}}^*. \end{aligned}$$

\(\square \)
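For concreteness, the DGIC differences used in the proof above are consistent with evaluating, on the central machine, a criterion of the form \(\text {DGIC}(s)=N\log {\mathcal {L}}_{{\mathcal {A}}}+sK\log (m)\log \log N\) over candidate sparsity levels and keeping the minimizer. A minimal Python sketch follows; the exact form of the criterion is reconstructed from the displayed differences and should be read as an assumption of this sketch rather than a verbatim restatement.

```python
import numpy as np

def dgic(residual_sum_of_squares, s_hat, N, K, m):
    """Criterion value consistent with the DGIC differences displayed above:
        DGIC(s) = N * log(L_A) + s * K * log(m) * log(log(N)),
    where L_A = RSS / (2 N) is the pooled least-squares loss."""
    loss = residual_sum_of_squares / (2.0 * N)
    return N * np.log(loss) + s_hat * K * np.log(m) * np.log(np.log(N))

# The central machine evaluates dgic(...) for each candidate sparsity level
# and keeps the minimizer; Theorem 2 shows this recovers the true active set
# with high probability.
```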

A.4 Proof of Lemma 2

Proof

Applying Theorem 3 of Zhu et al. (2020), with probability \(1- \gamma _{s}(n,p)\), we have

$$\begin{aligned} \Vert \varvec{\beta }_{{{k}},{\mathcal {I}}_{{{k}},1}^{l}}^{*}\Vert _{2}^2\leqslant \frac{2{\mathcal {L}}_n(\varvec{\beta }_{{k}}^{l+1})-2 {\mathcal {L}}_n(\varvec{\beta }_{{k}}^{l})}{(1-\rho _{s})(1-\Delta ) \left( c_{-}(s)-\frac{(\theta _{s,s})^2}{{c_{-}(s)}}\right) }, \end{aligned}$$
(14)
$$\begin{aligned} 2n {\mathcal {L}}_n(\varvec{\beta }_{{k}}^{l+1})-2 n{\mathcal {L}}_n(\varvec{\beta }^{*})\leqslant \rho _{s}(2 n{\mathcal {L}}_n(\varvec{\beta }_{{k}}^{l})-2 n {\mathcal {L}}_n(\varvec{\beta }^{*})). \end{aligned}$$
(15)

By \({\mathcal {A}}^*={\mathcal {A}}_{{{k}},1}^l\cup {\mathcal {I}}_{{{k}},1}^l\), we deduce that

$$\begin{aligned} \varvec{\beta }^{*}=\varvec{\beta }_{{\mathcal {A}}_{{{k}},1}^{l}}^{*} +\varvec{\beta }_{{\mathcal {I}}_{{{k}},1}^{l}}^{*}. \end{aligned}$$

By \({\mathcal {A}}_{{k}}^l={\mathcal {A}}_{{{k}},1}^l\cup {\mathcal {A}}_{{{k}},2}^l\) and \(\varvec{\beta }^*_{{\mathcal {A}}_{{{k}},2}^l}=0\), it follows that

$$\begin{aligned} \varvec{\beta }^{*} =\varvec{\beta }_{{\mathcal {A}}_{{k}}^{l}}^{*}+\varvec{\beta }_{{\mathcal {I}}_{{{k}},1}^{l}}^{*}. \end{aligned}$$
(16)

By the triangle inequality for the \(\ell _2\) norm,

$$\begin{aligned} \begin{aligned} \Vert \varvec{\beta }_{{k}}^{l}-\varvec{\beta }^{*}\Vert _{2} \leqslant \Vert \varvec{\beta }_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{l} -\varvec{\beta }_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{*}\Vert _{2}+\Vert \varvec{\beta }_{{\mathcal {I}}_{{{k}},1}^{l}}^{*}\Vert _{2}. \end{aligned} \end{aligned}$$

On machine k, the estimator \(\varvec{\beta }_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{l}\) is computed by least squares for the linear regression problem \(\varvec{y}_{{k}}=\varvec{X}_{{k}}\varvec{\beta }^{*}+\varvec{\epsilon }_{{k}}\) restricted to \({\mathcal {A}}_{{k}}^{l}\), i.e. \(\varvec{\beta }_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{l}=(\varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}})^{-1} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}}(\varvec{X}_{{k}} \varvec{\beta }^{*}+\varvec{\epsilon }_{{k}})\). We then get

$$\begin{aligned} \begin{aligned}&\Vert \varvec{\beta }_{{k}}^{l}-\varvec{\beta }^{*}\Vert _{2}\\&\leqslant \Vert (\varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}})^{-1} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}}(\varvec{X}_{{k}} \varvec{\beta }^{*}+\varvec{\epsilon }_{{k}}-\varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}} \varvec{\beta }_{{\mathcal {A}}_{{k}}^{l}}^{*})\Vert _{2}\\&+\Vert \varvec{\beta }_{{\mathcal {I}}_{{{k}},1}^{l}}^{*}\Vert _{2}. \end{aligned} \end{aligned}$$

Using (16), we have

$$\begin{aligned} \begin{aligned} \Vert \varvec{\beta }_{{k}}^{l}-\varvec{\beta }^{*}\Vert _{2}&\leqslant \Vert (\varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}})^{-1} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{X}_{{{k}},{\mathcal {I}}_{{{k}},1}^{l}} \varvec{\beta }_{{\mathcal {I}}_{{{k}},1}^{l}}^{*}\Vert _{2}\\&+\Vert (\varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}})^{-1} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}}\varvec{\epsilon }_{{k}}\Vert _{2} +\Vert \varvec{\beta }_{{\mathcal {I}}_{{{k}},1}^{l}}^{*}\Vert _{2}. \end{aligned} \end{aligned}$$

Combining this with the SRC condition, we obtain

$$\begin{aligned} \Vert \varvec{X}_{{{k}},{{\mathcal {A}}_{{k}}^l}}^{{T}}\varvec{X}_{{{k}},{{\mathcal {A}}}_{{k}}^l}/n\Vert _2\geqslant c_{-}(s),\,\Vert \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{X}_{{{k}},{\mathcal {I}}_{{{k}},1}^{l}}/n \Vert _2\leqslant \theta _{s,s}. \end{aligned}$$

Thus,

$$\begin{aligned} \begin{aligned} \Vert \varvec{\beta }_{{k}}^{l}-\varvec{\beta }^{*}\Vert _{2}&\leqslant \left( 1+\frac{\theta _{s, s}}{c_{-}(s)}\right) \Vert \varvec{\beta }_{{\mathcal {I}}_{{{k}},1}^{l}}^{*}\Vert _{2} +\frac{\Vert \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{\epsilon }_{{k}}\Vert _{2}}{n c_{-}(s)}. \end{aligned} \end{aligned}$$

By (14) and (15), we can deduce that

$$\begin{aligned} \begin{aligned}&\Vert \varvec{\beta }_{{k}}^{l}-\varvec{\beta }^{*}\Vert _{2} \leqslant \frac{1+\frac{\theta _{s, s}}{c_{-}(s)}}{(1-\Delta ) n\left( c_{-}(s) -\frac{(\theta _{s, s})^{2}}{c_{-}(s)}\right) } \rho _{s}^{l}\\&\quad \vert 2 n {\mathcal {L}}_{n}(\varvec{\beta }^{0}) -2 n {\mathcal {L}}_{n}(\varvec{\beta }^{*})\vert \\&\quad +\frac{\Vert \varvec{X}_{{{k}}, {\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{\epsilon }_{{k}}\Vert _{2}}{n c_{-}(s)} \\&\quad \leqslant \frac{1+\frac{\theta _{s, s}}{c_{-}(s)}}{(1-\Delta ) n\left( c_{-}(s) -\frac{(\theta _{s, s})^2}{c_{-}(s)}\right) } \rho _{s}^{l}\Vert \varvec{y}_{{k}}\Vert _{2}^{2}\\&\quad +\frac{\Vert \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{\epsilon }_{{k}}\Vert _{2}}{n c_{-}(s)}. \end{aligned} \end{aligned}$$

Using the sub-Gaussian condition on \(\varvec{\epsilon }_k\), with probability \(1-O(p^{-c_0})- \gamma _{{s}}(n,p)\), where \(c_0\) is an arbitrarily large constant, we can show that

$$\begin{aligned} \begin{aligned} \Vert \varvec{\beta }_{{k}}^{l}-\varvec{\beta }^{*}\Vert _{2}&\leqslant \frac{1+\frac{\theta _{{s}, s}}{c_{-}(s)}}{(1-\Delta ) n\left( c_{-}(s) -\frac{(\theta _{s, s})^{2}}{c_{-}(s)}\right) } \rho _{s}^{l}\Vert \varvec{y}_{{k}}\Vert _{2}^{2}\\&+\frac{c\sqrt{s\log p}}{\sqrt{n} c_{-}(s)}, \end{aligned} \end{aligned}$$

where c is a constant. Since \(\rho _{s}<1\), the first term decays geometrically in l and is eventually dominated by the second term, so we have

$$\begin{aligned} \begin{aligned} \Vert \hat{\varvec{\beta }}_{{k}}-\varvec{\beta }^{*}\Vert _{2} \leqslant C\frac{\sqrt{s\log p}}{\sqrt{n}}, \end{aligned} \end{aligned}$$

where C is a constant. \(\square \)
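The local update used repeatedly in this proof is an ordinary least-squares refit restricted to the current active set. A minimal sketch of that step, with a hypothetical interface, is:

```python
import numpy as np

def restricted_least_squares(X_k, y_k, active_set):
    """Refit the coefficients on machine k by least squares restricted to the
    columns in the current active set, leaving all other coefficients at zero.
    `active_set` is a 1-D integer array of column indices (hypothetical)."""
    p = X_k.shape[1]
    beta = np.zeros(p)
    X_A = X_k[:, active_set]
    # Solve the normal equations (X_A^T X_A) b = X_A^T y_k.
    beta[active_set] = np.linalg.solve(X_A.T @ X_A, X_A.T @ y_k)
    return beta
```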

B Add Health data: variable selection results

Tables 6 and 7 show the detailed variable selection results.

Table 6 Variable selection results by Central-ABESS and DC-DBESS
Table 7 Variable selection results by Central-ABESS and DC-DBESS

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Cite this article

Chen, Y., Dong, R. & Wen, C. Communication-efficient estimation for distributed subset selection. Stat Comput 33, 141 (2023). https://doi.org/10.1007/s11222-023-10302-7
