Communication-efficient estimation for distributed subset selection

Chen, Yan; Dong, Ruipeng; Wen, Canhong

doi:10.1007/s11222-023-10302-7

Communication-efficient estimation for distributed subset selection

Original Paper
Published: 16 October 2023

Volume 33, article number 141, (2023)
Cite this article

Statistics and Computing Aims and scope Submit manuscript

Yan Chen¹,
Ruipeng Dong² &
Canhong Wen²

314 Accesses
Explore all metrics

Abstract

Due to the large scale both of the sample size and dimensions, modern data is usually stored in a distributed system, which poses unprecedented challenges in computation and statistical inference. Best subset selection is widely known as a benchmark method for handling high-dimensional data. However, there still is a lack of the study of the efficient algorithm for the best subset selection in the distributed system. To this end, we propose a new communication-efficient method to deal with the best subset selection in the distributed system. The proposed method restricts the information communication among local machines in a moderate active set, and leads not only to an efficient computation but also a cheaper cost of communication in a network of the distributed system. Moreover, we propose a new generalized information criterion for tuning the sparsity level on the central machine. Under mild conditions, we establish the consistency of estimation and variable selection for the proposed estimator. We demonstrate the superiority of the proposed method through several numerical studies and a real data application in adolescent health.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Imbalanced data preprocessing techniques for machine learning: a systematic mapping study

Article 09 November 2022

Stratified random sampling from streaming and stored data

Article 23 October 2020

Entropy-Based Subsampling Methods for Big Data

Article 11 April 2024

References

Battey, H., Fan, J., Liu, H., Lu, J., Zhu, Z.: Distributed testing and estimation under sparse high dimensional models. Ann. Stat. 46(3), 1352–1382 (2018)
Article MathSciNet MATH Google Scholar
Bertsimas, D., King, A., Mazumder, R.: Best subset selection via a modern optimization lens. Ann. Stat. 44(2), 813–852 (2016)
Article MathSciNet MATH Google Scholar
Burrows, M.: The chubby lock service for loosely-coupled distributed systems. In: 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2006)
Cai, T., Liu, W., Luo, X.: A constrained $\ell _1$ minimization approach to sparse precision matrix estimation. J. Am. Stat. Assoc. 106(494), 594–607 (2011)
Article MATH Google Scholar
Candes, E.J., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)
Article MathSciNet MATH Google Scholar
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: Amazon’s highly available key-value store. ACM SIGOPS Oper. Syst. Rev. 41(6), 205–220 (2007)
Article Google Scholar
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–499 (2004)
Article MathSciNet MATH Google Scholar
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)
Article MathSciNet MATH Google Scholar
Fan, Y., Lv, J.: Asymptotic properties for combined $\ell _1$ and concave regularization. Biometrika 101(1), 57–70 (2014)
Article MathSciNet Google Scholar
Fan, Y., Lv, J.: Innovated scalable efficient estimation in ultra-large gaussian graphical models. Ann. Stat. 44(5), 2098–2126 (2016)
Article MathSciNet MATH Google Scholar
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)
Huang, J., Jiao, Y., Liu, Y., Lu, X.: A constructive approach to $l_0$ penalized regression. J. Mach. Learn. Res. 19(10), 1–37 (2018)
MathSciNet MATH Google Scholar
Jordan, M.I., Lee, J., Yang, Y.: Communication-efficient distributed statistical inference. J. Am. Stat. Assoc. 114, 668–681 (2018)
Article MathSciNet MATH Google Scholar
Lee, J.D., Liu, Q., Sun, Y., Taylor, J.E.: Communication-efficient sparse regression. J. Mach. Learn. Res. 18(1), 115–144 (2017)
MathSciNet MATH Google Scholar
Mullan, Kathleen. H.: The national longitudinal study of adolescent to adult health (add health), waves i & ii, 1994–1996; wave iii, 2001–2002; wave iv, 2007–2009 [machine-readable data file and documentation]. Chapel Hill, NC: Carolina Population Center, University of North Carolina at Chapel Hill 10 (2009)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 58(1), 267–288 (1996)
Toshniwal, A., Taneja, S., Shukla, A., Ramasamy, K., Patel, J.M., Kulkarni, S., Jackson, J., Gade, K., Fu, M., Donham, J., et al: Storm@ twitter. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 147–156 (2014)
Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. In: Compressed Sensing (2012)
Wang, J., Kolar, M., Srebro, N., Zhang, T.: Efficient distributed learning with sparsity. In: Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 3636–3645 (2017)
Wen, C., Zhang, A., Quan, S., Wang, X.: Bess: An r package for best subset selection in linear, logistic and cox proportional hazards models. J. Stat. Softw. 94(4), 1–24 (2020)
Article Google Scholar
Zhang, C.-H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)
Article MathSciNet MATH Google Scholar
Zhang, C.-H., Huang, J.: The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann. Stat. 36(4), 1567–1594 (2008)
Article MathSciNet MATH Google Scholar
Zhang, Y., Duchi, J.C., Wainwright, M.J.: Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res. 14(1), 3321–3363 (2013)
Zheng, Z., Fan, Y., Lv, J.: High dimensional thresholded regression and shrinkage effect. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 76(3), 627–649 (2014)
Article MathSciNet MATH Google Scholar
Zheng, Z., Bahadori, M.T., Liu, Y., Lv, J.: Scalable interpretable multi-response regression via seed. J. Mach. Learn. Res. 20(107), 1–34 (2019)
Zhu, J., Wen, C., Zhu, J., Zhang, H., Wang, X.: A polynomial algorithm for best-subset selection problem. Proc. Natl. Acad. Sci. 117(52), 33117–33123 (2020)
Article MathSciNet MATH Google Scholar
Zhu, X., Li, F., Wang, H.: Least-square approximation for a distributed system. J. Comput. Graph. Stat. 30(4), 1004–1018 (2021)
Article MathSciNet MATH Google Scholar
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)
Article MathSciNet MATH Google Scholar

Download references

Acknowledgements

Wen’s research is partially supported by National Science Foundation of China (Grant 12171449), Fundamental Research Funds for the Central Universities (Grant WK3470000027), and USTC Research Funds of the Double First-Class Initiative (Grant YD2040002019). Dong’s research is supported by China Postdoctoral Science Foundation (Grant 2023M733402).

Author information

Authors and Affiliations

School of Mathematical Sciences, University of Science and Technology of China, Hefei, 230026, Anhui, China
Yan Chen
International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, 230026, Anhui, China
Ruipeng Dong & Canhong Wen

Authors

Yan Chen
View author publications
You can also search for this author in PubMed Google Scholar
Ruipeng Dong
View author publications
You can also search for this author in PubMed Google Scholar
Canhong Wen
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Yan Chen, Ruipeng Dong and Canhong Wen wrote the main manuscript text. And Yan Chen prepared the simulation result and data analysis.

Corresponding author

Correspondence to Ruipeng Dong.

Ethics declarations

Competing interests

The authors have no relevant financial or non-financial interests to disclose. The authors have no competing interests to declare that are relevant to the content of this article. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript. The authors have no financial or proprietary interests in any material discussed in this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proofs of main results

To ease the presentation, we now define some notation. On the k-th distributed machine, we remark that ${\mathcal {A}}_{{k}}^l$ and ${\mathcal {I}}_{{k}}^l$ are the active set and the inactive set in the l-th calculation by the ABESS algorithm. The true active set is ${\mathcal {A}}^{*}$ and true inactive set is ${\mathcal {I}}^{*}$. Let

$$\begin{aligned} \begin{array}{l} {\mathcal {A}}_{{k},1}^l={\mathcal {A}}_{{k}}^l \cap {\mathcal {A}}^{*},{\mathcal {A}}_{{k},2}^l ={\mathcal {A}}^l_{k}\cap {\mathcal {I}}^{*}, \\ {\mathcal {I}}_{{k},1}^l ={\mathcal {I}}_{{k}}^l \cap {\mathcal {A}}^{*},{\mathcal {I}}_{{k},2}^l ={\mathcal {I}}_{{k}}^l \cap {\mathcal {I}}^{*}. \end{array} \end{aligned}$$

In the l-th iteration, ${\mathcal {A}}_{{k},1}^l$ represents the correct active set is selected, ${\mathcal {A}}^l_{{k},2}$ represents the incorrect active set is selected, ${\mathcal {I}}^l_{{k},1}$ represents the missing active set, ${\mathcal {I}}^l_{{k},2}$ represents the correct inactive set is selected. We denote the true estimator $\varvec{\beta }^*$, $\bar{\varvec{\beta }}^*=\varvec{\beta }^*_{{\mathcal {A}}}$, and the l-th estimator $\varvec{\beta }^l_{k}$.

Then we present one additional lemma and the proof of Lemma 2 is provided in A.4.

Lemma 2

Suppose $(\hat{\varvec{\beta }}_{k},\hat{{\mathcal {A}}}_{k})$ is the output of the k-th worker of ABESS under $s\geqslant s^*$, assume Conditions 1, 2, 4, and 5 hold, ${{k}}=1,2,\cdots ,K$. Denote $\gamma _{{s}}(n,p)=O(p\exp (-\frac{C_snb^*}{s}))$, where $C_s$ is a constant that depends on s. With probability $1-O(p^{-c_0})-\gamma _{{s}}(n,p)$, we have

$$\begin{aligned} \begin{aligned} \Vert \hat{\varvec{\beta }}_{{k}}-\varvec{\beta }^{*}\Vert _{2}&\leqslant O\left( \frac{\sqrt{s\log p}}{\sqrt{n}}\right) . \end{aligned} \end{aligned}$$

(11)

1.1 A.2 Proof of Lemma 1

Proof

On each distributed machine, by Lemma 1 of Zhu et al. (2020), we can calculate the probability of the missing active set

$$\begin{aligned} \textrm{P}(\hat{{\mathcal {A}}}_{{k}} \nsupseteq {\mathcal {A}}^{*}) < \gamma _{s}(n, p). \end{aligned}$$

Considering the probability of the missing elements on the central machine

$$\begin{aligned} \begin{aligned} \textrm{P}({\mathcal {A}} \nsupseteq {\mathcal {A}}^{*})&= \textrm{P}\left( \bigcup \limits _{{{k}}=1}^{K}\hat{{\mathcal {A}}}_{{k}} \nsupseteq {\mathcal {A}}^{*}\right) . \end{aligned} \end{aligned}$$

On each machine, the occurrence of missing active sets is independent of each other. It follows that

$$\begin{aligned} \begin{aligned}&\textrm{P}({\mathcal {A}} \nsupseteq {\mathcal {A}}^{*})\\&=\textrm{P}(\hat{{\mathcal {A}}}_1 \nsupseteq {\mathcal {A}}^{*})\textrm{P}(\hat{{\mathcal {A}}}_2 \nsupseteq {\mathcal {A}}^{*})\cdots \textrm{P}(\hat{{\mathcal {A}}}_K \nsupseteq {\mathcal {A}}^{*})\\&<\gamma _{s}(n, p)\gamma _{s}(n, p)\cdots \gamma _{s}(n, p)\\&=O\left( p^K\,\exp {\left[ -\frac{C_sKnb^*}{{s}}\right] }\right) \\&=O\left( p^K\,\exp {\left[ -\frac{C_sNb^*}{{s}}\right] }\right) , \end{aligned} \end{aligned}$$

where $C_{s}$ is a constant depending on s. The last equality by $N=nK$. Define $\gamma _{{s},K}(n,p)=O(p^K\,\exp {-\frac{C_{s}Nb^*}{{s}}})$, with probability $1-\gamma _{{s},K}(N, p)$, since we can gain

$$\begin{aligned} {{\mathcal {A}}}=\bigcup \limits _{{{k}}=1}^{K}{\hat{{\mathcal {A}}}_{{k}}}\supseteq {\mathcal {A}}^{*}. \end{aligned}$$

$\square $

1.2 A.2 Proof of Theorem 1

Proof

Conditions of Theorem 1 satisfy the conditions of the Lemma 1. Therefore, with probability $1- \gamma _{{s},K}(n,p)$,

$$\begin{aligned} {{\mathcal {A}}}=\bigcup \limits _{{{k}}=1}^{K}{\hat{{\mathcal {A}}}_{{k}}}\supseteq {\mathcal {A}}^{*}. \end{aligned}$$

Combining the DC-DBESS algorithm, it holds that

$$\begin{aligned} \hat{\varvec{\beta }}_{d}=\frac{1}{K}\sum \limits _{{{k}}=1}^K \hat{\varvec{\beta }}_{{k}}+\frac{1}{N}\sum \limits _{{{k}}=1}^K\varvec{M}_{{k}} \hat{ \varvec{X}}_{{k}}^T(\varvec{y}_{{k}}-\hat{\varvec{X}}_{{k}} \hat{\varvec{\beta }}_{{k}}). \end{aligned}$$

The algorithm uses the local derivative information and the local debiased information. Further, we analyze the convergence order of the Algorithm 1.

$$\begin{aligned} \begin{array}{l} \hat{\varvec{\beta }}_{d}-\bar{\varvec{\beta }}^*=\frac{1}{K}\sum \limits _{{{k}}=1}^K(\hat{\varvec{\beta }}_{{k}}-\bar{\varvec{\beta }}^*) +\frac{1}{N}\sum \limits _{{{k}}=1}^K\varvec{M}_{{k}} \hat{\varvec{X}}_{{k}}^T( \varvec{y}_{{k}}-\hat{\varvec{X}}_{{k}}\hat{\varvec{\beta }}_{{k}}). \end{array} \end{aligned}$$

Based on distributed system, under the condition that Lemma 1 holds, with probability $1-\gamma _{{s},K}(N,p)$, we could obtain ${{\mathcal {A}}}=\bigcup \limits _{{{k}}=1}^{K}{\hat{{\mathcal {A}}}_{{k}}}\supseteq {\mathcal {A}}^{*}$, then, we have $ \varvec{y}_{{k}}=\hat{ \varvec{X}}_{{k}}\bar{\varvec{\beta }}^*+\varvec{\epsilon }_{{k}}$. Thus,

$$\begin{aligned} \begin{aligned} \hat{\varvec{\beta }}_{d}-\bar{\varvec{\beta }}^*&=\frac{1}{K}\sum \limits _{{{k}}=1}^K(\hat{\varvec{\beta }}_{{k}} -\bar{\varvec{\beta }}^*)+\frac{1}{N}\sum \limits _{{{k}}=1}^K\varvec{M}_{{k}}\hat{\varvec{X}}_{{k}}^T\hat{\varvec{X}}_{{k}}(\bar{\varvec{\beta }}^*-\hat{\varvec{\beta }}_{{k}})\\&+\frac{1}{N}\sum \limits _{{{k}}=1}^K\varvec{M}_{{k}} \hat{\varvec{X}}_{{k}}^T\varvec{\epsilon }_{{k}}. \end{aligned} \end{aligned}$$

We note $\hat{\varvec{\Sigma }}_{{k}}=\hat{\varvec{X}}_{{{k}},{\mathcal {A}}}^T\hat{\varvec{X}}_{{{k}},{\mathcal {A}}}/n$ and define the bias $\hat{\Delta }_{{k}}=(\varvec{M}_{{k}}\hat{\varvec{\Sigma }}_{{k}}-\varvec{I})(\bar{\varvec{\beta }}^*-\hat{\varvec{\beta }}_{{k}})$. Then, it follows that

$$\begin{aligned} \begin{array}{l} \quad \hat{\varvec{\beta }}_{d}-\varvec{\beta }^*=\frac{1}{K}\sum \limits _{{{k}}=1}^K\hat{\Delta }_{{k}}+\frac{1}{N}\sum \limits _{{{k}}=1}^K\varvec{M}_{{{k}}} \hat{\varvec{X}}_{{{k}}}^T \varvec{\epsilon }_{{k}}. \end{array} \end{aligned}$$

Denote $\hat{\varvec{D}}_{{{k}}}=\varvec{M}_{{{k}}} \hat{\varvec{X}}_{{{k}}}^T\in {\mathbb {R}}^{m\times n}$ and the number of rows is m. $\hat{\varvec{D}}=(\hat{\varvec{D}}_1,\hat{\varvec{D}}_2,\cdots ,\hat{\varvec{D}}_K)\in {\mathbb {R}}^{m\times N}$, $\varvec{\epsilon }=(\varvec{\epsilon }_1,\varvec{\epsilon }_2,\cdots ,\varvec{\epsilon }_K)$, we can show that

$$\begin{aligned} \frac{1}{N}\sum \limits _{{{{k}}}=1}^K\varvec{M}_{{{k}}} \hat{\varvec{X}}_{{{k}}}^T\varvec{\epsilon }_{{{k}}}=\frac{1}{N}\hat{\varvec{D}}\varvec{\epsilon }. \end{aligned}$$

By Condition 1, using Vershynin (2012), Proposition 5.16, we have

$$\begin{aligned} \left\| \frac{1}{N}\sum \limits _{{{{k}}}=1}^K \varvec{M}_{{{k}}} \hat{\varvec{X}}_{{{k}}}^T \varvec{\epsilon }_{{{k}}}\right\| _{\infty }=\Vert \frac{1}{N}\hat{\varvec{D}} \varvec{\epsilon }\Vert _{\infty } =O\left( \sqrt{\frac{\log m}{N}}\right) . \end{aligned}$$

(12)

Next, we consider the order of the bias $\hat{\Delta }_{{k}}$.

$$\begin{aligned}\begin{aligned} \Vert \hat{\Delta }_{{k}}\Vert _{\infty }&=\Vert (\varvec{M}_{{{k}}} \hat{\varvec{\Sigma }}_{{{k}}}-\varvec{I})(\hat{\varvec{\beta }}_{{{k}}}-\varvec{\beta }^{*})\Vert _{\infty }\\&\leqslant \max _{j \in [m]}\Vert {\varvec{M}_{{{k}},j}}\hat{\varvec{\Sigma }}_{{{k}},j}-e_j\Vert _{\infty } \Vert \hat{\varvec{\beta }}_{{{k}}}-\varvec{\beta }^{*}\Vert _{1}. \end{aligned} \end{aligned}$$

By using the technique in Cai et al. (2011), Fan and Lv (2016), we have

$$\begin{aligned} \max _{j \in [m]}\Vert {\varvec{M}_{{{k}},j}}\hat{\varvec{\Sigma }}_{{{k}},j}-e_j\Vert _{\infty } \leqslant C\left( \sqrt{\frac{\log m}{n}}\right) . \end{aligned}$$

By the equivalence of the norms and Lemma 2, it follows that

$$\begin{aligned} \Vert \hat{\varvec{\beta }}_{{{k}}}-\varvec{\beta }^{*}\Vert _{1}\leqslant \sqrt{2s}\Vert \hat{\varvec{\beta }}_{{{k}}}-\varvec{\beta }^{*}\Vert _{2} \leqslant C\sqrt{\frac{{s^2\log p}}{{n}}}. \end{aligned}$$

Therefore,

$$\begin{aligned} \begin{aligned} \Vert \hat{\Delta }_{{k}}\Vert _{\infty }&\leqslant C\left( \frac{\sqrt{s^2\log p \log m}}{n}\right) . \end{aligned} \end{aligned}$$

(13)

Thus, combining (12) and (13), we can further gain the convergence of the algorithm. With probability $1-O(Kp^{-c_0})-K\gamma _{{s}}(n,p)$, we have

$$\begin{aligned}\begin{aligned}&\Vert \hat{\varvec{\beta }}_{d}-\bar{\varvec{\beta }}^{*}\Vert _{\infty } \\ {}&=\left\| \frac{1}{K} \sum _{{{k}}=1}^{K} \hat{\Delta }_{{{k}}}+\frac{1}{N} \sum _{{{k}}=1}^{K} \varvec{M}_{{{k}}} \hat{\varvec{X}}_{{{k}}}^{T} \hat{\varvec{\epsilon }}_{{{k}}}\right\| _{\infty } \\ {}&\leqslant \frac{1}{K} \sum _{{{k}}=1}^{K}\left\| \hat{\Delta }_{{{k}}}\right\| _{\infty }+\frac{1}{N} \sum _{{{k}}=1}^{K}\left\| \varvec{M}_{{{k}}} \hat{\varvec{X}}_{{{k}}}^{T} \hat{\varvec{\epsilon }}_{{{k}}}\right\| _{\infty } \\ {}&\leqslant C\left( K \frac{{s}\sqrt{\log p\log m}}{N}+\sqrt{\frac{\log m}{N}}\right) . \end{aligned} \end{aligned}$$

By $\lambda =c_1 \sqrt{\log m/N}$ for some sufficiently large constant $c_1$, we have

$$\begin{aligned}\begin{aligned} \Vert \hat{\varvec{\beta }}-\varvec{\beta }^{*}\Vert _{\infty }&\leqslant {\Vert \hat{\varvec{\beta }}_{d}-\bar{\varvec{\beta }}^{*}\Vert _{\infty }+\Vert \hat{\varvec{\beta }}_{d}-\hat{\varvec{\beta }}_{{\mathcal {A}}}\Vert _{\infty }} \\ {}&\leqslant C\left( K \frac{{s}\sqrt{\log p\log m}}{N}+\sqrt{\frac{\log m}{N}}\right) . \end{aligned} \end{aligned}$$

$\square $

1.3 A.3 Proof of Theorem 2

Proof

We will omit the specific details of the proof and only list some of its main parts, the detailed proof refers to Theorem 4 of Zhu et al. (2020). Same as the proof of Theorem 1, the conditions of Theorem 2 satisfy the conditions of Lemma 1. With probability $1-\gamma _{{s},K}(n,p)$, we can obtain

$$\begin{aligned} {{\mathcal {A}}}=\bigcup \limits _{{{k}}=1}^{K}{\hat{{\mathcal {A}}}_{{k}}}\supseteq {\mathcal {A}}^{*}. \end{aligned}$$

Denote $\hat{s}$ is the cardinality of $\hat{{\mathcal {A}}}$, for any $\hat{s}\geqslant s^*$, combining the minimal strength $b^*\geqslant c_2(\frac{\log m}{N})^{\frac{1}{2}}$ for some sufficiently large constant $c_2$, we have

$$\begin{aligned} \begin{aligned} \textrm{P}(\hat{{\mathcal {A}}} \supseteq {\mathcal {A}}^{*})\geqslant 1-\gamma . \end{aligned} \end{aligned}$$

Denote ${\mathcal {L}}_{\hat{{\mathcal {A}}}}=\frac{1}{2N}\sum _{{{k}}=1}^K\Vert \varvec{{y}}_{{k}} -\varvec{\hat{X}}_{{k}}\varvec{\hat{\beta }}\Vert ^2$ and ${\mathcal {L}}_{{{\mathcal {A}}}^*}=\frac{1}{2N}\sum _{{{k}}=1}^K\Vert \varvec{{y}}_{{k}} -\varvec{\hat{X}}_{{k}}\varvec{{\beta }}^*\Vert ^2$. For a sufficiently large N, we have

$$\begin{aligned} \begin{aligned}&\text {DGIC}(\vert {\mathcal {A}}^*\vert )- \text {DGIC}(\vert \hat{{\mathcal {A}}}\vert )\\&=N\log {\left( {{\mathcal {L}}_{{{\mathcal {A}}}^*}}/{{\mathcal {L}}_{\hat{{\mathcal {A}}}}}\right) }-(\hat{s}-s^*)K\log (m)\log \log N \\ {}&\leqslant O\left( (\hat{s}-s^*)\log (2m)\right) -(\hat{s}-s^*)K\log (m)\log \log N \\ {}&<0. \end{aligned} \end{aligned}$$

On the other hand, if $\hat{s}< s^*$, use Conditions 2, 3, for a sufficiently large N, we have

$$\begin{aligned} \begin{aligned}&\text {DGIC}(\vert \hat{{\mathcal {A}}}\vert )-\text {DGIC}(\vert {\mathcal {A}}^*\vert )\\&=N\log {\left( {{\mathcal {L}}_{\hat{{\mathcal {A}}}}}/{{\mathcal {L}}_{{{\mathcal {A}}}^*}}\right) }+(\hat{s}-s^*)K\log (m)\log \log N \\ {}&\geqslant N O\left( \min \{1,(s^*-\hat{s})b^*\}\right) -(s^*-\hat{s})K\log (m)\log \log N \\ {}&>0. \end{aligned} \end{aligned}$$

Thus, with probability $1-O(p^{-c_0})-\gamma _{{s},K}(n,p)$, it is easy to deduce that

$$\begin{aligned} {{\hat{{\mathcal {A}}}}}={\mathcal {A}}^*. \end{aligned}$$

$\square $

1.4 A.4 Proof of Lemma 2

Proof

Applying Theorem 3 of Zhu et al. (2020), with probability $1- \gamma _{s}(n,p)$, we have

$$\begin{aligned}{} & {} \Vert \varvec{\beta }_{{{k}},{\mathcal {I}}_{{{k}},1}^{l}}^{*}\Vert _{2}^2\leqslant \frac{2{\mathcal {L}}_n(\varvec{\beta }_{{k}}^{l+1})-2 {\mathcal {L}}_n(\varvec{\beta }_{{k}}^{l})}{(1-\rho _{s})(1-\Delta ) \left( c_{-}(s)-\frac{(\theta _{s,s})^2}{{c_{-}(s)}}\right) }, \end{aligned}$$

(14)

$$\begin{aligned}{} & {} \quad 2n {\mathcal {L}}_n(\varvec{\beta }_{{k}}^{l+1})-2 n{\mathcal {L}}_n(\varvec{\beta }^{*})\leqslant \rho _{s}(2 n{\mathcal {L}}_n(\varvec{\beta }_{{k}}^{l})-2 n {\mathcal {L}}_n(\varvec{\beta }^{*})). \end{aligned}$$

(15)

By ${\mathcal {A}}^*={\mathcal {A}}_{{{k}},1}^l\cup {\mathcal {I}}_{{{k}},1}^l$, we deduce that

$$\begin{aligned} \varvec{\beta }^{*}=\varvec{\beta }_{{\mathcal {A}}_{{{k}},1}^{l}}^{*} +\varvec{\beta }_{{\mathcal {I}}_{{{k}},1}^{l}}^{*}. \end{aligned}$$

By ${\mathcal {A}}_{{k}}^l={\mathcal {A}}_{{{k}},1}^l\cup {\mathcal {A}}_{{{k}},2}^l$ and $\varvec{\beta }^*_{{\mathcal {A}}_{{{k}},2}^l}=0$, it follows that

$$\begin{aligned} \varvec{\beta }^{*} =\varvec{\beta }_{{\mathcal {A}}_{{k}}^{l}}^{*}+\varvec{\beta }_{{\mathcal {I}}_{{{k}},1}^{l}}^{*}. \end{aligned}$$

(16)

We can use the trigonometric inequality of the $\ell _2$ norm:

$$\begin{aligned} \begin{aligned} \Vert \varvec{\beta }_{{k}}^{l}-\varvec{\beta }^{*}\Vert _{2} \leqslant \Vert \varvec{\beta }_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{l} -\varvec{\beta }_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{*}\Vert _{2}+\Vert \varvec{\beta }_{{\mathcal {I}}_{{{k}},1}^{l}}^{*}\Vert _{2}. \end{aligned} \end{aligned}$$

We calculate the distributed estimator $\varvec{\beta }_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{l}$ of linear regression problem $\varvec{y}_{{k}}=\varvec{X}_{{k}}\varvec{\beta }^{*}+\varvec{\epsilon }_{{k}}$ by least squares, i.e. $\varvec{\beta }_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{l}=(\varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}})^{-1} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}}(\varvec{X}_{{k}} \varvec{\beta }^{*}+\varvec{\epsilon }_{{k}})$, we get

$$\begin{aligned} \begin{aligned}&\Vert \varvec{\beta }_{{k}}^{l}-\varvec{\beta }^{*}\Vert _{2}\\&\leqslant \Vert (\varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}})^{-1} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}}(\varvec{X}_{{k}} \varvec{\beta }^{*}+\varvec{\epsilon }_{{k}}-\varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}} \varvec{\beta }_{{\mathcal {A}}_{{k}}^{l}}^{*})\Vert _{2}\\&+\Vert \varvec{\beta }_{{\mathcal {I}}_{{{k}},1}^{l}}^{*}\Vert _{2}. \end{aligned} \end{aligned}$$

Using (16), we have

$$\begin{aligned} \begin{aligned} \Vert \varvec{\beta }_{{k}}^{l}-\varvec{\beta }^{*}\Vert _{2}&\leqslant \Vert (\varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}})^{-1} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{X}_{{{k}},{\mathcal {I}}_{{{k}},1}^{l}} \varvec{\beta }_{{\mathcal {I}}_{{{k}},1}^{l}}^{*}\Vert _{2}\\&+\Vert (\varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}})^{-1} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}}\varvec{\epsilon }_{{k}}\Vert _{2} +\Vert \varvec{\beta }_{{\mathcal {I}}_{{{k}},1}^{l}}^{*}\Vert _{2}. \end{aligned} \end{aligned}$$

Combining SRC condition, then

$$\begin{aligned} \Vert \varvec{X}_{{{k}},{{\mathcal {A}}_{{k}}^l}}^{{T}}\varvec{X}_{{{k}},{{\mathcal {A}}}_{{k}}^l}/n\Vert _2\geqslant c_{-}(s),\,\Vert \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{X}_{{{k}},{\mathcal {I}}_{{{k}},1}^{l}}/n \Vert _2\leqslant \theta _{s,s}. \end{aligned}$$

Thus,

$$\begin{aligned} \begin{aligned} \Vert \varvec{\beta }_{{k}}^{l}-\varvec{\beta }^{*}\Vert _{2}&\leqslant \left( 1+\frac{\theta _{s, s}}{c_{-}(s)}\right) \Vert \varvec{\beta }_{{\mathcal {I}}_{{{k}},1}^{l}}^{*}\Vert _{2} +\frac{\Vert \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{\epsilon }_{{k}}\Vert _{2}}{n c_{-}(s)}. \end{aligned} \end{aligned}$$

By (14) and (15), we can deduce that

$$\begin{aligned} \begin{aligned}&\Vert \varvec{\beta }_{{k}}^{l}-\varvec{\beta }^{*}\Vert _{2} \leqslant \frac{1+\frac{\theta _{s, s}}{c_{-}(s)}}{(1-\Delta ) n\left( c_{-}(s) -\frac{(\theta _{s, s})^{2}}{c_{-}(s)}\right) } \rho _{s}^{l}\\&\quad \vert 2 n {\mathcal {L}}_{n}(\varvec{\beta }^{0}) -2 n {\mathcal {L}}_{n}(\varvec{\beta }^{*})\vert \\&\quad +\frac{\Vert \varvec{X}_{{{k}}, {\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{\epsilon }_{{k}}\Vert _{2}}{n c_{-}(s)} \\&\quad \leqslant \frac{1+\frac{\theta _{s, s}}{c_{-}(s)}}{(1-\Delta ) n\left( c_{-}(s) -\frac{(\theta _{s, s})^2}{c_{-}(s)}\right) } \rho _{s}^{l}\Vert \varvec{y}_{{k}}\Vert _{2}^{2}\\&\quad +\frac{\Vert \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{\epsilon }_{{k}}\Vert _{2}}{n c_{-}(s)}. \end{aligned} \end{aligned}$$

Using sub-Gaussian condition of $\varvec{\epsilon }_k$, with probability $1-O(p^{-c_0})- \gamma _{{s}}(n,p)$, where $c_0$ is some arbitrarily large, we can show that

$$\begin{aligned} \begin{aligned} \Vert \varvec{\beta }_{{k}}^{l}-\varvec{\beta }^{*}\Vert _{2}&\leqslant \frac{1+\frac{\theta _{{s}, s}}{c_{-}(s)}}{(1-\Delta ) n\left( c_{-}(s) -\frac{(\theta _{s, s}){2}}{c_{-}(s)}\right) } \rho _{s}^{l}\Vert \varvec{y}_{{k}}\Vert _{2}^{2}\\&+\frac{c\sqrt{s\log p}}{\sqrt{n} c_{-}(s)}, \end{aligned} \end{aligned}$$

where c is a constant. We have

$$\begin{aligned} \begin{aligned} \Vert \hat{\varvec{\beta }}_{{k}}-\varvec{\beta }^{*}\Vert _{2} \leqslant C\frac{\sqrt{s\log p}}{\sqrt{n}}, \end{aligned} \end{aligned}$$

where C is a constant. $\square $

B Add Health data for variable selection result

Tables 6 and 7 show the detailed variable selection results.

Table 6 Variable selection results by Central-ABESS and DC-DBESS

Full size table

Table 7 Variable selection results by Central-ABESS and DC-DBESS

Full size table

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Chen, Y., Dong, R. & Wen, C. Communication-efficient estimation for distributed subset selection. Stat Comput 33, 141 (2023). https://doi.org/10.1007/s11222-023-10302-7

Download citation

Received: 02 July 2022
Accepted: 26 September 2023
Published: 16 October 2023
DOI: https://doi.org/10.1007/s11222-023-10302-7

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Communication-efficient estimation for distributed subset selection

Abstract

Access this article

Similar content being viewed by others

Imbalanced data preprocessing techniques for machine learning: a systematic mapping study

Stratified random sampling from streaming and stored data

Entropy-Based Subsampling Methods for Big Data

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher's Note

Appendices

A Proofs of main results

Lemma 2

1.1 A.2 Proof of Lemma 1

Proof

1.2 A.2 Proof of Theorem 1

Proof

1.3 A.3 Proof of Theorem 2

Proof

1.4 A.4 Proof of Lemma 2

Proof

B Add Health data for variable selection result

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Communication-efficient estimation for distributed subset selection

Abstract

Access this article

Similar content being viewed by others

Imbalanced data preprocessing techniques for machine learning: a systematic mapping study

Stratified random sampling from streaming and stored data

Entropy-Based Subsampling Methods for Big Data

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher's Note

Appendices

A Proofs of main results

Lemma 2

1.1 A.2 Proof of Lemma 1

Proof

1.2 A.2 Proof of Theorem 1

Proof

1.3 A.3 Proof of Theorem 2

Proof

1.4 A.4 Proof of Lemma 2

Proof

B Add Health data for variable selection result

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation