Abstract
Due to the large scale of both sample size and dimension, modern data are usually stored in distributed systems, which poses unprecedented challenges for computation and statistical inference. Best subset selection is widely regarded as a benchmark method for handling high-dimensional data. However, efficient algorithms for best subset selection in distributed systems remain understudied. To this end, we propose a new communication-efficient method for best subset selection in a distributed system. The proposed method restricts the information communicated among local machines to a moderate active set, which leads not only to efficient computation but also to a lower communication cost across the network of the distributed system. Moreover, we propose a new generalized information criterion for tuning the sparsity level on the central machine. Under mild conditions, we establish estimation and variable selection consistency for the proposed estimator. We demonstrate the superiority of the proposed method through several numerical studies and a real data application in adolescent health.
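The workflow described in the abstract (local selection of a moderate active set, cheap communication of active sets to the central machine, and a pooled refit) can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the local best-subset solver is replaced by a simple forward-selection stand-in, and the helper names `local_active_set` and `distributed_best_subset` are hypothetical.

```python
import numpy as np

def local_active_set(X, y, s):
    # Stand-in for a local best-subset solver (e.g. ABESS):
    # greedily pick s features by forward selection.
    active, residual = [], y.copy()
    for _ in range(s):
        # score each remaining feature by correlation with the residual
        scores = np.abs(X.T @ residual)
        scores[active] = -np.inf
        active.append(int(np.argmax(scores)))
        # refit least squares on the current active set
        beta_a, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        residual = y - X[:, active] @ beta_a
    return set(active)

def distributed_best_subset(splits, s):
    # Each machine communicates only its active set; the central
    # machine takes the union and refits by least squares on it.
    union = sorted(set().union(*[local_active_set(X, y, s) for X, y in splits]))
    XA = np.vstack([X[:, union] for X, _ in splits])
    yy = np.concatenate([y for _, y in splits])
    beta_a, *_ = np.linalg.lstsq(XA, yy, rcond=None)
    beta = np.zeros(splits[0][0].shape[1])
    beta[union] = beta_a
    return beta, union
```

In this sketch only the index sets travel over the network; the raw data stay on the workers, which is the source of the communication savings.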
References
Battey, H., Fan, J., Liu, H., Lu, J., Zhu, Z.: Distributed testing and estimation under sparse high dimensional models. Ann. Stat. 46(3), 1352–1382 (2018)
Bertsimas, D., King, A., Mazumder, R.: Best subset selection via a modern optimization lens. Ann. Stat. 44(2), 813–852 (2016)
Burrows, M.: The chubby lock service for loosely-coupled distributed systems. In: 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2006)
Cai, T., Liu, W., Luo, X.: A constrained \(\ell _1\) minimization approach to sparse precision matrix estimation. J. Am. Stat. Assoc. 106(494), 594–607 (2011)
Candes, E.J., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: Amazon’s highly available key-value store. ACM SIGOPS Oper. Syst. Rev. 41(6), 205–220 (2007)
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–499 (2004)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)
Fan, Y., Lv, J.: Asymptotic properties for combined \(\ell _1\) and concave regularization. Biometrika 101(1), 57–70 (2014)
Fan, Y., Lv, J.: Innovated scalable efficient estimation in ultra-large Gaussian graphical models. Ann. Stat. 44(5), 2098–2126 (2016)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)
Huang, J., Jiao, Y., Liu, Y., Lu, X.: A constructive approach to \(l_0\) penalized regression. J. Mach. Learn. Res. 19(10), 1–37 (2018)
Jordan, M.I., Lee, J., Yang, Y.: Communication-efficient distributed statistical inference. J. Am. Stat. Assoc. 114, 668–681 (2018)
Lee, J.D., Liu, Q., Sun, Y., Taylor, J.E.: Communication-efficient sparse regression. J. Mach. Learn. Res. 18(1), 115–144 (2017)
Mullan, K.H.: The National Longitudinal Study of Adolescent to Adult Health (Add Health), Waves I & II, 1994–1996; Wave III, 2001–2002; Wave IV, 2007–2009 [machine-readable data file and documentation]. Carolina Population Center, University of North Carolina at Chapel Hill, Chapel Hill, NC (2009)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 58(1), 267–288 (1996)
Toshniwal, A., Taneja, S., Shukla, A., Ramasamy, K., Patel, J.M., Kulkarni, S., Jackson, J., Gade, K., Fu, M., Donham, J., et al.: Storm@Twitter. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 147–156 (2014)
Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. In: Compressed Sensing (2012)
Wang, J., Kolar, M., Srebro, N., Zhang, T.: Efficient distributed learning with sparsity. In: Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 3636–3645 (2017)
Wen, C., Zhang, A., Quan, S., Wang, X.: BeSS: an R package for best subset selection in linear, logistic and Cox proportional hazards models. J. Stat. Softw. 94(4), 1–24 (2020)
Zhang, C.-H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)
Zhang, C.-H., Huang, J.: The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann. Stat. 36(4), 1567–1594 (2008)
Zhang, Y., Duchi, J.C., Wainwright, M.J.: Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res. 14(1), 3321–3363 (2013)
Zheng, Z., Fan, Y., Lv, J.: High dimensional thresholded regression and shrinkage effect. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 76(3), 627–649 (2014)
Zheng, Z., Bahadori, M.T., Liu, Y., Lv, J.: Scalable interpretable multi-response regression via seed. J. Mach. Learn. Res. 20(107), 1–34 (2019)
Zhu, J., Wen, C., Zhu, J., Zhang, H., Wang, X.: A polynomial algorithm for best-subset selection problem. Proc. Natl. Acad. Sci. 117(52), 33117–33123 (2020)
Zhu, X., Li, F., Wang, H.: Least-square approximation for a distributed system. J. Comput. Graph. Stat. 30(4), 1004–1018 (2021)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)
Acknowledgements
Wen’s research is partially supported by the National Natural Science Foundation of China (Grant 12171449), the Fundamental Research Funds for the Central Universities (Grant WK3470000027), and the USTC Research Funds of the Double First-Class Initiative (Grant YD2040002019). Dong’s research is supported by the China Postdoctoral Science Foundation (Grant 2023M733402).
Author information
Authors and Affiliations
Contributions
Yan Chen, Ruipeng Dong and Canhong Wen wrote the main manuscript text. Yan Chen prepared the simulation results and data analysis.
Corresponding author
Ethics declarations
Competing interests
The authors have no competing interests, and no relevant financial or non-financial interests, to disclose in relation to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Proofs of main results
To ease the presentation, we first define some notation. On the k-th distributed machine, let \({\mathcal {A}}_{{k}}^l\) and \({\mathcal {I}}_{{k}}^l\) denote the active set and the inactive set in the l-th iteration of the ABESS algorithm. The true active set is \({\mathcal {A}}^{*}\) and the true inactive set is \({\mathcal {I}}^{*}\). Let
In the l-th iteration, \({\mathcal {A}}_{{k},1}^l\) denotes the correctly selected active elements, \({\mathcal {A}}^l_{{k},2}\) the incorrectly selected active elements, \({\mathcal {I}}^l_{{k},1}\) the missed active elements, and \({\mathcal {I}}^l_{{k},2}\) the correctly selected inactive elements. We write \(\varvec{\beta }^*\) for the true coefficient vector, \(\bar{\varvec{\beta }}^*=\varvec{\beta }^*_{{\mathcal {A}}}\), and \(\varvec{\beta }^l_{k}\) for the l-th iterate.
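As a concrete illustration of this four-way decomposition (with hypothetical index sets chosen for the example), the identities \({\mathcal {A}}^*={\mathcal {A}}_{{k},1}^l\cup {\mathcal {I}}_{{k},1}^l\) and \({\mathcal {A}}_{k}^l={\mathcal {A}}_{{k},1}^l\cup {\mathcal {A}}_{{k},2}^l\) invoked in the proofs can be checked with plain set operations:

```python
# Hypothetical index sets for illustration: p = 8 features,
# true active set A* and the l-th estimate A_k^l on machine k.
A_star = {0, 1, 2}                 # true active set A*
A_kl = {0, 1, 5}                   # estimated active set A_k^l
I_kl = set(range(8)) - A_kl        # estimated inactive set I_k^l

A1 = A_kl & A_star                 # A_{k,1}^l: correctly selected active
A2 = A_kl - A_star                 # A_{k,2}^l: incorrectly selected active
I1 = I_kl & A_star                 # I_{k,1}^l: missed active
I2 = I_kl - A_star                 # I_{k,2}^l: correctly selected inactive

assert A1 | I1 == A_star           # A* = A_{k,1}^l ∪ I_{k,1}^l
assert A1 | A2 == A_kl             # A_k^l = A_{k,1}^l ∪ A_{k,2}^l
```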
We also present one additional lemma; the proof of Lemma 2 is provided in Appendix A.4.
Lemma 2
Suppose \((\hat{\varvec{\beta }}_{k},\hat{{\mathcal {A}}}_{k})\) is the output of the k-th worker of ABESS with \(s\geqslant s^*\), and assume Conditions 1, 2, 4, and 5 hold, for \({{k}}=1,2,\cdots ,K\). Denote \(\gamma _{{s}}(n,p)=O(p\exp (-\frac{C_snb^*}{s}))\), where \(C_s\) is a constant depending on s. Then with probability at least \(1-O(p^{-c_0})-\gamma _{{s}}(n,p)\), we have
A.1 Proof of Lemma 1
Proof
On each distributed machine, by Lemma 1 of Zhu et al. (2020), we can bound the probability of missing active elements:
Consider the probability of missing elements on the central machine.
Across machines, the events of missing active elements are mutually independent. It follows that
where \(C_{s}\) is a constant depending on s, and the last equality follows from \(N=nK\). Define \(\gamma _{{s},K}(N,p)=O(p^K\exp (-\frac{C_{s}Nb^*}{{s}}))\). Then with probability \(1-\gamma _{{s},K}(N, p)\), we obtain
\(\square \)
A.2 Proof of Theorem 1
Proof
The conditions of Theorem 1 imply those of Lemma 1. Therefore, with probability \(1- \gamma _{{s},K}(N,p)\),
Combining this with the DC-DBESS algorithm, it holds that
The algorithm uses both the local derivative information and the local debiased information. Next, we analyze the convergence order of Algorithm 1.
In the distributed system, under the conditions of Lemma 1, with probability \(1-\gamma _{{s},K}(N,p)\) we obtain \({{\mathcal {A}}}=\bigcup \limits _{{{k}}=1}^{K}{\hat{{\mathcal {A}}}_{{k}}}\supseteq {\mathcal {A}}^{*}\), and hence \( \varvec{y}_{{k}}=\hat{ \varvec{X}}_{{k}}\bar{\varvec{\beta }}^*+\varvec{\epsilon }_{{k}}\). Thus,
Write \(\hat{\varvec{\Sigma }}_{{k}}=\hat{\varvec{X}}_{{{k}},{\mathcal {A}}}^T\hat{\varvec{X}}_{{{k}},{\mathcal {A}}}/n\) and define the bias \(\hat{\Delta }_{{k}}=(\varvec{M}_{{k}}\hat{\varvec{\Sigma }}_{{k}}-\varvec{I})(\bar{\varvec{\beta }}^*-\hat{\varvec{\beta }}_{{k}})\). Then, it follows that
Denote \(\hat{\varvec{D}}_{{{k}}}=\varvec{M}_{{{k}}} \hat{\varvec{X}}_{{{k}}}^T\in {\mathbb {R}}^{m\times n}\), which has m rows. Let \(\hat{\varvec{D}}=(\hat{\varvec{D}}_1,\hat{\varvec{D}}_2,\cdots ,\hat{\varvec{D}}_K)\in {\mathbb {R}}^{m\times N}\) and \(\varvec{\epsilon }=(\varvec{\epsilon }_1,\varvec{\epsilon }_2,\cdots ,\varvec{\epsilon }_K)\). Then we can show that
By Condition 1 and Proposition 5.16 of Vershynin (2012), we have
Next, we consider the order of the bias \(\hat{\Delta }_{{k}}\).
Using the techniques of Cai et al. (2011) and Fan and Lv (2016), we have
By the equivalence of the norms and Lemma 2, it follows that
Therefore,
Thus, combining (12) and (13), we can further obtain the convergence of the algorithm. With probability \(1-O(Kp^{-c_0})-K\gamma _{{s}}(n,p)\), we have
Taking \(\lambda =c_1 \sqrt{\log m/N}\) for some sufficiently large constant \(c_1\), we have
\(\square \)
A.3 Proof of Theorem 2
Proof
We omit some details of the proof and list only its main steps; a detailed proof follows Theorem 4 of Zhu et al. (2020). As in the proof of Theorem 1, the conditions of Theorem 2 imply those of Lemma 1. With probability \(1-\gamma _{{s},K}(N,p)\), we can obtain
Let \(\hat{s}\) denote the cardinality of \(\hat{{\mathcal {A}}}\). For any \(\hat{s}\geqslant s^*\), combining the minimal signal strength \(b^*\geqslant c_2(\frac{\log m}{N})^{\frac{1}{2}}\) for some sufficiently large constant \(c_2\), we have
Denote \({\mathcal {L}}_{\hat{{\mathcal {A}}}}=\frac{1}{2N}\sum _{{{k}}=1}^K\Vert \varvec{{y}}_{{k}} -\varvec{\hat{X}}_{{k}}\varvec{\hat{\beta }}\Vert ^2\) and \({\mathcal {L}}_{{{\mathcal {A}}}^*}=\frac{1}{2N}\sum _{{{k}}=1}^K\Vert \varvec{{y}}_{{k}} -\varvec{\hat{X}}_{{k}}\varvec{{\beta }}^*\Vert ^2\). For a sufficiently large N, we have
On the other hand, if \(\hat{s}< s^*\), then using Conditions 2 and 3, for a sufficiently large N we have
Thus, with probability \(1-O(p^{-c_0})-\gamma _{{s},K}(n,p)\), it is easy to deduce that
\(\square \)
A.4 Proof of Lemma 2
Proof
Applying Theorem 3 of Zhu et al. (2020), with probability \(1- \gamma _{s}(n,p)\), we have
By \({\mathcal {A}}^*={\mathcal {A}}_{{{k}},1}^l\cup {\mathcal {I}}_{{{k}},1}^l\), we deduce that
By \({\mathcal {A}}_{{k}}^l={\mathcal {A}}_{{{k}},1}^l\cup {\mathcal {A}}_{{{k}},2}^l\) and \(\varvec{\beta }^*_{{\mathcal {A}}_{{{k}},2}^l}=0\), it follows that
We can use the triangle inequality for the \(\ell _2\) norm:
We compute the distributed estimator \(\varvec{\beta }_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{l}\) of the linear regression problem \(\varvec{y}_{{k}}=\varvec{X}_{{k}}\varvec{\beta }^{*}+\varvec{\epsilon }_{{k}}\) by least squares, i.e., \(\varvec{\beta }_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{l}=(\varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}})^{-1} \varvec{X}_{{{k}},{\mathcal {A}}_{{k}}^{l}}^{{T}}(\varvec{X}_{{k}} \varvec{\beta }^{*}+\varvec{\epsilon }_{{k}})\). We then get
Using (16), we have
Combining this with the SRC condition, we obtain
Thus,
By (14) and (15), we can deduce that
Using the sub-Gaussian condition on \(\varvec{\epsilon }_k\), with probability \(1-O(p^{-c_0})- \gamma _{{s}}(n,p)\), where \(c_0\) is an arbitrarily large constant, we can show that
where c is a constant. We have
where C is a constant. \(\square \)
B Add Health data for variable selection result
Tables 6 and 7 show the detailed variable selection results.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, Y., Dong, R. & Wen, C. Communication-efficient estimation for distributed subset selection. Stat Comput 33, 141 (2023). https://doi.org/10.1007/s11222-023-10302-7