Abstract
Distributed optimization for big data has recently attracted enormous attention. However, the existing algorithms all rely on one critical randomness condition, namely that the big data are randomly distributed across the machines. This condition is seldom met in practice, and violating it can seriously degrade the estimation accuracy. To fix this problem, we propose an optimization framework based on a pilot-dataset surrogate loss function, which achieves communication-efficient distributed optimization for non-randomly distributed big data. Furthermore, we apply it to penalized high-dimensional sparse learning problems by combining it with penalty functions. Theoretical properties and numerical results both confirm the good performance of the proposed methods.
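To make the idea concrete, below is a minimal sketch (not the authors' code) of a pilot-sample surrogate-loss update for distributed least squares. It assumes a CSL-type surrogate of the form L_pilot(theta) - <grad L_pilot(theta0) - grad L_global(theta0), theta>, with the pilot data drawn uniformly at random over all machines to offset the non-random storage; the function names, the closed-form update, and the toy data layout are illustrative assumptions rather than details taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

def local_grad(X, y, theta):
    # Gradient of the average squared-error loss on one machine's data.
    return X.T @ (X @ theta - y) / X.shape[0]

def pilot_surrogate_update(machines, Xp, yp, theta0):
    # One communication round: every machine sends its local gradient at theta0.
    # The pilot machine then minimizes the (assumed) CSL-type surrogate
    #   L_pilot(theta) - <grad_pilot(theta0) - grad_global(theta0), theta>,
    # which for squared loss has the closed-form solution computed below.
    sizes = np.array([X.shape[0] for X, _ in machines], dtype=float)
    grads = np.stack([local_grad(X, y, theta0) for X, y in machines])
    grad_global = (sizes[:, None] * grads).sum(axis=0) / sizes.sum()
    shift = local_grad(Xp, yp, theta0) - grad_global
    n_pilot = Xp.shape[0]
    return np.linalg.solve(Xp.T @ Xp / n_pilot, Xp.T @ yp / n_pilot + shift)

# Toy example with non-randomly stored data: observations are sorted by the
# first covariate before being split across K machines.
N, p, K = 20000, 5, 20
beta = np.ones(p)
X = rng.normal(size=(N, p))
y = X @ beta + rng.normal(size=N)
order = np.argsort(X[:, 0])                              # non-random storage
machines = [(X[idx], y[idx]) for idx in np.array_split(order, K)]

# Pilot sample: a small subsample drawn uniformly at random over all machines.
pilot_idx = rng.choice(N, size=500, replace=False)
Xp, yp = X[pilot_idx], y[pilot_idx]

theta0 = np.linalg.lstsq(Xp, yp, rcond=None)[0]          # pilot-only initial estimate
theta1 = pilot_surrogate_update(machines, Xp, yp, theta0)
print("pilot-only error:", np.linalg.norm(theta0 - beta))
print("one-step surrogate error:", np.linalg.norm(theta1 - beta))

The pilot-only fit uses just the small random subsample, while the one-step surrogate update folds in a single round of local gradients from every machine, which is the source of the communication efficiency in this kind of construction.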



Author information
Authors and Affiliations
Contributions
Kangning Wang designed the new methods and wrote the main manuscript text; Shaomin Li developed the theoretical framework and prepared the numerical experiments. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This research is supported by the National Natural Science Foundation of China (Grant No. 12101056), the National Statistical Science Research Project (Grant No. 2022LY040), and the Talent Fund of Beijing Jiaotong University (2023XKRC008).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, K., Li, S. Distributed statistical optimization for non-randomly stored big data with application to penalized learning. Stat Comput 33, 73 (2023). https://doi.org/10.1007/s11222-023-10247-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11222-023-10247-x