Abstract
Distributed optimization for big data has recently attracted enormous attention. However, the existing algorithms all rely on one critical randomness condition, namely that the big data are randomly distributed across the machines. This condition is seldom met in practice, and violating it can seriously degrade the estimation accuracy. To fix this problem, we propose an optimization framework based on a pilot-dataset surrogate loss function, which achieves communication-efficient distributed optimization for non-randomly distributed big data. Furthermore, we apply it to penalized high-dimensional sparse learning problems by combining it with penalty functions. Theoretical properties and numerical results both confirm the good performance of the proposed methods.
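To make the idea concrete, below is a minimal sketch (not the authors' code) of a pilot-sample surrogate-loss update for distributed least squares. It assumes a CSL-type surrogate of the form L_pilot(theta) - <grad L_pilot(theta0) - grad L_global(theta0), theta>, with the pilot data drawn uniformly at random over all machines to offset the non-random storage; the function names, the closed-form update, and the toy data layout are illustrative assumptions rather than details taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

def local_grad(X, y, theta):
    # Gradient of the average squared-error loss on one machine's data.
    return X.T @ (X @ theta - y) / X.shape[0]

def pilot_surrogate_update(machines, Xp, yp, theta0):
    # One communication round: every machine sends its local gradient at theta0.
    # The pilot machine then minimizes the (assumed) CSL-type surrogate
    #   L_pilot(theta) - <grad_pilot(theta0) - grad_global(theta0), theta>,
    # which for squared loss has the closed-form solution computed below.
    sizes = np.array([X.shape[0] for X, _ in machines], dtype=float)
    grads = np.stack([local_grad(X, y, theta0) for X, y in machines])
    grad_global = (sizes[:, None] * grads).sum(axis=0) / sizes.sum()
    shift = local_grad(Xp, yp, theta0) - grad_global
    n_pilot = Xp.shape[0]
    return np.linalg.solve(Xp.T @ Xp / n_pilot, Xp.T @ yp / n_pilot + shift)

# Toy example with non-randomly stored data: observations are sorted by the
# first covariate before being split across K machines.
N, p, K = 20000, 5, 20
beta = np.ones(p)
X = rng.normal(size=(N, p))
y = X @ beta + rng.normal(size=N)
order = np.argsort(X[:, 0])                              # non-random storage
machines = [(X[idx], y[idx]) for idx in np.array_split(order, K)]

# Pilot sample: a small subsample drawn uniformly at random over all machines.
pilot_idx = rng.choice(N, size=500, replace=False)
Xp, yp = X[pilot_idx], y[pilot_idx]

theta0 = np.linalg.lstsq(Xp, yp, rcond=None)[0]          # pilot-only initial estimate
theta1 = pilot_surrogate_update(machines, Xp, yp, theta0)
print("pilot-only error:", np.linalg.norm(theta0 - beta))
print("one-step surrogate error:", np.linalg.norm(theta1 - beta))

The pilot-only fit uses just the small random subsample, while the one-step surrogate update folds in a single round of local gradients from every machine, which is the source of the communication efficiency in this kind of construction.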



Author information
Authors and Affiliations
Contributions
Kangning Wang designed the new methods and wrote the main manuscript text; Shaomin Li developed the theoretical framework and prepared the numerical experiments. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This research is supported by the National Natural Science Foundation of China (Grant No. 12101056), the National Statistical Science Research Project (Grant No. 2022LY040), and the Talent Fund of Beijing Jiaotong University (2023XKRC008).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, K., Li, S. Distributed statistical optimization for non-randomly stored big data with application to penalized learning. Stat Comput 33, 73 (2023). https://doi.org/10.1007/s11222-023-10247-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11222-023-10247-x