
Distributed statistical optimization for non-randomly stored big data with application to penalized learning

  • Original Paper
  • Published in: Statistics and Computing

Abstract

Distributed optimization for big data has recently attracted enormous attention. However, existing algorithms all rely on one critical randomness condition, namely that the big data are randomly distributed across different machines. This condition is seldom satisfied in practice, and violating it can seriously degrade estimation accuracy. To fix this problem, we propose an optimization framework based on a surrogate loss function built from a pilot dataset, which achieves communication-efficient distributed optimization for non-randomly distributed big data. We further apply the framework to penalized high-dimensional sparse learning problems by combining it with penalty functions. Theoretical properties and numerical results both confirm the good performance of the proposed methods.
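To make the abstract's idea concrete, below is a minimal sketch (not the authors' code) of how a pilot-dataset surrogate loss can work, illustrated on distributed least squares: a small random subsample drawn across all machines serves as the pilot data, and a single round of gradient communication corrects the pilot loss toward the full-data loss. The specific surrogate form, variable names, and simulation setup are illustrative assumptions, not the paper's exact method.

```python
"""A minimal sketch, assuming a CSL-style surrogate: the full loss is replaced
by the pilot-sample loss plus a first-order correction built from one round of
gradient communication. All names and the simulation design are illustrative."""
import numpy as np

rng = np.random.default_rng(0)
p, K, n_k = 5, 10, 500          # dimension, machines, samples per machine
theta_true = rng.normal(size=p)

# Simulate non-randomly stored data: each machine k sees covariates with a
# machine-specific mean shift, so the shards are NOT random subsamples.
shards = []
for k in range(K):
    X = rng.normal(loc=k / K, size=(n_k, p))
    y = X @ theta_true + rng.normal(size=n_k)
    shards.append((X, y))

# Pilot dataset: a small random subsample drawn across ALL machines, which
# restores the randomness that plain divide-and-conquer averaging relies on.
pilot = np.vstack([np.hstack([X, y[:, None]])[rng.choice(n_k, 30, replace=False)]
                   for X, y in shards])
Xp, yp = pilot[:, :p], pilot[:, p]

def grad(X, y, theta):
    # Gradient of the average squared-error loss on one data block.
    return X.T @ (X @ theta - y) / X.shape[0]

# Initial estimate from the pilot sample alone.
theta0 = np.linalg.lstsq(Xp, yp, rcond=None)[0]

# One communication round: each machine sends only its local gradient at theta0.
global_grad = np.mean([grad(X, y, theta0) for X, y in shards], axis=0)

# Surrogate loss  L~(theta) = L_pilot(theta) + <global_grad - pilot_grad(theta0), theta>;
# its gradient matches the full-data gradient at theta0, and for least squares
# the minimizer has a closed form.
correction = global_grad - grad(Xp, yp, theta0)
theta_hat = np.linalg.solve(Xp.T @ Xp / Xp.shape[0],
                            Xp.T @ yp / Xp.shape[0] - correction)

print("pilot-only error   :", np.linalg.norm(theta0 - theta_true))
print("surrogate estimator:", np.linalg.norm(theta_hat - theta_true))
```

Only one vector of gradients per machine crosses the network per round, which is what makes this style of surrogate communication-efficient; a penalized variant would add a sparsity penalty such as the LASSO to the surrogate loss before minimizing.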



Author information


Contributions

Kangning Wang designed the new methods and wrote the main manuscript text; Shaomin Li developed the theoretical framework and prepared the numerical experiments. All authors reviewed the manuscript.

Corresponding author

Correspondence to Shaomin Li.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research is supported by the National Natural Science Foundation of China (Grant No. 12101056), the National Statistical Science Research Project (Grant No. 2022LY040), and the Talent Fund of Beijing Jiaotong University (2023XKRC008).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, K., Li, S. Distributed statistical optimization for non-randomly stored big data with application to penalized learning. Stat Comput 33, 73 (2023). https://doi.org/10.1007/s11222-023-10247-x

