Distributed Penalized Modal Regression for Massive Data

Journal of Systems Science and Complexity

Abstract

Researchers are frequently confronted with challenges in massive data computing arising from the limited primary memory of computers. Modal regression (MR) is a good alternative to mean regression and likelihood-based methods because of its robustness and high efficiency. The authors therefore extend MR to massive data analysis and propose a computationally and statistically efficient divide-and-conquer MR method (DC-MR). The method splits the entire dataset into several blocks, applies MR to the data in each block, and combines the blockwise estimates via a weighted average, which approximates the estimate that MR would produce on the entire dataset. The proposed method significantly reduces the required amount of primary memory, and the resulting estimator is theoretically as efficient as traditional MR on the full dataset. The authors also investigate a multiple-hypothesis-testing variable selection approach to identify significant parametric components and prove that it possesses the oracle property. In addition, the authors propose a practical modified modal expectation-maximization (MEM) algorithm for the proposed procedures. Numerical studies on simulated and real datasets assess and showcase the practical performance of the proposed methods.
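
The divide-and-conquer scheme described above can be illustrated with a short sketch. The Python code below is a minimal illustration, assuming modal linear regression fitted by a Gaussian-kernel MEM-style iteration (in the spirit of Yao and Li's modal linear regression, reference [40]) and an equal-weight average across blocks; the paper's actual weighting scheme, bandwidth choice, and penalized variable-selection step are not reproduced, so all function names and defaults here are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def mem_modal_fit(X, y, h, n_iter=200, tol=1e-8):
    """Modal linear regression on one block via an MEM-style iteration.

    E-step: weight each observation by a Gaussian kernel of its residual,
    so points near the conditional mode dominate.
    M-step: solve the resulting weighted least-squares problem.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS warm start
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.exp(-0.5 * (r / h) ** 2)           # Gaussian kernel weights
        XtW = X.T * w                             # scale column i of X.T by w_i
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

def dc_mr(X, y, n_blocks, h):
    """Divide-and-conquer MR: fit each block separately, then average.

    An equal-weight average stands in for the paper's weighted combination.
    """
    blocks = np.array_split(np.arange(len(y)), n_blocks)
    betas = [mem_modal_fit(X[idx], y[idx], h) for idx in blocks]
    return np.mean(betas, axis=0)

# Toy usage: heavy-tailed (Student-t) noise, where a mode-based fit is robust.
rng = np.random.default_rng(0)
n, p = 10_000, 5
X = rng.normal(size=(n, p))
beta_true = np.arange(1.0, p + 1.0)
y = X @ beta_true + rng.standard_t(df=3, size=n)
print(dc_mr(X, y, n_blocks=10, h=1.0))
```

Because each block is fitted independently, only one block needs to reside in primary memory at a time, which is the point of the divide-and-conquer construction.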

Author information


Correspondence to Jun Jin.

Additional information

This research was supported by the Fundamental Research Funds for the Central Universities under Grant No. JBK1806002 and the National Natural Science Foundation of China under Grant No. 11471264.


About this article

Cite this article

Jin, J., Liu, S. & Ma, T. Distributed Penalized Modal Regression for Massive Data. J Syst Sci Complex 36, 798–821 (2023). https://doi.org/10.1007/s11424-022-1197-2

