Abstract
Researchers are increasingly confronted with massive datasets whose analysis is constrained by the limits of computer primary memory. Modal regression (MR) is an attractive alternative to mean regression and likelihood-based methods because of its robustness and high efficiency. The authors extend MR to massive data analysis and propose a computationally and statistically efficient divide-and-conquer MR method (DC-MR). The method splits the entire dataset into several blocks, fits MR on each block, and combines the block-level estimates via a weighted average, yielding an approximation to the estimate that would be obtained from the full dataset. This substantially reduces the required primary memory, and the resulting estimator is asymptotically as efficient as traditional MR applied to the entire dataset. The authors also develop a multiple-hypothesis-testing variable selection approach to identify significant parametric components and prove that it possesses the oracle property. In addition, they propose a practical modified modal expectation-maximization (MEM) algorithm for the proposed procedures. Numerical studies on simulated and real datasets assess and demonstrate the performance of the proposed methods.
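To make the divide-and-conquer idea concrete, the following is a minimal Python sketch of the two ingredients the abstract describes: a modal-EM (MEM) fit of modal linear regression on a single block, and a size-weighted average of the block estimates. It assumes a Gaussian kernel with a fixed bandwidth h and weights proportional to block size; the paper's modified MEM algorithm and its weighting scheme may differ, and the function names here are illustrative, not from the paper.

```python
import numpy as np

def mem_modal_fit(X, y, h, n_iter=100, tol=1e-8):
    # Modal linear regression on one block via an MEM-style loop:
    # E-step: Gaussian-kernel weights on the current residuals;
    # M-step: weighted least-squares refit of beta.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS warm start
    for _ in range(n_iter):
        resid = y - X @ beta
        w = np.exp(-0.5 * (resid / h) ** 2)       # kernel weights (unnormalized)
        XtW = X.T * w                             # equals X^T W for diagonal W
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

def dc_modal_regression(X, y, n_blocks, h):
    # Divide and conquer: fit MEM on each block, then combine the
    # block estimates by a block-size-weighted average (an assumed
    # choice; the paper uses a weighted average whose exact weights
    # are not specified in the abstract).
    blocks = np.array_split(np.arange(len(y)), n_blocks)
    betas = np.stack([mem_modal_fit(X[idx], y[idx], h) for idx in blocks])
    weights = np.array([len(idx) for idx in blocks]) / len(y)
    return np.average(betas, axis=0, weights=weights)

# Illustrative use on heavy-tailed synthetic data:
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(10_000), rng.normal(size=(10_000, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_t(df=3, size=10_000)
print(dc_modal_regression(X, y, n_blocks=10, h=0.5))
```

Only one block of data needs to reside in memory at a time, which is the source of the memory savings; because each block estimator is consistent for the same target, averaging them can recover (under suitable conditions) the efficiency of fitting MR on the full dataset.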
Additional information
This research was supported by the Fundamental Research Funds for the Central Universities under Grant No. JBK1806002 and the National Natural Science Foundation of China under Grant No. 11471264.
Cite this article
Jin, J., Liu, S. & Ma, T. Distributed Penalized Modal Regression for Massive Data. J Syst Sci Complex 36, 798–821 (2023). https://doi.org/10.1007/s11424-022-1197-2