Abstract
Researchers are increasingly confronted with massive datasets whose analysis is constrained by the limits of computer primary memory. Modal regression (MR) is an attractive alternative to mean regression and likelihood-based methods because of its robustness and high efficiency. The authors extend MR to massive data analysis and propose a computationally and statistically efficient divide-and-conquer MR method (DC-MR). The method splits the entire dataset into several blocks, fits MR on each block, and combines the block-level estimates via a weighted average, yielding an approximation to the estimate that would be obtained from the full dataset. This substantially reduces the required primary memory, and the resulting estimator is asymptotically as efficient as traditional MR applied to the entire dataset. The authors also develop a multiple-hypothesis-testing variable selection approach to identify significant parametric components and prove that it possesses the oracle property. In addition, they propose a practical modified modal expectation-maximization (MEM) algorithm for the proposed procedures. Numerical studies on simulated and real datasets assess and demonstrate the performance of the proposed methods.
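To make the divide-and-conquer idea concrete, the following is a minimal Python sketch of the two ingredients the abstract describes: a modal-EM (MEM) fit of modal linear regression on a single block, and a size-weighted average of the block estimates. It assumes a Gaussian kernel with a fixed bandwidth h and weights proportional to block size; the paper's modified MEM algorithm and its weighting scheme may differ, and the function names here are illustrative, not from the paper.

```python
import numpy as np

def mem_modal_fit(X, y, h, n_iter=100, tol=1e-8):
    # Modal linear regression on one block via an MEM-style loop:
    # E-step: Gaussian-kernel weights on the current residuals;
    # M-step: weighted least-squares refit of beta.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS warm start
    for _ in range(n_iter):
        resid = y - X @ beta
        w = np.exp(-0.5 * (resid / h) ** 2)       # kernel weights (unnormalized)
        XtW = X.T * w                             # equals X^T W for diagonal W
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

def dc_modal_regression(X, y, n_blocks, h):
    # Divide and conquer: fit MEM on each block, then combine the
    # block estimates by a block-size-weighted average (an assumed
    # choice; the paper uses a weighted average whose exact weights
    # are not specified in the abstract).
    blocks = np.array_split(np.arange(len(y)), n_blocks)
    betas = np.stack([mem_modal_fit(X[idx], y[idx], h) for idx in blocks])
    weights = np.array([len(idx) for idx in blocks]) / len(y)
    return np.average(betas, axis=0, weights=weights)

# Illustrative use on heavy-tailed synthetic data:
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(10_000), rng.normal(size=(10_000, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_t(df=3, size=10_000)
print(dc_modal_regression(X, y, n_blocks=10, h=0.5))
```

Only one block of data needs to reside in memory at a time, which is the source of the memory savings; because each block estimator is consistent for the same target, averaging them can recover (under suitable conditions) the efficiency of fitting MR on the full dataset.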
Additional information
This research was supported by the Fundamental Research Funds for the Central Universities under Grant No. JBK1806002 and the National Natural Science Foundation of China under Grant No. 11471264.
Cite this article
Jin, J., Liu, S. & Ma, T. Distributed Penalized Modal Regression for Massive Data. J Syst Sci Complex 36, 798–821 (2023). https://doi.org/10.1007/s11424-022-1197-2