Abstract
Ridge regression is an important approach in linear regression when explanatory variables are highly correlated. Although closed-form expressions for the ridge regression estimators can be obtained through matrix operations once the observed data are standardized, they cannot be applied directly to big data, because the entire data set cannot be loaded into the memory of a single computer and it is hard to standardize the original observed data. To overcome these difficulties, the present article proposes new methods and algorithms. The basic idea is to compute a matrix of sufficient statistics row by row. Once this matrix is derived, the original data are no longer needed. Because the entire data set is scanned only once, the proposed methods and algorithms can be extremely efficient in computing estimates of the ridge regression parameters. The basic knowledge gained in this article is expected to have a great impact on statistical approaches to big data.
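The single-scan idea described in the abstract can be illustrated with a minimal sketch. It is not the article's algorithm, which the abstract does not spell out. It assumes the sufficient statistics are the row count, the column sums, and the cross-product matrix of the predictors and the response, that standardization uses the divisor n, and that the penalty is applied to the standardized predictors. The function names `accumulate` and `ridge_from_stats` are hypothetical.

```python
import numpy as np

def accumulate(chunks, p):
    """Scan (X, y) row blocks once, accumulating sums and cross-products."""
    n = 0
    s = np.zeros(p + 1)             # column sums of [X, y]
    S = np.zeros((p + 1, p + 1))    # cross-product matrix [X, y]^T [X, y]
    for X, y in chunks:
        Z = np.column_stack([X, y])
        n += Z.shape[0]
        s += Z.sum(axis=0)
        S += Z.T @ Z
    return n, s, S

def ridge_from_stats(n, s, S, lam):
    """Ridge fit on standardized predictors, using only the accumulated statistics."""
    mean = s / n
    C = S - n * np.outer(mean, mean)        # centered cross-products
    sd = np.sqrt(np.diag(C)[:-1] / n)       # predictor standard deviations (divisor n, assumed)
    Cxx = C[:-1, :-1] / np.outer(sd, sd)    # standardized X^T X
    Cxy = C[:-1, -1] / sd                   # standardized X^T (y - mean of y)
    p = len(sd)
    beta_std = np.linalg.solve(Cxx + lam * np.eye(p), Cxy)
    beta = beta_std / sd                    # coefficients on the original scale
    intercept = mean[-1] - mean[:-1] @ beta
    return intercept, beta
```

Because the estimates depend on the data only through `n`, `s`, and `S`, the penalty `lam` can be varied afterwards without rescanning the rows. Forming centered cross-products by subtraction can lose precision when the column means are large relative to the spread, so a production version would likely need a more stable update.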


Acknowledgements
The authors appreciate the comments from the editor, an associate editor, and two anonymous reviewers, which significantly improved the quality of the article.
Cite this article
Zhang, T., Yang, B. An exact approach to ridge regression for big data. Comput Stat 32, 909–928 (2017). https://doi.org/10.1007/s00180-017-0731-5
DOI: https://doi.org/10.1007/s00180-017-0731-5