An exact approach to ridge regression for big data

  • Original Paper
  • Computational Statistics

Abstract

Ridge regression is an important approach in linear regression when explanatory variables are highly correlated. Although closed-form expressions for the ridge regression estimators can be obtained via matrix operations once the observed data are standardized, they cannot be applied directly to big data: the entire data set cannot be loaded into the memory of a single computer, and it is hard to standardize the original observations. To overcome these difficulties, the present article proposes new methods and algorithms. The basic idea is to compute a matrix of sufficient statistics row by row. Once this matrix is derived, the original data are no longer needed. Because the entire data set is scanned only once, the proposed methods and algorithms can be extremely efficient in computing estimates of the ridge regression parameters. It is expected that the basic knowledge gained in this article will have a great impact on statistical approaches to big data.
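The single-scan idea described in the abstract can be illustrated with a minimal sketch. The article's exact formulas are not shown in this preview, so everything below is an assumption made for illustration: the function names `accumulate_stats` and `ridge_from_stats` are hypothetical, the sufficient statistics are taken to be the cross-product matrix of `[1, X, y]`, and the predictors are standardized with the population standard deviation.

```python
import numpy as np

def accumulate_stats(chunks, p):
    """Scan row chunks once and return the (p+2) x (p+2) cross-product
    matrix of Z = [1, X, y]. Only this matrix is kept in memory."""
    S = np.zeros((p + 2, p + 2))
    for X, y in chunks:
        Z = np.column_stack([np.ones(len(y)), X, y])
        S += Z.T @ Z
    return S

def ridge_from_stats(S, lam):
    """Ridge estimates on standardized predictors, computed from S alone.
    Assumption: columns are scaled by their population standard deviation."""
    p = S.shape[0] - 2
    n = S[0, 0]                                  # number of rows seen
    xbar = S[0, 1:p + 1] / n                     # column means of X
    ybar = S[0, p + 1] / n                       # mean of y
    Sxx = S[1:p + 1, 1:p + 1] - n * np.outer(xbar, xbar)  # centered X'X
    Sxy = S[1:p + 1, p + 1] - n * xbar * ybar             # centered X'y
    scale = np.sqrt(np.diag(Sxx) / n)            # column standard deviations
    R = Sxx / np.outer(scale, scale)             # X'X of the standardized data
    r = Sxy / scale                              # X'y of the standardized data
    beta_std = np.linalg.solve(R + lam * np.eye(p), r)
    beta = beta_std / scale                      # back to the original scale
    intercept = ybar - xbar @ beta
    return intercept, beta
```

In this sketch, `chunks` can be any iterable of `(X, y)` blocks, for example blocks read sequentially from disk. The penalty `lam` can be changed without rescanning the data, because `ridge_from_stats` only needs `S`.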

Acknowledgements

The authors appreciate the comments from the editor, an associate editor, and two anonymous reviewers, which significantly improved the quality of the article.

Author information

Corresponding author

Correspondence to Tonglin Zhang.

About this article

Cite this article

Zhang, T., Yang, B. An exact approach to ridge regression for big data. Comput Stat 32, 909–928 (2017). https://doi.org/10.1007/s00180-017-0731-5

  • DOI: https://doi.org/10.1007/s00180-017-0731-5
