Skip to main content
Log in

Generalized linear models for massive data via doubly-sketching

  • Original Paper
  • Published:
Statistics and Computing Aims and scope Submit manuscript

Abstract

Generalized linear models are a popular analytics tool with interpretable results and broad applicability, but require iterative estimation procedures that impose data transfer and computational costs that can be problematic under some infrastructure constraints. We propose a doubly-sketched approximation of the iteratively re-weighted least squares algorithm to estimate generalized linear model parameters using a sequence of surrogate datasets. The procedure sketches once to reduce data transfer costs, and sketches again to reduce data computation costs, yielding wall-clock time savings. Regression coefficients and standard errors are produced, with comparison against literature methods. Asymptotic properties of the proposed procedure are shown, with empirical results from simulated and real-world datasets. The efficacy of the proposed method is investigated across a variety of commodity computational infrastructure configurations accessible to practitioners. A highlight of the present work is the estimation of a Poisson-log generalized linear model across almost 1.7 billion observations on a personal computer in 25 min.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7

Similar content being viewed by others

Availability of data and materials

Real-world datasets are publicly available, and the construction of simulated datasets are provided.

Code Availability

Upon request.

References

Download references

Funding

None.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jason Hou-Liu.

Ethics declarations

Conflict of interest

Not applicable.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 1190 KB)

Appendix A

Appendix A

An appendix contains supplementary information that is not an essential part of the text itself but which may be helpful in providing a more comprehensive understanding of the research problem or it is information that is too cumbersome to be included in the body of the paper.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Hou-Liu, J., Browne, R.P. Generalized linear models for massive data via doubly-sketching. Stat Comput 33, 105 (2023). https://doi.org/10.1007/s11222-023-10274-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11222-023-10274-8

Keywords

Navigation