Abstract
Generalized linear models are a popular analytics tool with interpretable results and broad applicability, but they require iterative estimation procedures that impose data-transfer and computational costs which can be problematic under some infrastructure constraints. We propose a doubly-sketched approximation of the iteratively re-weighted least squares algorithm that estimates generalized linear model parameters using a sequence of surrogate datasets. The procedure sketches once to reduce data-transfer costs, and sketches again to reduce computation costs, yielding wall-clock time savings. It produces both regression coefficients and standard errors, which we compare against methods from the literature. Asymptotic properties of the proposed procedure are established, with empirical results from simulated and real-world datasets. The efficacy of the proposed method is investigated across a variety of commodity computational infrastructure configurations accessible to practitioners. A highlight of the present work is the estimation of a Poisson-log generalized linear model across almost 1.7 billion observations on a personal computer in 25 minutes.
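The two-level sketching described in the abstract can be illustrated with a minimal Python sketch. This is a hypothetical rendering, not the paper's exact construction: the function name is invented, the first sketch is taken here as uniform subsampling (standing in for the data-transfer reduction), and the second sketch as a per-iteration Gaussian random projection of the weighted least-squares system (standing in for the computation reduction), shown for a Poisson-log GLM.

```python
import numpy as np

def sketched_irls_poisson(X, y, m1, m2, iters=20, rng=None):
    """Illustrative doubly-sketched IRLS for a Poisson GLM with log link.

    First sketch: compress the n-row dataset once to an m1-row surrogate.
    Second sketch: at each IRLS iteration, project the m1-row weighted
    system down to m2 rows before solving the least-squares step.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # First sketch: uniform subsample as the surrogate dataset
    idx = rng.choice(n, size=m1, replace=False)
    Xs, ys = X[idx], y[idx]
    beta = np.zeros(p)
    for _ in range(iters):
        eta = Xs @ beta
        mu = np.exp(eta)                 # inverse of the log link
        W = mu                           # IRLS weights for Poisson-log
        z = eta + (ys - mu) / mu         # working response
        # Weighted least-squares system: minimise ||A beta - b||
        sqW = np.sqrt(W)
        A = sqW[:, None] * Xs
        b = sqW * z
        # Second sketch: Gaussian projection of the working system
        S2 = rng.standard_normal((m2, m1)) / np.sqrt(m2)
        beta, *_ = np.linalg.lstsq(S2 @ A, S2 @ b, rcond=None)
    return beta
```

Because the second sketch is redrawn at each iteration, the iterates fluctuate around the surrogate solution with noise that shrinks as m2 grows; the actual procedure, its sketch distributions, and its standard-error construction are developed in the paper itself.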
Availability of data and materials
The real-world datasets are publicly available, and the construction of the simulated datasets is described.
Code Availability
Code is available upon request.
Funding
None.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
Not applicable.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Appendix A
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hou-Liu, J., Browne, R.P. Generalized linear models for massive data via doubly-sketching. Stat Comput 33, 105 (2023). https://doi.org/10.1007/s11222-023-10274-8