Abstract
Generalized linear models are a popular analytics tool with interpretable results and broad applicability, but they require iterative estimation procedures that impose data-transfer and computational costs which can be problematic under some infrastructure constraints. We propose a doubly-sketched approximation of the iteratively re-weighted least squares algorithm that estimates generalized linear model parameters using a sequence of surrogate datasets. The procedure sketches once to reduce data-transfer costs, and sketches again to reduce computation costs, yielding wall-clock time savings. It produces both regression coefficients and standard errors, which we compare against methods from the literature. Asymptotic properties of the proposed procedure are established, with empirical results from simulated and real-world datasets. The efficacy of the proposed method is investigated across a variety of commodity computational infrastructure configurations accessible to practitioners. A highlight of the present work is the estimation of a Poisson-log generalized linear model across almost 1.7 billion observations on a personal computer in 25 minutes.
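The two-level sketching described in the abstract can be illustrated with a minimal Python sketch. This is a hypothetical rendering, not the paper's exact construction: the function name is invented, the first sketch is taken here as uniform subsampling (standing in for the data-transfer reduction), and the second sketch as a per-iteration Gaussian random projection of the weighted least-squares system (standing in for the computation reduction), shown for a Poisson-log GLM.

```python
import numpy as np

def sketched_irls_poisson(X, y, m1, m2, iters=20, rng=None):
    """Illustrative doubly-sketched IRLS for a Poisson GLM with log link.

    First sketch: compress the n-row dataset once to an m1-row surrogate.
    Second sketch: at each IRLS iteration, project the m1-row weighted
    system down to m2 rows before solving the least-squares step.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # First sketch: uniform subsample as the surrogate dataset
    idx = rng.choice(n, size=m1, replace=False)
    Xs, ys = X[idx], y[idx]
    beta = np.zeros(p)
    for _ in range(iters):
        eta = Xs @ beta
        mu = np.exp(eta)                 # inverse of the log link
        W = mu                           # IRLS weights for Poisson-log
        z = eta + (ys - mu) / mu         # working response
        # Weighted least-squares system: minimise ||A beta - b||
        sqW = np.sqrt(W)
        A = sqW[:, None] * Xs
        b = sqW * z
        # Second sketch: Gaussian projection of the working system
        S2 = rng.standard_normal((m2, m1)) / np.sqrt(m2)
        beta, *_ = np.linalg.lstsq(S2 @ A, S2 @ b, rcond=None)
    return beta
```

Because the second sketch is redrawn at each iteration, the iterates fluctuate around the surrogate solution with noise that shrinks as m2 grows; the actual procedure, its sketch distributions, and its standard-error construction are developed in the paper itself.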
Availability of data and materials
The real-world datasets are publicly available, and the construction of the simulated datasets is described.
Code Availability
Code is available upon request.
Funding
None.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
Not applicable.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Appendix A
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hou-Liu, J., Browne, R.P. Generalized linear models for massive data via doubly-sketching. Stat Comput 33, 105 (2023). https://doi.org/10.1007/s11222-023-10274-8