Abstract
Most existing subsampling methods heavily depend on a specified model; if the assumed model is misspecified, the subsample may perform poorly. This paper focuses on a model-free subsampling method, called global likelihood subsampling, so that the subsample is robust to different model choices. It leverages the idea of the global likelihood sampler, an effective and robust method for sampling from a given continuous distribution. Furthermore, we accelerate the algorithm for large-scale datasets and extend it to handle high-dimensional data with relatively low computational complexity. Simulations and real-data studies apply the proposed method to regression and classification problems. The results illustrate that this method is robust against different modeling methods and has promising performance compared with some existing model-free subsampling methods for data compression.
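As a loose illustration of the pipeline sketched in the abstract (estimate the data density, evaluate it on a candidate point set, select points with probability proportional to the estimated density), the following is a minimal sketch, not the paper's implementation: the random candidate set (the paper uses a space-filling design), the fixed Gaussian bandwidth, and all function names are illustrative assumptions.

```python
import numpy as np

def kde_gaussian(data, points, bandwidth):
    """Gaussian kernel density estimate of `data`, evaluated at `points`."""
    n, d = data.shape
    diffs = points[:, None, :] - data[None, :, :]        # (m, n, d)
    sq = np.sum(diffs ** 2, axis=2) / bandwidth ** 2     # (m, n)
    norm = n * (np.sqrt(2.0 * np.pi) * bandwidth) ** d
    return np.exp(-0.5 * sq).sum(axis=1) / norm          # (m,)

def global_likelihood_subsample(data, m, n_candidates=512, bandwidth=0.3, seed=None):
    """Select m rows of `data` by (1) estimating the data density with a KDE,
    (2) evaluating it on a candidate set (random here; space-filling in the
    paper), (3) drawing candidates with probability proportional to the
    estimated density, and (4) mapping each draw to its nearest data point."""
    rng = np.random.default_rng(seed)
    lo, hi = data.min(axis=0), data.max(axis=0)
    cand = rng.uniform(lo, hi, size=(n_candidates, data.shape[1]))
    w = kde_gaussian(data, cand, bandwidth)
    w = w / w.sum()
    idx = rng.choice(n_candidates, size=m, replace=False, p=w)
    # nearest observed data point for each selected candidate
    d2 = ((cand[idx][:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    return data[np.argmin(d2, axis=1)]   # rows of `data`; may repeat
```

Because the subsample consists of actual rows of the data, it can be handed to any downstream regression or classification method, which is what makes the approach model-free.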
References
Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)
Bottou, L.: Stochastic Gradient Descent Tricks, Neural Networks: Tricks of the Trade, pp. 421–436. Springer, Berlin (2012)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Davis, R.A., Lii, K.S., Politis, D.N.: Remarks on Some Nonparametric Estimates of a Density Function, Selected Works of Murray Rosenblatt, pp. 95–100. Springer, New York (2011)
Dua, D., Graff, C.: UCI machine learning repository. URL http://archive.ics.uci.edu/ml (2019)
Fan, J., Lv, J.: Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B 70(5), 849–911 (2008)
Fang, K.T., Liu, M.Q., Qin, H., Zhou, Y.D.: Theory and Application of Uniform Experimental Designs. Springer, Singapore (2018)
Friedman, J.H.: Multivariate adaptive regression splines. Ann. Stat. 19(1), 1–67 (1991)
Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970)
Hlawka, E.: Funktionen von beschränkter variatiou in der theorie der gleichverteilung. Annali di Matematica Pura ed Applicata 54(1), 325–333 (1961)
Jiang, H.: Uniform convergence rates for kernel density estimation. In: International Conference on Machine Learning, pp. 1694–1703. PMLR (2017)
Joseph, V.R., Vakayil, A.: Split: an optimal method for data splitting. Technometrics 64(2), 166–176 (2022)
Liang, F., Wong, W.H.: Real-parameter evolutionary Monte Carlo with applications to Bayesian mixture models. J. Am. Stat. Assoc. 96(454), 653–666 (2001)
Ma, P., Mahoney, M., Yu, B.: A statistical perspective on algorithmic leveraging. In: International Conference on Machine Learning, pp. 91–99. PMLR (2014)
Mahoney, M.W.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011)
Maire, F., Friel, N., Alquier, P.: Informed sub-sampling MCMC: approximate Bayesian inference for large datasets. Stat. Comput. 29(3), 449–482 (2019)
Meng, C., Xie, R., Mandal, A., Zhang, X., Zhong, W., Ma, P.: Lowcon: a design-based subsampling approach in a misspecified linear model. J. Comput. Graph. Stat. 30(3), 694–708 (2021)
Naaman, M.: On the tight constant in the multivariate Dvoretzky–Kiefer–Wolfowitz inequality. Stat. Probab. Lett. 173, 109088 (2021)
Nuyens, D., Cools, R.: Fast algorithms for component-by-component construction of rank-1 lattice rules in shift-invariant reproducing kernel Hilbert spaces. Math. Comput. 75(254), 903–920 (2006)
Parzen, E.: On estimation of a probability density function and mode. Ann. Math. Stat. 33(3), 1065–1076 (1962)
Rubin, D.B.: A noniterative sampling/importance resampling alternative to data augmentation for creating a few imputations when fractions of missing information are modest: the SIR algorithm. J. Am. Stat. Assoc. 82, 544–546 (1987)
Santner, T.J., Williams, B.J., Notz, W.I.: The Design and Analysis of Computer Experiments. Springer, New York (2003)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
Ting, D., Brochu, E.: Optimal subsampling with influence functions. In: Advances in Neural Information Processing Systems, pp. 3650–3659 (2018)
Vakayil, A., Joseph, V.R., Mak, S.: Package ‘SPlit’. R package version 1.0 (2022)
Vakayil, A., Joseph, V.R., Vakayil, M.A.: Package ‘twinning’. R package version 1.0 (2022)
Vakayil, A., Joseph, V.R.: Data twinning. Stat. Anal. Data Min. ASA Data Sci. J. 15, 598–610 (2022)
Wang, D., Joseph, V.R.: Mined: minimum energy design. R Package Version 1.0 (2022)
Wang, Y.C., Ning, J.H., Zhou, Y.D., Fang, K.T.: A new sampler: randomized likelihood sampling. In: Souvenir Booklet of the 24th International Workshop on Matrices and Statistics, pp. 255–261 (2015)
Wang, H., Ma, Y.: Optimal subsampling for quantile regression in big data. Biometrika 108(1), 99–112 (2021)
Wang, H., Zhu, R., Ma, P.: Optimal subsampling for large sample logistic regression. J. Am. Stat. Assoc. 113(522), 829–844 (2018)
Wang, H., Yang, M., Stufken, J.: Information-based optimal subdata selection for big data linear regression. J. Am. Stat. Assoc. 114(525), 393–405 (2019)
Wang, L., Elmstedt, J., Wong, W.K., Xu, H.: Orthogonal subsampling for big data linear regression. Ann. Appl. Stat. 15(3), 1273–1290 (2021)
Yi, S.Y., Liu, Z., Liu, M.Q., Zhou, Y.D.: Global likelihood sampler for multimodal distributions. Submitted (2022)
Yu, J., Wang, H., Ai, M., Zhang, H.: Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J. Am. Stat. Assoc. 117(537), 265–276 (2022)
Zhang, M., Zhou, Y.D., Zhou, Z., Zhang, A.J.: Model-free subsampling method based on uniform designs. arXiv:2209.03617 (2022)
Acknowledgements
The authors would like to thank the two referees for their valuable comments, which led to a significant improvement of the presentation. This work was partially supported by the National Natural Science Foundation of China (11871288 and 12131001), the Fundamental Research Funds for the Central Universities, LPMC, and KLMDASR.
Author information
Authors and Affiliations
Contributions
Yi, S. wrote the main manuscript text, and Zhou, Y. provided the research idea and revised the whole paper. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Proof of Lemma 1
By Theorem 2 in Jiang (2017), we directly have
Let the support of f be \({\mathcal T}\) and let its volume be \(v({\mathcal T})=V_T<\infty \). Then,
which completes the proof. \(\square \)
Proof of Lemma 2
Let \(\tilde{F}(\varvec{\theta })\) be the multinomial distribution placing mass \(\hat{f}_\textrm{K}({\varvec{s}}_k)/\sum _{i=1}^M \hat{f}_\textrm{K}({\varvec{s}}_i)\) at \({\varvec{s}}_k, k=1,\ldots ,M\). Similar to the proof of Lemma 1(1) in Yi et al. (2022), if \({\mathcal S}\) is the low-discrepancy point set and \(\hat{f}_\textrm{K}({\varvec{x}})I_{[-\infty ,\varvec{\theta }]}({\varvec{x}})\) is of uniformly bounded variation for any \(\varvec{\theta }\in \mathbb {R}^d\), then for any \(\varvec{\theta }\in \mathbb {R}^d\), we have
Let \(D_\textrm{m}=\{(2j-1)/(2n): j=1,\ldots ,n\}\). Then, by the proof of Theorem 2 in Yi et al. (2022), we can obtain that
Thus, \(\vert F_{{\mathcal Z}_\textrm{s}}(\varvec{\theta })-\tilde{F}(\varvec{\theta })\vert \) can be viewed as the quasi-Monte Carlo error based on \(D_\textrm{m}\). The star discrepancy of \(D_\textrm{m}\) is always no more than 1/n. Hence, by the Koksma–Hlawka inequality (Hlawka 1961), if \(\varrho _{\varvec{\theta }}(u)\) is of uniformly bounded variation for any \(\varvec{\theta }\in \mathbb {R}^d\) and any run size M of the space-filling design \({\mathcal S}\), then for any \(\varvec{\theta }\in \mathbb {R}^d\), we have
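For reference, the Koksma–Hlawka inequality invoked here bounds the quasi-Monte Carlo error by the product of the integrand's variation and the point set's star discrepancy:
\[
\left| \frac{1}{n}\sum _{j=1}^{n} \varrho _{\varvec{\theta }}(u_j) - \int _{0}^{1} \varrho _{\varvec{\theta }}(u)\,\textrm{d}u \right| \le V\!\left(\varrho _{\varvec{\theta }}\right) D_n^{*}(D_\textrm{m}),
\]
where \(V(\cdot )\) denotes the variation in the sense of Hardy and Krause and \(D_n^{*}(D_\textrm{m})\le 1/n\) is the star discrepancy, so the error is \(O(1/n)\) uniformly in \(\varvec{\theta }\) under the bounded-variation condition.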
Since the above formula is valid for any \(\varvec{\theta }\in \mathbb {R}^d\), it is also true for the supremum value. Then, by the triangle inequality,
which completes the proof. \(\square \)
Proof of Theorem 3
Based on Lemmas 1 and 2, by the triangle inequality, we directly have
Let \({\mathcal D}= \otimes _{i=1}^d [a_i,b_i]\) and \(\delta \) be the volume of the union of the balls \({\mathcal B}({\varvec{s}}_i, r),i=1,\ldots , M\), divided by the volume of \(\otimes _{i=1}^d [a_i-r,b_i+r]\). Since r is set as half of the separation distance of \({\mathcal S}\), we have \(\delta \prod _{i=1}^d (2r+b_i-a_i)=M v_d r^d\). Without loss of generality, let \(a_i=0\) and \(b_i=1\) for \(i=1,\ldots ,d\), which leads to \(r=1/[(Mv_d\delta ^{-1})^{1/d}-2]\). Then, we have \(r=O(M^{-1/d})\), which is also true when \(a_i\) and \(b_i (a_i<b_i)\) take any other bounded values for \(i=1,\ldots ,d\).
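Spelling out the algebra behind the rate: under the normalization \(a_i=0\) and \(b_i=1\), the volume identity becomes
\[
\delta \,(1+2r)^d = M v_d r^d \;\Longrightarrow \; \frac{1}{r}+2 = \left( M v_d \delta ^{-1}\right) ^{1/d} \;\Longrightarrow \; r = \frac{1}{(M v_d \delta ^{-1})^{1/d}-2},
\]
which gives \(r=O(M^{-1/d})\) as M grows.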
Based on this result, we have \(\Delta =O(M^{-\alpha /d}+(M\log N/N)^{1/2})\). In addition, we can easily obtain that \(M^{-\alpha /d}>1/M\) if \(d=1\), and \(M^{-\alpha /d} >(\log M)^d/M\) as \(M \rightarrow \infty \) if \(d>1\). Thus, \(O_{\textrm{P}}(\max \{\Delta , \kappa (M)\})=O_{\textrm{P}}(M^{-\alpha /d}+(M\log N/N)^{1/2})\) as \(M \rightarrow \infty \), which completes the proof.
Proof of Lemma 4
For any \({\varvec{z}}_\textrm{s} \in {\mathcal Z}_\textrm{s}\) and \(\varvec{\theta }\in \mathbb {R}^d\), we have
where \(F_{{\mathcal Z}}\) is the ECDF of \({\mathcal Z}\). Hence, any element in \({\mathcal Z}_\textrm{s}\) follows the CDF F. By the multivariate Dvoretzky-Kiefer-Wolfowitz inequality (Naaman 2021), we have
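For reference, in the univariate case the Dvoretzky–Kiefer–Wolfowitz inequality with Massart's tight constant states that, for the empirical CDF \(F_n\) of n i.i.d. draws from F and any \(\epsilon >0\),
\[
\textrm{P}\Big( \sup _{x\in \mathbb {R}} \vert F_n(x)-F(x)\vert > \epsilon \Big) \le 2e^{-2n\epsilon ^2};
\]
Naaman (2021) establishes a multivariate analogue with an additional dimension-dependent factor in front of the exponential, which is the form used here.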
which completes the proof. \(\square \)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yi, SY., Zhou, YD. Model-free global likelihood subsampling for massive data. Stat Comput 33, 9 (2023). https://doi.org/10.1007/s11222-022-10185-0