
Model-free global likelihood subsampling for massive data


Abstract

Most existing subsampling methods depend heavily on a specified model; if the assumed model is incorrect, the resulting subsample may perform poorly. This paper proposes a model-free subsampling method, called global likelihood subsampling, whose subsamples are robust to different model choices. It leverages the idea of the global likelihood sampler, an effective and robust method for sampling from a given continuous distribution. We further accelerate the algorithm for large-scale datasets and extend it to high-dimensional data with relatively low computational complexity. Simulations and real-data studies apply the proposed method to regression and classification problems, illustrating that it is robust against different modeling methods and performs well compared with existing model-free subsampling methods for data compression.


References

  • Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)

  • Bottou, L.: Stochastic gradient descent tricks. In: Neural Networks: Tricks of the Trade, pp. 421–436. Springer, Berlin (2012)

  • Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

  • Davis, R.A., Lii, K.S., Politis, D.N.: Remarks on some nonparametric estimates of a density function. In: Selected Works of Murray Rosenblatt, pp. 95–100. Springer, New York (2011)

  • Dua, D., Graff, C.: UCI machine learning repository. http://archive.ics.uci.edu/ml (2019)

  • Fan, J., Lv, J.: Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B 70(5), 849–911 (2008)

  • Fang, K.T., Liu, M.Q., Qin, H., Zhou, Y.D.: Theory and Application of Uniform Experimental Designs. Springer, Singapore (2018)

  • Friedman, J.H.: Multivariate adaptive regression splines. Ann. Stat. 19(1), 1–67 (1991)

  • Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970)

  • Hlawka, E.: Funktionen von beschränkter Variation in der Theorie der Gleichverteilung. Annali di Matematica Pura ed Applicata 54(1), 325–333 (1961)

  • Jiang, H.: Uniform convergence rates for kernel density estimation. In: International Conference on Machine Learning, pp. 1694–1703. PMLR (2017)

  • Joseph, V.R., Vakayil, A.: SPlit: an optimal method for data splitting. Technometrics 64(2), 166–176 (2022)

  • Liang, F., Wong, W.H.: Real-parameter evolutionary Monte Carlo with applications to Bayesian mixture models. J. Am. Stat. Assoc. 96(454), 653–666 (2001)

  • Ma, P., Mahoney, M., Yu, B.: A statistical perspective on algorithmic leveraging. In: International Conference on Machine Learning, pp. 91–99. PMLR (2014)

  • Mahoney, M.W.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011)

  • Maire, F., Friel, N., Alquier, P.: Informed sub-sampling MCMC: approximate Bayesian inference for large datasets. Stat. Comput. 29(3), 449–482 (2019)

  • Meng, C., Xie, R., Mandal, A., Zhang, X., Zhong, W., Ma, P.: LowCon: a design-based subsampling approach in a misspecified linear model. J. Comput. Graph. Stat. 30(3), 694–708 (2021)

  • Naaman, M.: On the tight constant in the multivariate Dvoretzky–Kiefer–Wolfowitz inequality. Stat. Probab. Lett. 173, 109088 (2021)

  • Nuyens, D., Cools, R.: Fast algorithms for component-by-component construction of rank-1 lattice rules in shift-invariant reproducing kernel Hilbert spaces. Math. Comput. 75(254), 903–920 (2006)

  • Parzen, E.: On estimation of a probability density function and mode. Ann. Math. Stat. 33(3), 1065–1076 (1962)

  • Rubin, D.B.: A noniterative sampling/importance resampling alternative to data augmentation for creating a few imputations when fractions of missing information are modest: the SIR algorithm. J. Am. Stat. Assoc. 82, 544–546 (1987)

  • Santner, T.J., Williams, B.J., Notz, W.I.: The Design and Analysis of Computer Experiments. Springer, New York (2003)

  • Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)

  • Ting, D., Brochu, E.: Optimal subsampling with influence functions. In: Advances in Neural Information Processing Systems, pp. 3650–3659 (2018)

  • Vakayil, A., Joseph, V.R., Mak, S., Vakayil, M.A.: Package ‘SPlit’. R package version 1.0 (2022)

  • Vakayil, A., Joseph, V.R., Vakayil, M.A.: Package ‘twinning’. R package version 1.0 (2022)

  • Vakayil, A., Joseph, V.R.: Data twinning. Stat. Anal. Data Min. ASA Data Sci. J. 15, 598–610 (2022)

  • Wang, D., Joseph, V.R.: Mined: minimum energy design. R package version 1.0 (2022)

  • Wang, Y.C., Ning, J.H., Zhou, Y.D., Fang, K.T.: A new sampler: randomized likelihood sampling. In: Souvenir Booklet of the 24th International Workshop on Matrices and Statistics, pp. 255–261 (2015)

  • Wang, H., Ma, Y.: Optimal subsampling for quantile regression in big data. Biometrika 108(1), 99–112 (2021)

  • Wang, H., Zhu, R., Ma, P.: Optimal subsampling for large sample logistic regression. J. Am. Stat. Assoc. 113(522), 829–844 (2018)

  • Wang, H., Yang, M., Stufken, J.: Information-based optimal subdata selection for big data linear regression. J. Am. Stat. Assoc. 114(525), 393–405 (2019)

  • Wang, L., Elmstedt, J., Wong, W.K., Xu, H.: Orthogonal subsampling for big data linear regression. Ann. Appl. Stat. 15(3), 1273–1290 (2021)

  • Yi, S.Y., Liu, Z., Liu, M.Q., Zhou, Y.D.: Global likelihood sampler for multimodal distributions. Submitted (2022)

  • Yu, J., Wang, H., Ai, M., Zhang, H.: Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J. Am. Stat. Assoc. 117(537), 265–276 (2022)

  • Zhang, M., Zhou, Y.D., Zhou, Z., Zhang, A.J.: Model-free subsampling method based on uniform designs. arXiv:2209.03617 (2022)


Acknowledgements

The authors would like to thank the two referees for their valuable comments, which led to a significant improvement of the presentation. This work was partially supported by the National Natural Science Foundation of China (11871288 and 12131001), the Fundamental Research Funds for the Central Universities, LPMC, and KLMDASR.

Author information

Authors and Affiliations

Authors

Contributions

Yi, S. wrote the main manuscript text, and Zhou, Y. provided the research idea and revised the whole paper. All authors reviewed the manuscript.

Corresponding author

Correspondence to Yong-Dao Zhou.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Proof of Lemma 1

By Theorem 2 in Jiang (2017), we directly have

$$\begin{aligned} \sup_{\boldsymbol{\theta}\in\mathbb{R}^d} \vert \hat{f}_{\mathrm{K}}(\boldsymbol{\theta})-f(\boldsymbol{\theta})\vert = O_{\mathrm{P}}(\Delta). \end{aligned}$$

Let the support of \(f\) be \(\mathcal{T}\) with volume \(v(\mathcal{T})=V_T<\infty\). Then,

$$\begin{aligned} \sup_{\boldsymbol{\theta}\in\mathbb{R}^d} \vert \hat{F}_{\mathrm{K}}(\boldsymbol{\theta})-F(\boldsymbol{\theta})\vert =&\sup_{\boldsymbol{\theta}\in\mathbb{R}^d} \left| \int_{(-\infty,\boldsymbol{\theta}]\cap\mathcal{T}} \bigl(\hat{f}_{\mathrm{K}}(\boldsymbol{\vartheta})-f(\boldsymbol{\vartheta})\bigr)\,\mathrm{d}\boldsymbol{\vartheta}\right| \\ \le&\sup_{\boldsymbol{\theta}\in\mathbb{R}^d} \int_{(-\infty,\boldsymbol{\theta}]\cap\mathcal{T}} \vert \hat{f}_{\mathrm{K}}(\boldsymbol{\vartheta})-f(\boldsymbol{\vartheta})\vert \,\mathrm{d}\boldsymbol{\vartheta}\\ \le&V_T \cdot \sup_{\boldsymbol{\theta}\in\mathbb{R}^d} \vert \hat{f}_{\mathrm{K}}(\boldsymbol{\theta})-f(\boldsymbol{\theta})\vert = O_{\mathrm{P}}(\Delta), \end{aligned}$$

which completes the proof. \(\square \)
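For intuition, Lemma 1 is easy to check numerically in one dimension. The sketch below (the setup, names, and bandwidth choice are our own assumptions, not code from the paper) fits a kernel density estimate to a sample, integrates it to obtain \(\hat{F}_{\mathrm{K}}\), and compares the sup-norm errors of the density and CDF estimates on a grid.

```python
# Numerical illustration of Lemma 1 with d = 1: the sup-norm error of the
# KDE-based CDF is controlled (up to V_T) by that of the KDE itself.
# Illustrative sketch only; the paper provides no such code.
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)
N = 20_000
data = rng.normal(size=N)                 # sample from f = N(0, 1)

kde = gaussian_kde(data)                  # \hat{f}_K with default bandwidth
grid = np.linspace(-5.0, 5.0, 2001)       # effective support T
dens = kde(grid)

# sup |f_hat - f| over the grid
sup_pdf = np.max(np.abs(dens - norm.pdf(grid)))

# \hat{F}_K via cumulative trapezoidal integration of the KDE
F_hat = np.concatenate(
    ([0.0], np.cumsum(0.5 * (dens[1:] + dens[:-1]) * np.diff(grid))))
sup_cdf = np.max(np.abs(F_hat - norm.cdf(grid)))

print(f"sup |f_hat - f| = {sup_pdf:.4f}")
print(f"sup |F_hat - F| = {sup_cdf:.4f}")
```

In this effectively bounded-support setting, the CDF error inherits the \(O_{\mathrm{P}}(\Delta)\) rate of the density error, as the proof asserts.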

Proof of Lemma 2

Let \(\tilde{F}(\boldsymbol{\theta})\) be the CDF of the multinomial distribution placing mass \(\hat{f}_{\mathrm{K}}(\boldsymbol{s}_k)/\sum_{i=1}^M \hat{f}_{\mathrm{K}}(\boldsymbol{s}_i)\) at \(\boldsymbol{s}_k\), \(k=1,\ldots,M\). Similar to the proof of Lemma 1(1) in Yi et al. (2022), if \(\mathcal{S}\) is a low-discrepancy point set and \(\hat{f}_{\mathrm{K}}(\boldsymbol{x})I_{[-\infty,\boldsymbol{\theta}]}(\boldsymbol{x})\) is of uniformly bounded variation for any \(\boldsymbol{\theta}\in\mathbb{R}^d\), then for any \(\boldsymbol{\theta}\in\mathbb{R}^d\), we have

$$\begin{aligned} \vert \tilde{F}(\boldsymbol{\theta})-\hat{F}_{\mathrm{K}}(\boldsymbol{\theta})\vert = O(\kappa(M)). \end{aligned}$$

Let \(D_{\mathrm{m}}=\{(2j-1)/(2n): j=1,\ldots,n\}\). Then, by the proof of Theorem 2 in Yi et al. (2022), we obtain

$$\begin{aligned} F_{\mathcal{Z}_{\mathrm{s}}}(\boldsymbol{\theta})=\frac{1}{n}\sum_{u\in D_{\mathrm{m}}}\varrho_{\boldsymbol{\theta}}(u), \qquad \tilde{F}(\boldsymbol{\theta})=\int_{[0,1]}\varrho_{\boldsymbol{\theta}}(u)\,\mathrm{d}u. \end{aligned}$$

Thus, \(\vert F_{\mathcal{Z}_{\mathrm{s}}}(\boldsymbol{\theta})-\tilde{F}(\boldsymbol{\theta})\vert\) can be viewed as the quasi-Monte Carlo error based on \(D_{\mathrm{m}}\), whose star discrepancy is at most \(1/n\). Hence, by the Koksma–Hlawka inequality (Hlawka 1961), if \(\varrho_{\boldsymbol{\theta}}(u)\) is of uniformly bounded variation for any \(\boldsymbol{\theta}\in\mathbb{R}^d\) and any run size \(M\) of the space-filling design \(\mathcal{S}\), then for any \(\boldsymbol{\theta}\in\mathbb{R}^d\), we have

$$\begin{aligned} \vert F_{\mathcal{Z}_{\mathrm{s}}}(\boldsymbol{\theta})-\tilde{F}(\boldsymbol{\theta})\vert = O\left(\frac{1}{n}\right). \end{aligned}$$

Since this bound holds for any \(\boldsymbol{\theta}\in\mathbb{R}^d\), it also holds for the supremum over \(\boldsymbol{\theta}\). Then, by the triangle inequality,

$$\begin{aligned}&\sup_{\boldsymbol{\theta}\in\mathbb{R}^d}\vert F_{\tilde{\mathcal{Z}}_{\mathrm{s}}}(\boldsymbol{\theta}) - \hat{F}_{\mathrm{K}}(\boldsymbol{\theta})\vert \\&\quad \le \sup_{\boldsymbol{\theta}\in\mathbb{R}^d} \vert \tilde{F}(\boldsymbol{\theta})-\hat{F}_{\mathrm{K}}(\boldsymbol{\theta})\vert + \sup_{\boldsymbol{\theta}\in\mathbb{R}^d} \vert F_{\mathcal{Z}_{\mathrm{s}}}(\boldsymbol{\theta})-\tilde{F}(\boldsymbol{\theta})\vert \\&\quad = O\left(\max\left\{\kappa(M), \frac{1}{n}\right\}\right), \end{aligned}$$

which completes the proof. \(\square \)
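The quantity \(F_{\mathcal{Z}_{\mathrm{s}}}\) analyzed above arises from evaluating the quantile function of the discrete distribution \(\tilde{F}\) at the midpoints \(D_{\mathrm{m}}\). A minimal sketch of that step, with our own function and variable names and a toy one-dimensional density (not the paper's implementation), is:

```python
# Minimal sketch of the discrete inverse-CDF step analyzed in Lemma 2:
# sample n points from the multinomial \tilde{F} on the design points S
# by evaluating its quantile function at the midpoints D_m = {(2j-1)/(2n)}.
# Names and the 1-d Gaussian example are our own, not from the paper.
import numpy as np

def midpoint_inverse_cdf(S, f_hat_vals, n):
    """Draw n points from the discrete law with masses prop. to f_hat_vals."""
    w = f_hat_vals / f_hat_vals.sum()            # multinomial masses
    cdf = np.cumsum(w)                           # \tilde{F} at the atoms
    u = (2 * np.arange(1, n + 1) - 1) / (2 * n)  # midpoint set D_m
    idx = np.searchsorted(cdf, u)                # discrete quantile lookup
    return S[idx]

# Toy usage: S is a regular grid, f_hat a density evaluated on S.
S = np.linspace(-4.0, 4.0, 512)
f_hat = np.exp(-S**2 / 2)                        # unnormalized N(0, 1) density
sample = midpoint_inverse_cdf(S, f_hat, n=100)
print(sample[:5])
```

Using the deterministic midpoints instead of i.i.d. uniforms is what converts the \(O_{\mathrm{P}}(n^{-1/2})\) Monte Carlo error into the \(O(1/n)\) quasi-Monte Carlo error above.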

Proof of Theorem 3

Based on Lemmas 1 and 2, by the triangle inequality, we directly have

$$\begin{aligned} \sup_{\boldsymbol{\theta}\in\mathbb{R}^d}\vert F_{\tilde{\mathcal{Z}}_{\mathrm{s}}}(\boldsymbol{\theta}) - F(\boldsymbol{\theta})\vert = O_{\mathrm{P}}\left(\max\left\{\Delta, \kappa(M), \frac{1}{n}\right\}\right). \end{aligned}$$

Let \(\mathcal{D}=\otimes_{i=1}^d [a_i,b_i]\) and let \(\delta\) be the volume of the union of the balls \(\mathcal{B}(\boldsymbol{s}_i, r)\), \(i=1,\ldots,M\), divided by the volume of \(\otimes_{i=1}^d [a_i-r,b_i+r]\). Since \(r\) is set to half of the separation distance of \(\mathcal{S}\), the balls are disjoint and \(\delta \prod_{i=1}^d (2r+b_i-a_i)=M v_d r^d\). Without loss of generality, let \(a_i=0\) and \(b_i=1\) for \(i=1,\ldots,d\), which leads to \(r=1/[(Mv_d\delta^{-1})^{1/d}-2]\). Then, we have \(r=O(M^{-1/d})\), which remains true when \(a_i\) and \(b_i\) (\(a_i<b_i\)) take any other bounded values for \(i=1,\ldots,d\).

Based on this result, we have \(\Delta =O(M^{-\alpha /d}+(M\log N/N)^{1/2})\). In addition, we can easily obtain that \(M^{-\alpha /d}>1/M\) if \(d=1\), and \(M^{-\alpha /d} >(\log M)^d/M\) as \(M \rightarrow \infty \) if \(d>1\). Thus, \(O_{\mathrm{P}}(\max \{\Delta , \kappa (M)\})=O_{\mathrm{P}}(M^{-\alpha /d}+(M\log N/N)^{1/2})\) as \(M \rightarrow \infty \), which completes the proof. \(\square \)
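The radius argument can be probed empirically: half the separation distance of a space-filling design should scale like \(M^{-1/d}\), so \(rM^{1/d}\) should stay bounded as \(M\) grows. A rough check, using a scrambled Sobol' sequence as a stand-in design (our choice; the paper does not prescribe this design or code):

```python
# Empirical check of r = O(M^{-1/d}) in the proof of Theorem 3, where r is
# half the separation distance of the design S in [0,1]^d.  The Sobol'
# stand-in design and all names here are our own assumptions.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import qmc

d = 2
for m in (6, 8, 10, 12):                   # designs with M = 2^m points
    S = qmc.Sobol(d, seed=0).random_base2(m)
    r = pdist(S).min() / 2                 # half the separation distance
    M = S.shape[0]
    print(f"M = {M:6d}   r * M^(1/d) = {r * M**(1 / d):.3f}")
```

The printed values should remain bounded, consistent with the \(O(M^{-1/d})\) upper bound (a Sobol' sequence need not attain it exactly).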

Proof of Lemma 4

For any \(\boldsymbol{z}_{\mathrm{s}} \in \mathcal{Z}_{\mathrm{s}}\) and \(\boldsymbol{\theta}\in\mathbb{R}^d\), we have

$$\begin{aligned} \mathrm{P}(\boldsymbol{z}_{\mathrm{s}} \in (-\infty, \boldsymbol{\theta}]) = \mathrm{E}\bigl(\mathrm{E}\bigl(I_{(-\infty, \boldsymbol{\theta}]}(\boldsymbol{z}_{\mathrm{s}}) \mid \mathcal{Z}\bigr)\bigr) = \mathrm{E}(F_{\mathcal{Z}}(\boldsymbol{\theta})) = F(\boldsymbol{\theta}), \end{aligned}$$

where \(F_{\mathcal{Z}}\) is the ECDF of \(\mathcal{Z}\). Hence, any element of \(\mathcal{Z}_{\mathrm{s}}\) follows the CDF \(F\). By the multivariate Dvoretzky–Kiefer–Wolfowitz inequality (Naaman 2021), we have

$$\begin{aligned} \sup_{\boldsymbol{\theta}\in\mathbb{R}^d}\vert F(\boldsymbol{\theta})-F_{\mathcal{Z}_{\mathrm{s}}}(\boldsymbol{\theta})\vert = O_{\mathrm{P}}\left(\frac{1}{\sqrt{N_{\mathrm{s}}}}\right), \end{aligned}$$

which completes the proof. \(\square \)
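The \(O_{\mathrm{P}}(1/\sqrt{N_{\mathrm{s}}})\) rate is easy to visualize in one dimension: the Kolmogorov–Smirnov statistic of a uniform random subsample shrinks like \(N_{\mathrm{s}}^{-1/2}\), so \(\mathrm{KS}\cdot\sqrt{N_{\mathrm{s}}}\) remains roughly constant. A small simulation (our own illustration, not code from the paper):

```python
# Simulation of the DKW rate in Lemma 4 with d = 1: the KS statistic
# sup |F - F_{Z_s}| of a uniform random subsample scales like N_s^{-1/2}.
# The setup below is an assumed toy example, not from the paper.
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
Z = rng.standard_normal(1_000_000)              # full data from F = N(0, 1)

for Ns in (100, 1_000, 10_000, 100_000):
    zs = rng.choice(Z, size=Ns, replace=False)  # uniform subsample Z_s
    ks = kstest(zs, "norm").statistic           # sup |F_{Z_s} - F|
    print(f"N_s = {Ns:7d}   KS = {ks:.4f}   "
          f"KS * sqrt(N_s) = {ks * np.sqrt(Ns):.2f}")
```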

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yi, SY., Zhou, YD. Model-free global likelihood subsampling for massive data. Stat Comput 33, 9 (2023). https://doi.org/10.1007/s11222-022-10185-0

