Abstract
Most existing subsampling methods heavily depend on a specified model; if the assumed model is misspecified, the subsample may perform poorly. This paper focuses on a model-free subsampling method, called global likelihood subsampling, so that the subsample is robust to different model choices. It leverages the idea of the global likelihood sampler, an effective and robust method for sampling from a given continuous distribution. Furthermore, we accelerate the algorithm for large-scale datasets and extend it to handle high-dimensional data with relatively low computational complexity. Simulations and real-data studies apply the proposed method to regression and classification problems. The results illustrate that this method is robust against different modeling methods and has promising performance compared with some existing model-free subsampling methods for data compression.
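As a loose illustration of the pipeline sketched in the abstract (estimate the data density, evaluate it on a candidate point set, select points with probability proportional to the estimated density), the following is a minimal sketch, not the paper's implementation: the random candidate set (the paper uses a space-filling design), the fixed Gaussian bandwidth, and all function names are illustrative assumptions.

```python
import numpy as np

def kde_gaussian(data, points, bandwidth):
    """Gaussian kernel density estimate of `data`, evaluated at `points`."""
    n, d = data.shape
    diffs = points[:, None, :] - data[None, :, :]        # (m, n, d)
    sq = np.sum(diffs ** 2, axis=2) / bandwidth ** 2     # (m, n)
    norm = n * (np.sqrt(2.0 * np.pi) * bandwidth) ** d
    return np.exp(-0.5 * sq).sum(axis=1) / norm          # (m,)

def global_likelihood_subsample(data, m, n_candidates=512, bandwidth=0.3, seed=None):
    """Select m rows of `data` by (1) estimating the data density with a KDE,
    (2) evaluating it on a candidate set (random here; space-filling in the
    paper), (3) drawing candidates with probability proportional to the
    estimated density, and (4) mapping each draw to its nearest data point."""
    rng = np.random.default_rng(seed)
    lo, hi = data.min(axis=0), data.max(axis=0)
    cand = rng.uniform(lo, hi, size=(n_candidates, data.shape[1]))
    w = kde_gaussian(data, cand, bandwidth)
    w = w / w.sum()
    idx = rng.choice(n_candidates, size=m, replace=False, p=w)
    # nearest observed data point for each selected candidate
    d2 = ((cand[idx][:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    return data[np.argmin(d2, axis=1)]   # rows of `data`; may repeat
```

Because the subsample consists of actual rows of the data, it can be handed to any downstream regression or classification method, which is what makes the approach model-free.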
References
Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)
Bottou, L.: Stochastic Gradient Descent Tricks, Neural Networks: Tricks of the Trade, pp. 421–436. Springer, Berlin (2012)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Davis, R.A., Lii, K.S., Politis, D.N.: Remarks on Some Nonparametric Estimates of a Density Function, Selected Works of Murray Rosenblatt, pp. 95–100. Springer, New York (2011)
Dua, D., Graff, C.: UCI machine learning repository. URL http://archive.ics.uci.edu/ml (2019)
Fan, J., Lv, J.: Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B 70(5), 849–911 (2008)
Fang, K.T., Liu, M.Q., Qin, H., Zhou, Y.D.: Theory and Application of Uniform Experimental Designs. Springer, Singapore (2018)
Friedman, J.H.: Multivariate adaptive regression splines. Ann. Stat. 19(1), 1–67 (1991)
Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970)
Hlawka, E.: Funktionen von beschränkter variatiou in der theorie der gleichverteilung. Annali di Matematica Pura ed Applicata 54(1), 325–333 (1961)
Jiang, H.: Uniform convergence rates for kernel density estimation. In: International Conference on Machine Learning, pp. 1694–1703. PMLR (2017)
Joseph, V.R., Vakayil, A.: Split: an optimal method for data splitting. Technometrics 64(2), 166–176 (2022)
Liang, F., Wong, W.H.: Real-parameter evolutionary Monte Carlo with applications to Bayesian mixture models. J. Am. Stat. Assoc. 96(454), 653–666 (2001)
Ma, P., Mahoney, M., Yu, B.: A statistical perspective on algorithmic leveraging. In: International Conference on Machine Learning, pp. 91–99. PMLR (2014)
Mahoney, M.W.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011)
Maire, F., Friel, N., Alquier, P.: Informed sub-sampling MCMC: approximate Bayesian inference for large datasets. Stat. Comput. 29(3), 449–482 (2019)
Meng, C., Xie, R., Mandal, A., Zhang, X., Zhong, W., Ma, P.: Lowcon: a design-based subsampling approach in a misspecified linear model. J. Comput. Graph. Stat. 30(3), 694–708 (2021)
Naaman, M.: On the tight constant in the multivariate Dvoretzky–Kiefer–Wolfowitz inequality. Stat. Probab. Lett. 173, 109088 (2021)
Nuyens, D., Cools, R.: Fast algorithms for component-by-component construction of rank-1 lattice rules in shift-invariant reproducing kernel Hilbert spaces. Math. Comput. 75(254), 903–920 (2006)
Parzen, E.: On estimation of a probability density function and mode. Ann. Math. Stat. 33(3), 1065–1076 (1962)
Rubin, D.B.: A noniterative sampling/importance resampling alternative to data augmentation for creating a few imputations when fractions of missing information are modest: the SIR algorithm. J. Am. Stat. Assoc. 82, 544–546 (1987)
Santner, T.J., Williams, B.J., Notz, W.I.: The Design and Analysis of Computer Experiments. Springer, New York (2003)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
Ting, D., Brochu, E.: Optimal subsampling with influence functions. In: Advances in Neural Information Processing Systems, pp. 3650–3659 (2018)
Vakayil, A., Joseph, V.R., Mak, S.: Package ‘SPlit’. R package version 1.0 (2022)
Vakayil, A., Joseph, V.R., Vakayil, M.A.: Package ‘twinning’. R package version 1.0 (2022)
Vakayil, A., Joseph, V.R.: Data twinning. Stat. Anal. Data Min. ASA Data Sci. J. 15, 598–610 (2022)
Wang, D., Joseph, V.R.: Mined: minimum energy design. R Package Version 1.0 (2022)
Wang, Y.C., Ning, J.H., Zhou, Y.D., Fang, K.T.: A new sampler: randomized likelihood sampling. In: Souvenir Booklet of the 24th International Workshop on Matrices and Statistics, pp. 255–261 (2015)
Wang, H., Ma, Y.: Optimal subsampling for quantile regression in big data. Biometrika 108(1), 99–112 (2021)
Wang, H., Zhu, R., Ma, P.: Optimal subsampling for large sample logistic regression. J. Am. Stat. Assoc. 113(522), 829–844 (2018)
Wang, H., Yang, M., Stufken, J.: Information-based optimal subdata selection for big data linear regression. J. Am. Stat. Assoc. 114(525), 393–405 (2019)
Wang, L., Elmstedt, J., Wong, W.K., Xu, H.: Orthogonal subsampling for big data linear regression. Ann. Appl. Stat. 15(3), 1273–1290 (2021)
Yi, S.Y., Liu, Z., Liu, M.Q., Zhou, Y.D.: Global likelihood sampler for multimodal distributions. Submitted (2022)
Yu, J., Wang, H., Ai, M., Zhang, H.: Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J. Am. Stat. Assoc. 117(537), 265–276 (2022)
Zhang, M., Zhou, Y.D., Zhou, Z., Zhang, A.J.: Model-free subsampling method based on uniform designs. arXiv:2209.03617 (2022)
Acknowledgements
The authors would like to thank the two referees for their valuable comments, which led to a significant improvement of the presentation. This work was partially supported by the National Natural Science Foundation of China (11871288 and 12131001), the Fundamental Research Funds for the Central Universities, LPMC, and KLMDASR.
Author information
Authors and Affiliations
Contributions
Yi, S. wrote the main manuscript text, and Zhou, Y. provided the research idea and revised the whole paper. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Proof of Lemma 1
By Theorem 2 in Jiang (2017), we directly have
Let the support of f be \({\mathcal T}\) and let its volume be \(v({\mathcal T})=V_T<\infty \). Then,
which completes the proof. \(\square \)
Proof of Lemma 2
Let \(\tilde{F}(\varvec{\theta })\) be the multinomial distribution placing mass \(\hat{f}_\textrm{K}({\varvec{s}}_k)/\sum _{i=1}^M \hat{f}_\textrm{K}({\varvec{s}}_i)\) at \({\varvec{s}}_k, k=1,\ldots ,M\). Similar to the proof of Lemma 1(1) in Yi et al. (2022), if \({\mathcal S}\) is the low-discrepancy point set and \(\hat{f}_\textrm{K}({\varvec{x}})I_{[-\infty ,\varvec{\theta }]}({\varvec{x}})\) is of uniformly bounded variation for any \(\varvec{\theta }\in \mathbb {R}^d\), then for any \(\varvec{\theta }\in \mathbb {R}^d\), we have
Let \(D_\textrm{m}=\{(2j-1)/(2n): j=1,\ldots ,n\}\). Then, by the proof of Theorem 2 in Yi et al. (2022), we can obtain that
Thus, \(\vert F_{{\mathcal Z}_\textrm{s}}(\varvec{\theta })-\tilde{F}(\varvec{\theta })\vert \) can be viewed as the quasi-Monte Carlo error based on \(D_\textrm{m}\). The star discrepancy of \(D_\textrm{m}\) is always no more than 1/n. Hence, by the Koksma–Hlawka inequality (Hlawka 1961), if \(\varrho _{\varvec{\theta }}(u)\) is of uniformly bounded variation for any \(\varvec{\theta }\in \mathbb {R}^d\) and any run size M of the space-filling design \({\mathcal S}\), then for any \(\varvec{\theta }\in \mathbb {R}^d\), we have
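For reference, the Koksma–Hlawka inequality invoked here bounds the quasi-Monte Carlo error by the product of the integrand's variation and the point set's star discrepancy:
\[
\left| \frac{1}{n}\sum _{j=1}^{n} \varrho _{\varvec{\theta }}(u_j) - \int _{0}^{1} \varrho _{\varvec{\theta }}(u)\,\textrm{d}u \right| \le V\!\left(\varrho _{\varvec{\theta }}\right) D_n^{*}(D_\textrm{m}),
\]
where \(V(\cdot )\) denotes the variation in the sense of Hardy and Krause and \(D_n^{*}(D_\textrm{m})\le 1/n\) is the star discrepancy, so the error is \(O(1/n)\) uniformly in \(\varvec{\theta }\) under the bounded-variation condition.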
Since the above formula is valid for any \(\varvec{\theta }\in \mathbb {R}^d\), it is also true for the supremum value. Then, by the triangle inequality,
which completes the proof. \(\square \)
Proof of Theorem 3
Based on Lemmas 1 and 2, by the triangle inequality, we directly have
Let \({\mathcal D}= \otimes _{i=1}^d [a_i,b_i]\) and \(\delta \) be the volume of the union of the balls \({\mathcal B}({\varvec{s}}_i, r),i=1,\ldots , M\), divided by the volume of \(\otimes _{i=1}^d [a_i-r,b_i+r]\). Since r is set as half of the separation distance of \({\mathcal S}\), we have \(\delta \prod _{i=1}^d (2r+b_i-a_i)=M v_d r^d\). Without loss of generality, let \(a_i=0\) and \(b_i=1\) for \(i=1,\ldots ,d\), which leads to \(r=1/[(Mv_d\delta ^{-1})^{1/d}-2]\). Then, we have \(r=O(M^{-1/d})\), which is also true when \(a_i\) and \(b_i (a_i<b_i)\) take any other bounded values for \(i=1,\ldots ,d\).
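Spelling out the algebra behind the rate: under the normalization \(a_i=0\) and \(b_i=1\), the volume identity becomes
\[
\delta \,(1+2r)^d = M v_d r^d \;\Longrightarrow \; \frac{1}{r}+2 = \left( M v_d \delta ^{-1}\right) ^{1/d} \;\Longrightarrow \; r = \frac{1}{(M v_d \delta ^{-1})^{1/d}-2},
\]
which gives \(r=O(M^{-1/d})\) as M grows.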
Based on this result, we have \(\Delta =O(M^{-\alpha /d}+(M\log N/N)^{1/2})\). In addition, we can easily obtain that \(M^{-\alpha /d}>1/M\) if \(d=1\), and \(M^{-\alpha /d} >(\log M)^d/M\) as \(M \rightarrow \infty \) if \(d>1\). Thus, \(O_{\textrm{P}}(\max \{\Delta , \kappa (M)\})=O_{\textrm{P}}(M^{-\alpha /d}+(M\log N/N)^{1/2})\) as \(M \rightarrow \infty \), which completes the proof.
Proof of Lemma 4
For any \({\varvec{z}}_\textrm{s} \in {\mathcal Z}_\textrm{s}\) and \(\varvec{\theta }\in \mathbb {R}^d\), we have
where \(F_{{\mathcal Z}}\) is the ECDF of \({\mathcal Z}\). Hence, any element in \({\mathcal Z}_\textrm{s}\) follows the CDF F. By the multivariate Dvoretzky-Kiefer-Wolfowitz inequality (Naaman 2021), we have
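For reference, in the univariate case the Dvoretzky–Kiefer–Wolfowitz inequality with Massart's tight constant states that, for the empirical CDF \(F_n\) of n i.i.d. draws from F and any \(\epsilon >0\),
\[
\textrm{P}\Big( \sup _{x\in \mathbb {R}} \vert F_n(x)-F(x)\vert > \epsilon \Big) \le 2e^{-2n\epsilon ^2};
\]
Naaman (2021) establishes a multivariate analogue with an additional dimension-dependent factor in front of the exponential, which is the form used here.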
which completes the proof. \(\square \)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yi, SY., Zhou, YD. Model-free global likelihood subsampling for massive data. Stat Comput 33, 9 (2023). https://doi.org/10.1007/s11222-022-10185-0