A Heuristic Sampling Method for Maintaining the Probability Distribution

  • Regular Paper
  • Published in: Journal of Computer Science and Technology

Abstract

Sampling is a fundamental method for generating data subsets. Because many data analysis methods are built on probability distributions, preserving the distribution of the original data when sampling helps to ensure good analysis performance. However, sampling a minimum-size subset that maintains the probability distribution remains an open problem. In this paper, we decompose the joint probability distribution into a product of conditional probabilities based on a Bayesian network and use the chi-square test to formulate a sampling problem that requires the sampled subset to pass the distribution test for every conditional distribution. Furthermore, we propose a heuristic sampling algorithm that generates the required subset using two scoring functions: one based on the chi-square test and the other based on likelihood functions. Experiments on four types of datasets, each containing 60,000 samples, show that with the significance level α set to 0.05, the algorithm can exclude 99.9%, 99.0%, 93.1%, and 96.7% of the samples for the benchmark Bayesian networks ASIA, ALARM, HEPAR2, and ANDES, respectively. When subsets of the same size are sampled, the subsets generated by our algorithm pass all the distribution tests with an average distribution difference of approximately 0.03; by contrast, subsets generated by random sampling pass only 83.8% of the tests, with an average distribution difference of approximately 0.24.
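To make the formulation concrete: the joint distribution is factorized over a Bayesian network as P(X1, ..., Xn) = ∏i P(Xi | Pa(Xi)), and the sampled subset must pass a chi-square test for each factor. The sketch below is ours, not the authors' released code; it illustrates the two core ingredients on a single categorical variable, namely Pearson's chi-square test as the acceptance criterion and a greedy heuristic that adds whichever candidate keeps the chi-square score lowest. All names (chi_square_stat, greedy_sample, and so on) are illustrative assumptions, and the paper's second, likelihood-based scoring function is omitted for brevity.

```python
# Our sketch of the paper's core ideas for a single categorical variable;
# not the authors' implementation. Function names are illustrative.
from collections import Counter
import random

from scipy.stats import chi2


def empirical_distribution(population):
    """Relative frequency of each category in the full dataset."""
    counts = Counter(population)
    total = len(population)
    return {v: c / total for v, c in counts.items()}


def chi_square_stat(subset, pop_prob, values):
    """Pearson chi-square statistic of the subset's observed counts
    against the counts expected under the population distribution."""
    n = len(subset)
    obs = Counter(subset)
    stat = 0.0
    for v in values:
        expected = n * pop_prob.get(v, 0.0)
        if expected > 0:
            stat += (obs[v] - expected) ** 2 / expected
    return stat


def passes_test(subset, pop_prob, values, alpha=0.05):
    """Accept the subset when it is not significantly different from
    the population at significance level alpha."""
    critical = chi2.ppf(1.0 - alpha, df=len(values) - 1)
    return chi_square_stat(subset, pop_prob, values) <= critical


def greedy_sample(population, values, alpha=0.05):
    """Greedily add the candidate that keeps the chi-square score lowest;
    stop once all categories appear and the subset passes the test."""
    pop_prob = empirical_distribution(population)
    subset, remaining = [], list(population)
    while remaining:
        best = min(remaining,
                   key=lambda x: chi_square_stat(subset + [x], pop_prob, values))
        subset.append(best)
        remaining.remove(best)
        if len(set(subset)) == len(values) and passes_test(subset, pop_prob, values, alpha):
            break
    return subset


if __name__ == "__main__":
    random.seed(0)
    data = [random.choice("AABC") for _ in range(400)]  # skewed toy data, P(A) ~ 0.5
    kept = greedy_sample(data, values=["A", "B", "C"])
    print(len(kept), passes_test(kept, empirical_distribution(data), ["A", "B", "C"]))
```

In the paper, a subset must pass such a test simultaneously for every conditional probability table in the Bayesian-network factorization, which is what makes finding a minimum passing subset non-trivial; the toy loop above stops after only a handful of samples because a single variable imposes just one constraint.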

References

  1. Goodhart C A E, O’Hara M. High frequency data in financial markets: Issues and applications. Journal of Empirical Finance, 1997, 4(2/3): 73-114. DOI: https://doi.org/10.1016/S0927-5398(97)00003-0.

  2. Lohr S L. Sampling: Design and Analysis (2nd edition). CRC Press, 2019. DOI: https://doi.org/10.1201/9780429296284.

  3. Yates F. Systematic sampling. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 1948, 241(834): 345-377. DOI: https://doi.org/10.1098/rsta.1948.0023.

  4. Neyman J. On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 1934, 97(4): 558-625. DOI: https://doi.org/10.2307/2342192.

  5. Rand W M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 1971, 66(336): 846-850. DOI: https://doi.org/10.2307/2284239.

  6. Aljalbout E, Golkov V, Siddiqui Y et al. Clustering with deep learning: Taxonomy and new methods. arXiv:1801.07648, http://export.arxiv.org/abs/1801.07648, March 2020.

  7. Goodman L A. Snowball sampling. The Annals of Mathematical Statistics, 1961, 32(1): 148-170. DOI: https://doi.org/10.1214/aoms/1177705148.

  8. Emerson R W. Convenience sampling, random sampling, and snowball sampling: How does sampling affect the validity of research? Journal of Visual Impairment & Blindness, 2015, 109(2): 164-168. DOI: https://doi.org/10.1177/0145482X1510900215.

  9. Saar-Tsechansky M, Provost F. Active sampling for class probability estimation and ranking. Machine Learning, 2004, 54(2): 153-178. DOI: https://doi.org/10.1023/B:MACH.0000011806.12374.c3.

  10. Dasgupta S, Hsu D. Hierarchical sampling for active learning. In Proc. the 25th International Conference on Machine Learning, June 2008, pp.208-215. DOI: 10.1145/1390156.1390183.

  11. Zhang H, Lin J, Cormack G V, Smucker M D. Sampling strategies and active learning for volume estimation. In Proc. the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2016, pp.981-984. DOI: 10.1145/2911451.2914685.

  12. Silva J, Ribeiro B, Sung A H. Finding the critical sampling of big datasets. In Proc. the Computing Frontiers Conference, May 2017, pp.355-360. DOI: https://doi.org/10.1145/3075564.3078886.

  13. Alwosheel A, Van Cranenburgh S, Chorus C G. Is your dataset big enough? Sample size requirements when using artificial neural networks for discrete choice analysis. Journal of Choice Modelling, 2018, 28: 167-182. DOI: https://doi.org/10.1016/j.jocm.2018.07.002.

  14. Wang A, An N, Chen G, Liu J, Alterovitz G. Subtype dependent biomarker identification and tumor classification from gene expression profiles. Knowledge-Based Systems, 2018, 146: 104-117. DOI: https://doi.org/10.1016/j.knosys.2018.01.025.

  15. Yang J, Wang J, Cheng W, Li L. Sampling to maintain approximate probability distribution under chi-square test. In Proc. the 37th National Conference of Theoretical Computer Science, August 2019, pp.29-45. DOI: 10.1007/978-981-15-0105-0_3.

  16. Paxton P, Curran P J, Bollen K A et al. Monte Carlo experiments: Design and implementation. Structural Equation Modeling, 2001, 8(2): 287-312. DOI: https://doi.org/10.1207/S15328007SEM0802_7.

  17. Gilks W R, Richardson S, Spiegelhalter D. Markov Chain Monte Carlo in Practice (1st edition). Chapman and Hall/CRC, 1996.

  18. Wu S, Angelikopoulos P, Papadimitriou C et al. Bayesian annealed sequential importance sampling: An unbiased version of transitional Markov chain Monte Carlo. ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part B: Mechanical Engineering, 2018, 4(1): Article No. 011008. DOI: 10.1115/1.4037450.

  19. George E I, McCulloch R E. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 1993, 88(423): 881-889. DOI: 10.1080/01621459.1993.10476353.

  20. Martino L, Read J, Luengo D. Independent doubly adaptive rejection Metropolis sampling within Gibbs sampling. IEEE Transactions on Signal Processing, 2015, 63(12): 3123-3138. DOI: https://doi.org/10.1109/TSP.2015.2420537.

  21. Murphy K. An introduction to graphical models. Technical Report, University of California, 2001. https://www.cs.ubc.ca/~murphyk/Papers/intro_gm.pdf, March 2020.

  22. Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Machine Learning, 1997, 29(2/3): 131-163. DOI: https://doi.org/10.1023/A:1007465528199.

  23. Bilmes J A. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report, International Computer Science Institute, 1998. http://lasa.epfl.ch/teaching/lectures/ML_Phd/Notes/GPGMM.pdf, March 2020.

  24. Zivkovic Z. Improved adaptive Gaussian mixture model for background subtraction. In Proc. the 17th International Conference on Pattern Recognition, August 2004, pp.28-31. DOI: 10.1109/ICPR.2004.1333992.

  25. Murphy K P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

  26. Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. In Breakthroughs in Statistics: Methodology and Distribution, Kotz S, Johnson N L (eds.), Springer, 1992, pp.11-28. DOI: 10.1007/978-1-4612-4380-9_2.

  27. Balakrishnan N, Voinov V, Nikulin M S. Chi-Squared Goodness of Fit Tests with Applications. Academic Press, 2013.

  28. Das A, Kempe D. Approximate submodularity and its applications: Subset selection, sparse approximation and dictionary selection. The Journal of Machine Learning Research, 2018, 19(1): Article No. 3.

  29. Qian C, Yu Y, Zhou Z H. Subset selection by Pareto optimization. In Proc. the Annual Conference on Neural Information Processing Systems, December 2015, pp.1774-1782.

  30. Qian C, Shi J C, Yu Y et al. Parallel Pareto optimization for subset selection. In Proc. the 25th International Joint Conference on Artificial Intelligence, July 2016, pp.1939-1945.

  31. Whitley D. A genetic algorithm tutorial. Statistics and Computing, 1994, 4(2): 65-85.

  32. Lauritzen S, Spiegelhalter D. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society: Series B (Methodological), 1988, 50(2): 157-194. DOI: 10.1111/j.2517-6161.1988.tb01721.x.

  33. Beinlich I, Suermondt H, Chavez R, Cooper G. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proc. the 2nd European Conference on Artificial Intelligence in Medicine, August 1989, pp.247-256. DOI: https://doi.org/10.1007/978-3-642-93437-7_28.

  34. Oniśko A, Druzdzel M J, Wasyluk H. A probabilistic causal model for diagnosis of liver disorders. In Proc. the 7th International Symposium on Intelligent Information Systems, June 1998, pp.379-387.

  35. Conati C, Gertner A S, VanLehn K et al. On-line student modeling for coached problem solving using Bayesian networks. In Proc. the 6th International Conference on User Modeling, June 1997, pp.231-242. DOI: 10.1007/978-3-7091-2670-7_24.

Author information

Corresponding author

Correspondence to Jun-Da Wang.

Supplementary Information

ESM 1 (PDF 235 kb)

About this article

Cite this article

Yang, JY., Wang, JD., Zhang, YF. et al. A Heuristic Sampling Method for Maintaining the Probability Distribution. J. Comput. Sci. Technol. 36, 896–909 (2021). https://doi.org/10.1007/s11390-020-0065-6
