A Heuristic Sampling Method for Maintaining the Probability Distribution

  • Regular Paper
  • Published in: Journal of Computer Science and Technology

Abstract

Sampling is a fundamental method for generating data subsets. Because many data analysis methods are built on probability distributions, preserving the distribution of the original data when sampling helps to ensure good analysis performance. However, sampling a minimum-size subset that maintains the probability distribution remains an open problem. In this paper, we decompose the joint probability distribution into a product of conditional probabilities based on a Bayesian network and use the chi-square test to formulate a sampling problem that requires the sampled subset to pass the distribution test for every conditional distribution. Furthermore, we propose a heuristic sampling algorithm that generates the required subset using two scoring functions: one based on the chi-square test and the other based on likelihood functions. Experiments on four types of datasets, each containing 60,000 samples, show that with the significance level α set to 0.05, the algorithm can exclude 99.9%, 99.0%, 93.1%, and 96.7% of the samples for the benchmark Bayesian networks ASIA, ALARM, HEPAR2, and ANDES, respectively. When subsets of the same size are sampled, the subsets generated by our algorithm pass all the distribution tests with an average distribution difference of approximately 0.03; by contrast, subsets generated by random sampling pass only 83.8% of the tests, with an average distribution difference of approximately 0.24.
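To make the formulation concrete: the joint distribution is factorized over a Bayesian network as P(X1, ..., Xn) = ∏i P(Xi | Pa(Xi)), and the sampled subset must pass a chi-square test for each factor. The sketch below is ours, not the authors' released code; it illustrates the two core ingredients on a single categorical variable, namely Pearson's chi-square test as the acceptance criterion and a greedy heuristic that adds whichever candidate keeps the chi-square score lowest. All names (chi_square_stat, greedy_sample, and so on) are illustrative assumptions, and the paper's second, likelihood-based scoring function is omitted for brevity.

```python
# Our sketch of the paper's core ideas for a single categorical variable;
# not the authors' implementation. Function names are illustrative.
from collections import Counter
import random

from scipy.stats import chi2


def empirical_distribution(population):
    """Relative frequency of each category in the full dataset."""
    counts = Counter(population)
    total = len(population)
    return {v: c / total for v, c in counts.items()}


def chi_square_stat(subset, pop_prob, values):
    """Pearson chi-square statistic of the subset's observed counts
    against the counts expected under the population distribution."""
    n = len(subset)
    obs = Counter(subset)
    stat = 0.0
    for v in values:
        expected = n * pop_prob.get(v, 0.0)
        if expected > 0:
            stat += (obs[v] - expected) ** 2 / expected
    return stat


def passes_test(subset, pop_prob, values, alpha=0.05):
    """Accept the subset when it is not significantly different from
    the population at significance level alpha."""
    critical = chi2.ppf(1.0 - alpha, df=len(values) - 1)
    return chi_square_stat(subset, pop_prob, values) <= critical


def greedy_sample(population, values, alpha=0.05):
    """Greedily add the candidate that keeps the chi-square score lowest;
    stop once all categories appear and the subset passes the test."""
    pop_prob = empirical_distribution(population)
    subset, remaining = [], list(population)
    while remaining:
        best = min(remaining,
                   key=lambda x: chi_square_stat(subset + [x], pop_prob, values))
        subset.append(best)
        remaining.remove(best)
        if len(set(subset)) == len(values) and passes_test(subset, pop_prob, values, alpha):
            break
    return subset


if __name__ == "__main__":
    random.seed(0)
    data = [random.choice("AABC") for _ in range(400)]  # skewed toy data, P(A) ~ 0.5
    kept = greedy_sample(data, values=["A", "B", "C"])
    print(len(kept), passes_test(kept, empirical_distribution(data), ["A", "B", "C"]))
```

In the paper, a subset must pass such a test simultaneously for every conditional probability table in the Bayesian-network factorization, which is what makes finding a minimum passing subset non-trivial; the toy loop above stops after only a handful of samples because a single variable imposes just one constraint.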

References

  1. Goodhart C A E, O’Hara M. High frequency data in financial markets: Issues and applications. Journal of Empirical Finance, 1997, 4(2/3): 73-114. DOI: https://doi.org/10.1016/S0927-5398(97)00003-0.

  2. Lohr S L. Sampling: Design and Analysis (2nd edition). CRC Press, 2019. DOI: https://doi.org/10.1201/9780429296284.

  3. Yates F. Systematic sampling. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 1948, 241(834): 345-377. DOI: https://doi.org/10.1098/rsta.1948.0023.

  4. Neyman J. On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 1934, 97(4): 558-625. DOI: https://doi.org/10.2307/2342192.

  5. Rand W M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 1971, 66(336): 846-850. DOI: https://doi.org/10.2307/2284239.

  6. Aljalbout E, Golkov V, Siddiqui Y et al. Clustering with deep learning: Taxonomy and new methods. arXiv:1801.07648, http://export.arxiv.org/abs/1801.07648, March 2020.

  7. Goodman L A. Snowball sampling. The Annals of Mathematical Statistics, 1961, 32(1): 148-170. DOI: https://doi.org/10.1214/aoms/1177705148.

  8. Emerson R W. Convenience sampling, random sampling, and snowball sampling: How does sampling affect the validity of research? Journal of Visual Impairment & Blindness, 2015, 109(2): 164-168. DOI: https://doi.org/10.1177/0145482X1510900215.

  9. Saar-Tsechansky M, Provost F. Active sampling for class probability estimation and ranking. Machine Learning, 2004, 54(2): 153-178. DOI: https://doi.org/10.1023/B:MACH.0000011806.12374.c3.

  10. Dasgupta S, Hsu D. Hierarchical sampling for active learning. In Proc. the 25th International Conference on Machine Learning, June 2008, pp.208-215. DOI: 10.1145/1390156.1390183.

  11. Zhang H, Lin J, Cormack G V, Smucker M D. Sampling strategies and active learning for volume estimation. In Proc. the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2016, pp.981-984. DOI: 10.1145/2911451.2914685.

  12. Silva J, Ribeiro B, Sung A H. Finding the critical sampling of big datasets. In Proc. the Computing Frontiers Conference, May 2017, pp.355-360. DOI: https://doi.org/10.1145/3075564.3078886.

  13. Alwosheel A, Van Cranenburgh S, Chorus C G. Is your dataset big enough? Sample size requirements when using artificial neural networks for discrete choice analysis. Journal of Choice Modelling, 2018, 28: 167-182. DOI: https://doi.org/10.1016/j.jocm.2018.07.002.

  14. Wang A, An N, Chen G, Liu J, Alterovitz G. Subtype dependent biomarker identification and tumor classification from gene expression profiles. Knowledge-Based Systems, 2018, 146: 104-117. DOI: https://doi.org/10.1016/j.knosys.2018.01.025.

  15. Yang J, Wang J, Cheng W, Li L. Sampling to maintain approximate probability distribution under chi-square test. In Proc. the 37th National Conference of Theoretical Computer Science, August 2019, pp.29-45. DOI: 10.1007/978-981-15-0105-0_3.

  16. Paxton P, Curran P J, Bollen K A et al. Monte Carlo experiments: Design and implementation. Structural Equation Modeling, 2001, 8(2): 287-312. DOI: https://doi.org/10.1207/S15328007SEM0802_7.

  17. Gilks W R, Richardson S, Spiegelhalter D. Markov Chain Monte Carlo in Practice (1st edition). Chapman and Hall/CRC, 1996.

  18. Wu S, Angelikopoulos P, Papadimitriou C et al. Bayesian annealed sequential importance sampling: An unbiased version of transitional Markov chain Monte Carlo. ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part B: Mechanical Engineering, 2018, 4(1): Article No. 011008. DOI: 10.1115/1.4037450.

  19. George E I, McCulloch R E. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 1993, 88(423): 881-889. DOI: 10.1080/01621459.1993.10476353.

  20. Martino L, Read J, Luengo D. Independent doubly adaptive rejection Metropolis sampling within Gibbs sampling. IEEE Transactions on Signal Processing, 2015, 63(12): 3123-3138. DOI: https://doi.org/10.1109/TSP.2015.2420537.

  21. Murphy K. An introduction to graphical models. Technical Report, University of California, 2001. https://www.cs.ubc.ca/~murphyk/Papers/intro_gm.pdf, March 2020.

  22. Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Machine Learning, 1997, 29(2/3): 131-163. DOI: https://doi.org/10.1023/A:1007465528199.

  23. Bilmes J A. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report, International Computer Science Institute, 1998. http://lasa.epfl.ch/teaching/lectures/ML_Phd/Notes/GPGMM.pdf, March 2020.

  24. Zivkovic Z. Improved adaptive Gaussian mixture model for background subtraction. In Proc. the 17th International Conference on Pattern Recognition, August 2004, pp.28-31. DOI: 10.1109/ICPR.2004.1333992.

  25. Murphy K P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

  26. Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. In Breakthroughs in Statistics: Methodology and Distribution, Kotz S, Johnson N L (eds.), Springer, 1992, pp.11-28. DOI: 10.1007/978-1-4612-4380-9_2.

  27. Balakrishnan N, Voinov V, Nikulin M S. Chi-Squared Goodness of Fit Tests with Applications. Academic Press, 2013.

  28. Das A, Kempe D. Approximate submodularity and its applications: Subset selection, sparse approximation and dictionary selection. The Journal of Machine Learning Research, 2018, 19(1): Article No. 3.

  29. Qian C, Yu Y, Zhou Z H. Subset selection by Pareto optimization. In Proc. the Annual Conference on Neural Information Processing Systems, December 2015, pp.1774-1782.

  30. Qian C, Shi J C, Yu Y et al. Parallel Pareto optimization for subset selection. In Proc. the 25th International Joint Conference on Artificial Intelligence, July 2016, pp.1939-1945.

  31. Whitley D. A genetic algorithm tutorial. Statistics and Computing, 1994, 4(2): 65-85.

  32. Lauritzen S, Spiegelhalter D. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society: Series B (Methodological), 1988, 50(2): 157-194. DOI: 10.1111/j.2517-6161.1988.tb01721.x.

  33. Beinlich I, Suermondt H, Chavez R, Cooper G. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proc. the 2nd European Conference on Artificial Intelligence in Medicine, August 1989, pp.247-256. DOI: https://doi.org/10.1007/978-3-642-93437-7_28.

  34. Oniśko A, Druzdzel M J, Wasyluk H. A probabilistic causal model for diagnosis of liver disorders. In Proc. the 7th International Symposium on Intelligent Information Systems, June 1998, pp.379-387.

  35. Conati C, Gertner A S, VanLehn K et al. On-line student modeling for coached problem solving using Bayesian networks. In Proc. the 6th International Conference on User Modeling, June 1997, pp.231-242. DOI: 10.1007/978-3-7091-2670-7_24.

Author information

Corresponding author

Correspondence to Jun-Da Wang.

Supplementary Information

ESM 1 (PDF 235 kb)

About this article

Cite this article

Yang, JY., Wang, JD., Zhang, YF. et al. A Heuristic Sampling Method for Maintaining the Probability Distribution. J. Comput. Sci. Technol. 36, 896–909 (2021). https://doi.org/10.1007/s11390-020-0065-6
