Density estimation-based method to determine sample size for random sample partition of big data

He, Yulin; Chen, Jiaqi; Shen, Jiaxing; Fournier-Viger, Philippe; Huang, Joshua Zhexue

doi:10.1007/s11704-023-2356-x

Density estimation-based method to determine sample size for random sample partition of big data

Research Article
Published: 16 December 2023

Volume 18, article number 185322, (2024)
Cite this article

Frontiers of Computer Science Aims and scope Submit manuscript

Yulin He^1,2,
Jiaqi Chen^1,2,
Jiaxing Shen³,
Philippe Fournier-Viger² &
…
Joshua Zhexue Huang^1,2

100 Accesses
Explore all metrics

Abstract

Random sample partition (RSP) is a newly developed big data representation and management model to deal with big data approximate computation problems. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a challenge for implementing RSP is determining an appropriate sample size for RSP data blocks. While a large sample size increases the burden of big data computation, a small size will lead to insufficient distribution information for RSP data blocks. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated based on the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality by using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks for an increasing sample size. Finally, a series of persuasive experiments are conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method is convergent for calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with sample size determined by DEM can yield a good approximation of the probability density function (p.d.f.); and (3) DEM provides more accurate sample sizes than the existing sample size determination methods from the perspective of p.d.f. estimation. This demonstrates that DEM is a viable approach to deal with the sample size determination problem for big data RSP implementation.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An Effective RSP Data Sampling Algorithm

Exploring and cleaning big data with random sample data blocks

Article Open access 04 June 2019

A Two-Stage Data Processing Algorithm to Generate Random Sample Partitions for Big Data Analysis

References

Sookhak M, Yu F R, Zomaya A Y. Auditing big data storage in cloud computing using divide and conquer tables. IEEE Transactions on Parallel and Distributed Systems, 2018, 29(5): 999–1012
Article Google Scholar
Zhao S Y, Li R X, Tian W L, Xiao W J, Dong X H, Liao D J, Khan S U, Li K Q. Divide-and-conquer approach for solving singular value decomposition based on MapReduce. Concurrency and Computation: Practice and Experience, 2016, 28(2): 331–350
Article Google Scholar
Ghazi M R, Gangodkar D. Hadoop, MapReduce and HDFS: a developers perspective. Procedia Computer Science, 2015, 48: 45–50
Article Google Scholar
Neha M P, Narendra M P, Hasan M I, Parth D S, Mayur M P. Improving HDFS write performance using efficient replica placement. In: Proceedings of the 5th International Conference-Confluence the Next Generation Information Technology Summit. 2014, 36–39
Salloum S, Huang J Z, He Y L. Random sample partition: a distributed data model for big data analysis. IEEE Transactions on Industrial Informatics, 2019, 15(11): 5846–5854
Article Google Scholar
Wei C H, Salloum S, Emara T Z, Zhang X L, Huang J Z, He Y L. A two-stage data processing algorithm to generate random sample partitions for big data analysis. In: Proceedings of the 11th International Conference on Cloud Computing. 2018, 347–364
Yamane T. Statistics: An Introductory Analysis. 2nd ed. New York: Harper and Row, 1967
Google Scholar
Cochran W G. Sampling Techniques. New York: John Wiley & Sons, 2007
Google Scholar
Smith M F. Sampling considerations in evaluating cooperative extension programs. Gainesville: Florida Cooperative Extension Service, Institute of Food and Agricultural Sciences, University of Florida, 1983
Google Scholar
Naaman M. On the tight constant in the multivariate Dvoretzky-Kiefer-Wolfowitz inequality. Statistics & Probability Letters, 2021, 173: 109088
Article MathSciNet Google Scholar
Kleiner A, Talwalkar A, Sarkar P, Jordan M I. A scalable bootstrap for massive data. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2014, 76(4): 795–816
Article MathSciNet Google Scholar
Reshef D N, Reshef Y A, Finucane H K, Grossman S R, McVean G, Turnbaugh P J, Lander E S, Mitzenmacher M, Sabeti P C. Detecting novel associations in large data sets. Science, 2011, 334(6062): 1518–1524
Article Google Scholar
Sengupta S, Volgushev S, Shao X F. A subsampled double bootstrap for massive data. Journal of the American Statistical Association, 2016, 111(515): 1222–1232
Article MathSciNet Google Scholar
Browne R H. On the use of a pilot sample for sample size determination. Statistics in Medicine, 1995, 14(17): 1933–1940
Article Google Scholar
Lenth R V. Some practical guidelines for effective sample size determination. The American Statistician, 2002, 55(3): 187–193
Article MathSciNet Google Scholar
Ahmad W M A W, Amin W A A W M, Aleng N A, Mohamed N. Some practical guidelines for effective sample-size determination in observational studies. Aceh International Journal of Science and Technology, 2012, 1(2): 51–53
Article Google Scholar
Burmeister E, Aitken L M. Sample size: how many is enough? Australian Critical Care, 2012, 25(4): 271–274
Article Google Scholar
Okada S, Ohzeki M, Taguchi S. Efficient partition of integer optimization problems with one-hot encoding. Scientific Reports, 2019, 9(1): 13036
Article Google Scholar
He Y L, Ye X, Huang D F, Fournier-Viger P, Huang J Z. A hybrid method to measure distribution consistency of mixed-attribute datasets. IEEE Transactions on Artificial Intelligence, 2023, 4(1): 182–196
Article Google Scholar
Parzen E. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 1962, 33(3): 1065–1076
Article MathSciNet Google Scholar
Jiang J, He Y L, Dai D X, Huang J Z. A new kernel density estimator based on the minimum entropy of data set. Information Sciences, 2019, 491: 223–231
Article MathSciNet Google Scholar
Jones M C, Marron J S, Sheather S J. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 1996, 91(433): 401–407
Article MathSciNet Google Scholar
Perez-Cruz F. Kullback-Leibler divergence estimation of continuous distributions. In: Proceedings of 2008 IEEE International Symposium on Information Theory. 2008, 1666–1670
Perez-Cruz F. Estimation of information theoretic measures for continuous random variables. In: Proceedings of the 21st International Conference on Neural Information Processing Systems. 2008, 1257–1264
Yan Y Y, Cheng D Z, Feng J E, Li H T, Yue J M. Survey on applications of algebraic state space theory of logical systems to finite state machines. Science China Information Sciences, 2023, 66(1): 111201
Article MathSciNet Google Scholar

Download references

Acknowledgements

The authors would like to sincerely thank the editors and three anonymous reviewers whose valuable suggestions considerably helped improve the paper after two rounds of review. This paper was supported by the National Natural Science Foundation of China (Grant No. 61972261), the Natural Science Foundation of Guangdong Province (No. 2023A1515011667), the Key Basic Research Foundation of Shenzhen (No. JCYJ20220818100205012), and the Basic Research Foundation of Shenzhen (No. JCYJ20210324093609026).

Author information

Authors and Affiliations

Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, 518107, China
Yulin He, Jiaqi Chen & Joshua Zhexue Huang
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China
Yulin He, Jiaqi Chen, Philippe Fournier-Viger & Joshua Zhexue Huang
Department of Computing and Decision Sciences, Lingnan University, Hong Kong, 999077, China
Jiaxing Shen

Authors

Yulin He
View author publications
You can also search for this author inPubMed Google Scholar
Jiaqi Chen
View author publications
You can also search for this author inPubMed Google Scholar
Jiaxing Shen
View author publications
You can also search for this author inPubMed Google Scholar
Philippe Fournier-Viger
View author publications
You can also search for this author inPubMed Google Scholar
Joshua Zhexue Huang
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence to Joshua Zhexue Huang.

Additional information

Yulin He received the PhD degree from Hebei University, China in 2014. From 2011 to 2014, he has served as a Research Assistant with the Department of Computing, The Hong Kong Polytechnic University, China. From 2014 to 2017, he worked as a Post-doctoral Fellow in the College of Computer Science and Software Engineering, Shenzhen University, China. He is currently a Research Associate with Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), China. His main research interests include big data approximate computing technologies, multi-sample statistical analysis theories and methods, and data mining/machine learning algorithms and their applications. He has published over 100+ research papers in ACM Transactions, CAAI Transactions, IEEE Transactions, Elsevier, Springer Journals and PAKDD, IJCNN, CEC, DASFAA conferences. He is an ACM member, CAAI member, CCF member, IEEE member, and the Editorial Review Board members of several international journals.

Jiaqi Chen received her bachelor degree from Shenzhen University, China in 2021. She is currently pursuiting her PhD degree in the College of Computer Science and Software Engineering, Shenzhen University, China. Her main research interests include big data approximate computing technologies, multi-sample statistical analysis theories and methods, and data mining/machine learning algorithms and their applications.

Jiaxing Shen is an Assistant Professor at the Department of Computing and Decision Sciences, Lingnan University, China. He obtained BE in Software Engineering and PhD in Computer Science from Jilin University, China in 2014 and The Hong Kong Polytechnic University, China in 2019, respectively. His central research theme is Human Dynamics which refers to interdisciplinary research of human behavior with an aim to understand human behavior and provide actionable insights. Under the theme, he has several research interests including Mobile Computing, Data Mining, and IoT systems. He has published over 25 papers in top-tier journals and conferences including TMC, TKDE, TOIS, IMWUT, IoT-J, INFOCOM, WWW, ICDM, and ICDCS. He has won Best Paper Award twice including one from INFOCOM 2020. He also served as a Session Chair of MASS 2021.

Philippe Fournier-Viger is a distinguished professor at the College of Computer Science and Software Engineering at Shenzhen University, China. He obtained a title of national talent from the National Natural Science Foundation of China. He has published more than 300 research papers related to data mining, big data, intelligent systems and applications, which have received more than 10,000 citations (H-Index 51). He is the editor-in-chief of the Data Science and Pattern Recognition journal and the former associate editor-in-chief of the Applied Intelligence journal (SCI, Q1). He is the founder of the SPMF data mining library, offering more than 230 algorithms, used in more than 1,000 research papers. He is co-founder of the UDML, PMDB, and MLiSE series workshop held at the ICDM, PKDD, DASFAA, and KDD conferences. His interests are data mining, algorithm design, pattern mining, sequence mining, big data, and applications.

Joshua Zhexue Huang received the PhD degree from The Royal Institute of Technology, Stockholm, Sweden in 1993. He is currently a Distinguished Professor with the College of Computer Science and Software Engineering, Shenzhen University, China. He is also the Director of Big Data Institute, China, and the Deputy Director of National Engineering Laboratory for Big Data System Computing Technology. He has published over 200 research papers in conferences and journals. His main research interests include big data technology and applications. Prof. Huang received the first PAKDD Most Influential Paper Award in 2006. He is known for his contributions to the development of a series of k-means type clustering algorithms in data mining, such as k-modes, fuzzy k-modes, k-prototypes, and w-k-means that are widely cited and used, and some of which have been included in commercial software. He has extensive industry expertise in business intelligence and data mining and has been involved in numerous consulting projects in many countries.

Electronic Supplementary Material