Abstract
As the volume of data stored in databases keeps growing, it is important to develop software that can generate large amounts of test data reflecting the properties of a given sample. Such data make it possible to stress-test database algorithms and evaluate their performance. When the generated data set is much larger than the given sample, the process is called data augmentation or synthetic data generation. Data augmentation can also be very useful in Big Data benchmarking. This paper describes a method for statistical data generation based on a given sample, in which the generated result reflects the statistical properties of the sample as closely as possible. We explain how any given data can be represented numerically and hence clustered using the DBSCAN and K-means algorithms. We then introduce a hybrid clustering method that combines the two algorithms and aims to unify their strengths. After the data are clustered, the individual sub-clusters are statistically analyzed, and pseudo-random data are generated from the results of that analysis. The results obtained with the hybrid clustering algorithm show that artificial data reflecting the statistical properties of a given sample can be created.
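The pipeline outlined in the abstract (cluster, analyze each sub-cluster, generate) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how DBSCAN and K-means are combined, so the sketch assumes DBSCAN runs first and K-means then splits each dense cluster into sub-clusters. It further assumes that the sample is already numeric, that DBSCAN noise points are discarded, and that each dimension of a sub-cluster can be modelled by an independent normal distribution. The names `dbscan`, `kmeans`, `generate`, `eps`, `min_pts` and `k` are hypothetical.

```python
import math
import random
import statistics


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: one label per point, -1 meaning noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # only core points expand the cluster
                queue.extend(reachable)
    return labels


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd iteration; returns the non-empty groups of points."""
    centers = rng.sample(points, k)
    groups = [points]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        centers = [
            tuple(statistics.fmean(col) for col in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
    return [g for g in groups if g]


def generate(sample, n, eps, min_pts, k, seed=0):
    """Generate n synthetic points from a numeric sample (list of tuples)."""
    rng = random.Random(seed)
    labels = dbscan(sample, eps, min_pts)
    subclusters = []
    # Assumed hybrid step: split every DBSCAN cluster with K-means.
    for c in sorted(set(labels) - {-1}):
        members = [p for p, label in zip(sample, labels) if label == c]
        subclusters.extend(kmeans(members, min(k, len(members)), rng))
    # Pick sub-clusters in proportion to their size, then draw each
    # coordinate from a normal fitted to that sub-cluster (an assumption).
    weights = [len(s) for s in subclusters]
    result = []
    for s in rng.choices(subclusters, weights=weights, k=n):
        result.append(tuple(
            rng.gauss(statistics.fmean(col), statistics.pstdev(col))
            for col in zip(*s)
        ))
    return result
```

Calling `generate(sample, n=10_000, eps=1.0, min_pts=4, k=2)` on a sample of a few dozen tuples would return 10,000 synthetic tuples. The parameter values are illustrative only, and the sketch raises an error if DBSCAN labels every point as noise.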
Dr. Kiss was also with J. Selye University, Komárno, Slovakia.
Acknowledgements
The project was supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Fazekas, B., Kiss, A. (2018). Statistical Data Generation Using Sample Data. In: Benczúr, A., et al. New Trends in Databases and Information Systems. ADBIS 2018. Communications in Computer and Information Science, vol 909. Springer, Cham. https://doi.org/10.1007/978-3-030-00063-9_4
DOI: https://doi.org/10.1007/978-3-030-00063-9_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00062-2
Online ISBN: 978-3-030-00063-9
eBook Packages: Computer Science, Computer Science (R0)