ABSTRACT
Generating synthetic data that closely resemble real data is a crucial task in data augmentation and data production. Because synthetic data preserve the distribution of the authentic data while concealing sensitive information, they enable Big Data acquisition for model training without privacy challenges. Nevertheless, obstacles arise at every stage, from acquiring real-world open-source data to synthesizing new samples that are as genuine as possible. This paper presents a comparative study of the efficacy of different generative models, such as Generative Adversarial Networks (GAN), Variational Autoencoder (VAE), Synthetic Minority Oversampling Technique (SMOTE), Data Synthesizer (DS), Synthetic Data Vault with Gaussian Copula (SDV-G), Conditional Generative Adversarial Networks (SDV-GAN), and the SynthPop Non-Parametric (SP-NP) approach, in synthesizing data for various datasets. For evaluation, we use pairwise correlation and the Synthetic Data (SD) metrics as utility measures between the real and the generated data. The paper thus investigates the effects of the various data generation models, and the processing time of each model is included as one of the evaluation metrics.
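As an illustration of the pairwise-correlation utility measure mentioned above, the following is a minimal sketch, not the paper's implementation. It assumes numeric columns, Pearson correlation, and the Frobenius norm of the difference between the two correlation matrices as the score; the function name `pairwise_correlation_difference` and the toy data are hypothetical.

```python
import numpy as np


def pairwise_correlation_difference(real, synthetic):
    """Frobenius norm of the gap between the Pearson correlation matrices.

    Assumption: this is one common way to compare pairwise correlations of
    real and synthetic tables; the paper may define its measure differently.
    """
    real_corr = np.corrcoef(real, rowvar=False)        # columns are variables
    synth_corr = np.corrcoef(synthetic, rowvar=False)
    return float(np.linalg.norm(real_corr - synth_corr, ord="fro"))


# Toy data (hypothetical): two strongly correlated numeric columns.
rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]
real = rng.multivariate_normal([0.0, 0.0], cov, size=2000)

# A "good" synthesizer keeps the correlation; a "bad" one loses it.
good_synthetic = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
bad_synthetic = rng.normal(size=(2000, 2))  # independent columns

score_good = pairwise_correlation_difference(real, good_synthetic)
score_bad = pairwise_correlation_difference(real, bad_synthetic)
print(f"good: {score_good:.3f}, bad: {score_bad:.3f}")
```

A lower score means the synthetic data reproduce the pairwise correlations of the real data more closely; a score of zero means the two correlation matrices are identical.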