Synthetic Data Generation: A Comparative Study

Authors:
Markus Endres, University of Passau, Germany
Asha Mannarapotta Venugopal, University of Passau, Germany
Tung Son Tran, University of Passau, Germany
IDEAS '22: Proceedings of the 26th International Database Engineered Applications Symposium, August 2022, Pages 94–102. https://doi.org/10.1145/3548785.3548793

Published: 13 September 2022

ABSTRACT

Generating synthetic data that resembles real data is a crucial task in data augmentation and data production. Because synthetic data preserve the distribution of the authentic data while concealing sensitive information, they enable Big Data acquisition for model training without the associated privacy challenges. Nevertheless, obstacles arise at every stage, from acquiring real-world open-source data to synthesizing new samples that are as genuine as possible. In this paper, we conduct a comparative study of the efficacy of different generative models, namely Generative Adversarial Networks (GAN), Variational Autoencoder (VAE), Synthetic Minority Oversampling Technique (SMOTE), Data Synthesizer (DS), Synthetic Data Vault with Gaussian Copula (SDV-G), Conditional Generative Adversarial Networks (SDV-GAN), and the SynthPop Non-Parametric (SP-NP) approach, for synthesizing data across various datasets. For evaluation, we use the pairwise correlation and Synthetic Data (SD) metrics as utility measures between real and generated data. The processing time of every model is included as a further evaluation metric.
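
The abstract names pairwise correlation between real and generated data as a utility measure but does not give a formula. The following is a minimal sketch of one common way to compute such a measure, assuming Pearson correlation and a Frobenius norm over the difference of the two correlation matrices. The function name `pairwise_correlation_difference` and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def pairwise_correlation_difference(real, synthetic):
    """Frobenius norm of the difference between the Pearson correlation
    matrices of two numeric datasets (rows = samples, columns = features).

    Assumption: this is one plausible reading of "pairwise correlation"
    as a utility measure; lower values mean the synthetic data preserve
    the pairwise correlations of the real data more closely.
    """
    real_corr = np.corrcoef(real, rowvar=False)
    synth_corr = np.corrcoef(synthetic, rowvar=False)
    return float(np.linalg.norm(real_corr - synth_corr, ord="fro"))


# Toy data (hypothetical): two strongly correlated features.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
real = rng.multivariate_normal([0.0, 0.0], cov, size=2000)

# A "good" generator reproduces the correlation structure,
# a "bad" one emits independent columns.
good = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
bad = rng.normal(size=(2000, 2))

score_good = pairwise_correlation_difference(real, good)
score_bad = pairwise_correlation_difference(real, bad)
print(f"good generator: {score_good:.3f}, bad generator: {score_bad:.3f}")
```

In this sketch the generator that keeps the correlation between the two features scores lower than the one that emits independent columns. Categorical columns would need a different association measure, which is not covered here.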


Index Terms

1. General and reference
  1. Document types
    1. Surveys and overviews

Published in
IDEAS '22: Proceedings of the 26th International Database Engineered Applications Symposium
August 2022
174 pages
ISBN: 9781450397094
DOI: 10.1145/3548785
Editors: Bipin C. Desai (Concordia University), Peter Z. Revesz (University of Nebraska-Lincoln)
Copyright © 2022 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
Generative Models
Neural Networks
Synthetic Data
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 74 of 210 submissions, 35%