DOI: 10.1145/3548785.3548793
research-article

Synthetic Data Generation: A Comparative Study

Published: 13 September 2022

ABSTRACT

Generating synthetic data that resembles real data is a crucial task in data augmentation and data production. Because synthetic data preserve the distribution of the authentic data while concealing sensitive information, they enable Big Data acquisition for model training without privacy challenges. Nevertheless, obstacles arise at every stage, from acquiring real-world open-source data to synthesizing new samples that are as genuine as possible. In this paper, we conduct a comparative study of the efficacy of different generative approaches on various datasets: Generative Adversarial Networks (GAN), Variational Autoencoder (VAE), Synthetic Minority Oversampling Technique (SMOTE), Data Synthesizer (DS), Synthetic Data Vault with Gaussian Copula (SDV-G), Conditional Generative Adversarial Networks (SDV-GAN), and the SynthPop Non-Parametric (SP-NP) approach. For evaluation, we use pairwise correlation and the Synthetic Data (SD) metrics as utility measures between real data and generated data. The paper thus investigates the effects of the various data generation models, and the processing time of every model is included as one of the evaluation metrics.
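
The abstract names pairwise correlation as a utility measure but does not give its formula. As a rough illustration only, the sketch below assumes one common formulation: compute the Pearson correlation for every pair of numeric columns in the real and in the synthetic table, then average the absolute differences. The function names (`pearson`, `pairwise_correlations`, `correlation_gap`) and the dict-of-lists table layout are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant column is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(table):
    """Map each unordered column pair to its Pearson correlation.

    `table` is a dict of column name -> list of numeric values.
    """
    return {
        (a, b): pearson(table[a], table[b])
        for a, b in combinations(sorted(table), 2)
    }


def correlation_gap(real, synthetic):
    """Mean absolute difference between real and synthetic pairwise correlations.

    0.0 means the two tables share the same correlation structure;
    larger values mean the synthetic table distorts it more.
    Both tables must have the same columns (at least two).
    """
    real_corr = pairwise_correlations(real)
    synth_corr = pairwise_correlations(synthetic)
    total = sum(abs(real_corr[pair] - synth_corr[pair]) for pair in real_corr)
    return total / len(real_corr)
```

Under this assumed formulation, a generator that reproduces the real table's column relationships scores close to 0, while one that reverses a relationship between two columns scores close to 2 for that pair. Timing each generator with `time.perf_counter()` around its sampling call would be one way to record the processing-time metric alongside it.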

    • Published in

      IDEAS '22: Proceedings of the 26th International Database Engineered Applications Symposium
      August 2022
      174 pages
      ISBN:9781450397094
      DOI:10.1145/3548785

      Copyright © 2022 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate: 74 of 210 submissions, 35%
