
Tabular data synthesis with generative adversarial networks: design space and optimizations

  • Regular Paper
The VLDB Journal

Abstract

The proliferation of big data has created an urgent demand for privacy-preserving data publishing. Traditional solutions struggle to balance the trade-off between the privacy and the utility of the released data. To address this problem, the database and machine learning communities have recently studied tabular data synthesis using generative adversarial networks (GANs) and proposed various algorithms. However, a comprehensive comparison between GAN-based methods and conventional approaches is still lacking, so it remains unclear why and how GANs can outperform conventional approaches in synthesizing tabular data. Moreover, it is difficult for practitioners to understand which components are necessary when building a GAN model for tabular data synthesis. To bridge this gap, we conduct a comprehensive experimental study of applying GANs to tabular data synthesis. We introduce a unified GAN-based framework and define a space of design solutions for each component in the framework, including neural network architectures and training strategies. We provide optimization techniques to handle the practical difficulties of training GANs. We conduct extensive experiments to explore the design space and compare against traditional data synthesis approaches. We find that GANs are very promising for tabular data synthesis, and we provide guidance for selecting appropriate design choices. We also point out limitations of GANs and identify future research directions. We make all code and datasets public for future research.



Notes

  1. https://github.com/ruc-datalab/Daisy.

  2. In the paper, we use \(t\) and \(\varvec{t} \) to denote, respectively, a record in \({\mathcal {T}} \) and the d-dimensional sample transformed from record \(t\).

  3. https://github.com/ruc-datalab/Daisy.

  4. https://github.com/mahmoodm2/tableGAN.

  5. https://sourceforge.net/projects/privbayes/.


Acknowledgements

This work was partly supported by the NSF of China (62122090, 62072461, 62072458, 61925205, 62232009, and 62102215), CCF-Huawei Populus Grove Fund, the Fund for Building World-Class Universities (Disciplines) of Renmin University of China, the Research Funds of Renmin University of China, Huawei, TAL education, and Zhongguancun Laboratory.

Author information

Corresponding author

Correspondence to Ju Fan.


About this article


Cite this article

Liu, T., Fan, J., Li, G. et al. Tabular data synthesis with generative adversarial networks: design space and optimizations. The VLDB Journal 33, 255–280 (2024). https://doi.org/10.1007/s00778-023-00807-y


  • DOI: https://doi.org/10.1007/s00778-023-00807-y
