Tabular data synthesis with generative adversarial networks: design space and optimizations

Liu, Tongyu; Fan, Ju; Li, Guoliang; Tang, Nan; Du, Xiaoyong

doi:10.1007/s00778-023-00807-y

Regular Paper
Published: 15 August 2023

Volume 33, pages 255–280 (2024)

The VLDB Journal

Tongyu Liu¹,
Ju Fan ORCID: orcid.org/0000-0003-4729-9903¹,
Guoliang Li²,
Nan Tang³ &
Xiaoyong Du¹

Abstract

The proliferation of big data has created an urgent demand for privacy-preserving data publishing. Traditional solutions struggle to balance the trade-off between the privacy and the utility of the released data. To address this problem, the database and machine learning communities have recently studied tabular data synthesis using generative adversarial networks (GANs) and proposed various algorithms. However, a comprehensive comparison between GAN-based methods and conventional approaches is still lacking, so it remains unclear why and how GANs can outperform conventional approaches in synthesizing tabular data. Moreover, it is difficult for practitioners to understand which components are necessary when building a GAN model for tabular data synthesis. To bridge this gap, we conduct a comprehensive experimental study of applying GANs to tabular data synthesis. We introduce a unified GAN-based framework and define a space of design solutions for each component in the framework, including neural network architectures and training strategies. We provide optimization techniques to handle practical difficulties in training GANs. We conduct extensive experiments to explore the design space and compare against traditional data synthesis approaches. We find that GANs are very promising for tabular data synthesis, and we provide guidance for selecting appropriate design choices. We also point out limitations of GANs and identify future research directions. We make all code and datasets public for future research.

Notes

https://github.com/ruc-datalab/Daisy.
In the paper, we use \(t\) and \(\boldsymbol{t}\) to denote, respectively, a record in \(\mathcal{T}\) and the \(d\)-dimensional sample transformed from record \(t\).
https://github.com/ruc-datalab/Daisy.
https://github.com/mahmoodm2/tableGAN.
https://sourceforge.net/projects/privbayes/.
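
The notation note above refers to transforming a record \(t\) into a \(d\)-dimensional sample. The sketch below illustrates one common way to do this, assuming min-max scaling for numerical attributes and one-hot encoding for categorical ones. The encoding choice and the helper names `fit_schema` and `transform` are illustrative assumptions, not the paper's actual transformation or the Daisy API.

```python
from typing import List, Sequence, Tuple

# Hypothetical helpers for illustration only; not taken from the paper or its code.

def fit_schema(records: Sequence[Sequence]) -> List[Tuple[str, object]]:
    """Collect per-column statistics from a table of records.

    Numerical columns store (min, max); categorical columns store
    the sorted list of distinct values.
    """
    schema = []
    for j in range(len(records[0])):
        column = [r[j] for r in records]
        if all(isinstance(v, (int, float)) for v in column):
            schema.append(("num", (float(min(column)), float(max(column)))))
        else:
            schema.append(("cat", sorted(set(column))))
    return schema


def transform(record: Sequence, schema: List[Tuple[str, object]]) -> List[float]:
    """Map one record to a flat numeric vector.

    Numerical values are min-max scaled to [0, 1]; categorical values
    become one-hot segments. The output length d is the number of
    numerical columns plus the total number of categories.
    """
    vector: List[float] = []
    for value, (kind, info) in zip(record, schema):
        if kind == "num":
            lo, hi = info
            vector.append(0.0 if hi == lo else (value - lo) / (hi - lo))
        else:
            vector.extend(1.0 if value == category else 0.0 for category in info)
    return vector


# Toy table with one numerical and one categorical attribute (made-up data).
table = [(20, "A"), (40, "B"), (30, "A")]
schema = fit_schema(table)
sample = transform((30, "B"), schema)  # d = 1 + 2 = 3
```

Under this assumed encoding, a generator would output vectors of the same length \(d\), and the inverse mapping would turn them back into records.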

Acknowledgements

This work was partly supported by the NSF of China (62122090, 62072461, 62072458, 61925205, 62232009, and 62102215), CCF-Huawei Populus Grove Fund, the Fund for Building World-Class Universities (Disciplines) of Renmin University of China, the Research Funds of Renmin University of China, Huawei, TAL education, and Zhongguancun Laboratory.

Author information

Authors and Affiliations

Renmin University of China, Beijing, 100872, China
Tongyu Liu, Ju Fan & Xiaoyong Du
Tsinghua University, Beijing, 100084, China
Guoliang Li
HKUST (GZ), Guangzhou, 511455, China
Nan Tang

Corresponding author

Correspondence to Ju Fan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, T., Fan, J., Li, G. et al. Tabular data synthesis with generative adversarial networks: design space and optimizations. The VLDB Journal 33, 255–280 (2024). https://doi.org/10.1007/s00778-023-00807-y

Received: 30 August 2022
Revised: 08 May 2023
Accepted: 16 July 2023
Published: 15 August 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s00778-023-00807-y
