DOI: 10.1145/3632410.3632438
short-paper

Tabular Data Synthesis with GANs for Adaptive AI Models

Published: 04 January 2024

Abstract

In situations such as demographic change, ML models often perform poorly because the training data does not adequately represent the environment. Privacy concerns worsen the issue by severely limiting the available training data. In this paper, we present a framework that takes as input an original dataset and a set of user-defined constraints, expressed as marginal distributions of selected columns, and uses a GAN-based synthesizer to generate synthetic data that satisfies these constraints while striving to preserve the distributions observed in the original data. The result is a customizable and realistic data generation solution that balances constraint satisfaction against preservation of data distributions. We validate and demonstrate the effectiveness of our technique through experiments.
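To make the idea of a marginal-distribution constraint concrete, the following is a minimal sketch and not the paper's GAN-based method. It swaps in plain weighted resampling: each row is weighted so that one chosen column follows a user-supplied target marginal, while rows sharing the same value of that column keep their relative frequencies. The column names (`gender`, `placed`), the toy data, and the helper names are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def marginal(rows, column):
    """Return the empirical marginal distribution of `column` (value -> share)."""
    counts = Counter(row[column] for row in rows)
    return {value: count / len(rows) for value, count in counts.items()}


def resample_to_marginal(rows, column, target, n, seed=0):
    """Draw n rows (with replacement) so that `column` follows `target`.

    Each row gets weight target[v] / observed[v], where v is its value in
    `column`. Rows with the same v keep equal weight, so the distribution of
    the other columns conditional on v is preserved in expectation.
    """
    observed = marginal(rows, column)
    weights = [target.get(row[column], 0.0) / observed[row[column]] for row in rows]
    rng = random.Random(seed)
    return rng.choices(rows, weights=weights, k=n)


# Hypothetical toy table: 20% "F" and 80% "M" in the original data.
original = (
    [{"gender": "F", "placed": 1}] * 10
    + [{"gender": "F", "placed": 0}] * 10
    + [{"gender": "M", "placed": 1}] * 60
    + [{"gender": "M", "placed": 0}] * 20
)

# User-defined constraint (assumed form): a balanced marginal on "gender".
constraint = {"F": 0.5, "M": 0.5}
synthetic = resample_to_marginal(original, "gender", constraint, n=20000)

print(marginal(original, "gender"))
print(marginal(synthetic, "gender"))
```

Resampling can only repeat rows that already exist, which is one reason a generative model such as the GAN-based synthesizer described in the abstract is attractive when the original data is scarce.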

        Published In

        CODS-COMAD '24: Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD)
        January 2024
        627 pages
        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • Short-paper
        • Research
        • Refereed limited

        Conference

        CODS-COMAD 2024
