
Frugal Generative Modeling for Tabular Data

  • Conference paper
  • In: Machine Learning and Knowledge Discovery in Databases. Research Track and Demo Track (ECML PKDD 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14948)


Abstract

This paper presents Gmda, a generative modeling approach designed for tabular data and adapted to its arbitrary feature correlation structure. The generative model is trained so that sampled regions of the feature space contain the same fraction of true and synthetic samples, aligning the true and synthetic data distributions through a frugal and sound learning criterion. On the usual performance indicators (pairwise correlation errors, precision, recall, predictive performance), Gmda is on par with or better than state-of-the-art approaches for tabular data based on VAEs, GANs, or diffusion models. Its key merit is that the resulting generative models are one or more orders of magnitude more frugal than the baseline approaches.
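
To picture the region-matching idea, here is a minimal NumPy sketch of ours. It is illustrative only, not Gmda's actual loss or region-sampling scheme (both are defined in the paper): the hyperball regions of fixed radius, the squared-gap penalty, and the names `region_fractions` and `alignment_penalty` are assumptions made for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def region_fractions(center, radius, real, synthetic):
    """Fraction of real and of synthetic points falling inside a hyperball."""
    in_real = np.linalg.norm(real - center, axis=1) < radius
    in_syn = np.linalg.norm(synthetic - center, axis=1) < radius
    return in_real.mean(), in_syn.mean()

def alignment_penalty(real, synthetic, n_regions=64, radius=2.0):
    """Mean squared gap between real and synthetic mass over sampled regions.

    Regions are seeded at randomly drawn real samples; if the two
    distributions coincide, every region holds the same fraction of real
    and synthetic points and the penalty vanishes in expectation.
    """
    seeds = real[rng.choice(len(real), size=n_regions)]
    gaps = []
    for s in seeds:
        p_real, p_syn = region_fractions(s, radius, real, synthetic)
        gaps.append((p_real - p_syn) ** 2)
    return float(np.mean(gaps))

# Toy check: synthetic samples from the true distribution incur a smaller
# penalty than samples from a shifted distribution.
real = rng.normal(size=(1000, 5))
good = rng.normal(size=(1000, 5))
bad = rng.normal(loc=1.0, size=(1000, 5))
print(alignment_penalty(real, good))  # close to 0 (sampling noise only)
print(alignment_penalty(real, bad))   # clearly larger
```

In the paper, the generator is trained against a criterion of this kind rather than evaluated post hoc as in the toy check above.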


Notes

  1. The code and the supplementary material (SM) are publicly available at https://github.com/ablacan/gmda.

  2. A small quantity is added to the denominator of \(\mathcal{L}(DH)\) to prevent numerical instabilities (see the generic sketch after these notes).

  3. Although such metrics are sensitive to outliers, we argue that they remain empirically more stable and interpretable than the alternative formulations suggested in [1].

  4. Along the same line, refined heuristics for selecting the seeds, accounting for how often each seed was selected in former epochs, did not improve the performance.
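
The stabilization mentioned in footnote 2 follows a standard pattern. Since \(\mathcal{L}(DH)\) itself is defined in the paper, the snippet below only illustrates the generic idea on a hypothetical ratio-shaped loss; the name `ratio_loss` and its arguments are placeholders of ours.

```python
EPS = 1e-8  # small constant added to the denominator, as in footnote 2

def ratio_loss(numerator: float, denominator: float, eps: float = EPS) -> float:
    # Hypothetical ratio-shaped loss: without eps, a denominator that
    # underflows toward zero would blow up both the value and its gradient.
    return numerator / (denominator + eps)
```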

References

  1. Alaa, A., Van Breugel, B., Saveliev, E.S., van der Schaar, M.: How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In: ICML (2022)

  2. Bhanot, K., Qi, M., Erickson, J.S., Guyon, I., Bennett, K.P.: The problem of fairness in synthetic healthcare data. Entropy 23(9), 1165 (2021)

  3. Brown, T., Mann, B., et al.: Language models are few-shot learners. In: NeurIPS (2020)

  4. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)

  5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (2019)

  6. Engelmann, J., Lessmann, S.: Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning. Expert Syst. Appl. 174, 114582 (2021)

  7. Gogoshin, G., Branciamore, S., Rodin, A.S.: Synthetic data generation with probabilistic Bayesian networks. Math. Biosci. Eng. 18(6), 8603–8621 (2021)

  8. Goodfellow, I., et al.: Generative adversarial nets. In: NeurIPS (2014)

  9. Gorishniy, Y., Rubachev, I., Khrulkov, V., Babenko, A.: Revisiting deep learning models for tabular data. In: NeurIPS (2021)

  10. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of Wasserstein GANs. In: NeurIPS (2017)

  11. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)

  12. Jiralerspong, M., Bose, J., Gemp, I., Qin, C., Bachrach, Y., Gidel, G.: Feature likelihood score: evaluating the generalization of generative models using samples. In: NeurIPS (2023)

  13. Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., Saul, L.K.: An introduction to variational methods for graphical models. Mach. Learn. 37(2), 183–233 (1999)

  14. Kerbl, B., Kopanas, G., Leimkuehler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4), 1–14 (2023)

  15. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: ICLR (2014)

  16. Kotelnikov, A., Baranchuk, D., Rubachev, I., Babenko, A.: TabDDPM: modelling tabular data with diffusion models. In: ICML (2023)

  17. Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., Aila, T.: Improved precision and recall metric for assessing generative models. In: NeurIPS (2019)

  18. Lonsdale, J., et al.: The genotype-tissue expression (GTEx) project. Nat. Genet. 45(6), 580–585 (2013)

  19. Nguyen, H.M., Cooper, E.W., Kamei, K.: Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradigms 3, 4–21 (2009)

  20. Onishi, S., Meguro, S.: Rethinking data augmentation for tabular data in deep learning. arXiv (2023)

  21. Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on generative adversarial networks. Proc. VLDB Endow. 11(10), 1071–1083 (2018)

  22. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A.V., Gulin, A.: CatBoost: unbiased boosting with categorical features. In: NeurIPS (2018)

  23. Radford, A., Kim, J.W., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)

  24. Sajjadi, M.S.M., Bachem, O., Lucic, M., Bousquet, O., Gelly, S.: Assessing generative models via precision and recall. In: NeurIPS (2018)

  25. Schultz, K., Bej, S., Hahn, W., Wolfien, M., Srivastava, P., Wolkenhauer, O.: ConvGeN: a convex space learning approach for deep-generative oversampling and imbalanced classification of small tabular datasets. Pattern Recogn. 147, 110138 (2024)

  26. Schwartz, R., Dodge, J., Smith, N., Etzioni, O.: Green AI. Commun. ACM 63, 54–63 (2020)

  27. Sun, Y., Cuesta-Infante, A., Veeramachaneni, K.: Learning vine copula models for synthetic data generation. Proc. AAAI 33(01), 5049–5057 (2019)

  28. Verine, A., Negrevergne, B., Pydi, M.S., Chevaleyre, Y.: Precision-recall divergence optimization for generative modeling with GANs and normalizing flows. In: NeurIPS (2023)

  29. Weinstein, J.N., et al.: The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45 (2013)

  30. Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: NeurIPS (2019)

  31. Yoon, J., Jordon, J., van der Schaar, M.: PATE-GAN: generating synthetic data with differential privacy guarantees. In: ICLR (2019)

  32. Zhang, H., et al.: Mixed-type tabular data synthesis with score-based diffusion in latent space. In: ICLR (2024)

  33. Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. ACM Trans. Database Syst. 42(4), 1–41 (2017)


Acknowledgements

This research was supported by the Labex DigiCosme (University Paris-Saclay) and by a public grant overseen by the French National Research Agency (ANR) through the UDOPIA program (project ANR-20-THIA-0013-01).

Author information


Corresponding author

Correspondence to Alice Lacan.


Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 531 KB)


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lacan, A., Hanczar, B., Sebag, M. (2024). Frugal Generative Modeling for Tabular Data. In: Bifet, A., et al. (eds.) Machine Learning and Knowledge Discovery in Databases. Research Track and Demo Track. ECML PKDD 2024. Lecture Notes in Computer Science, vol. 14948. Springer, Cham. https://doi.org/10.1007/978-3-031-70371-3_4


  • DOI: https://doi.org/10.1007/978-3-031-70371-3_4


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70370-6

  • Online ISBN: 978-3-031-70371-3

  • eBook Packages: Computer Science, Computer Science (R0)
