Abstract
This paper presents Gmda, a generative modeling approach designed for tabular data and its arbitrary feature correlation structure. The generative model is trained so that sampled regions of the feature space contain the same fraction of true and synthetic samples, aligning the true and synthetic data distributions through a frugal and sound learning criterion. On the usual performance indicators (pairwise correlation errors, precision, recall, predictive performance), Gmda is on par with or better than state-of-the-art approaches for tabular data based on VAEs, GANs, or diffusion models. Its key merit is that it yields generative models that are one or more orders of magnitude more frugal than the baseline approaches.
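As a rough illustration of the criterion sketched above, the following is a minimal NumPy sketch of a region-based alignment loss: regions are hyperballs centred on seed points drawn from the real data, and the loss penalizes regions whose fractions of real and synthetic samples disagree. The hyperball construction, the fixed radius, and the squared-discrepancy form are illustrative assumptions, not the authors' exact objective; see the repository linked in note 1 for the actual implementation.

```python
import numpy as np

def density_alignment_loss(x_real, x_fake, n_regions=64, radius=0.5, eps=1e-8):
    """Toy region-based alignment criterion (illustrative sketch only).

    Regions are hyperballs centred on seeds drawn from the real data; the
    loss penalizes regions where the fractions of real and synthetic
    samples differ. All design choices here are assumptions, not Gmda's
    exact formulation.
    """
    rng = np.random.default_rng(0)
    idx = rng.choice(len(x_real), size=min(n_regions, len(x_real)), replace=False)
    seeds = x_real[idx]

    loss = 0.0
    for s in seeds:
        # Fractions of real and synthetic samples falling inside the ball.
        p_real = np.mean(np.linalg.norm(x_real - s, axis=1) < radius)
        p_fake = np.mean(np.linalg.norm(x_fake - s, axis=1) < radius)
        # A small eps in the denominator prevents numerical instabilities,
        # in the spirit of note 2 below.
        loss += (p_real - p_fake) ** 2 / (p_real + p_fake + eps)
    return loss / len(seeds)

# Toy usage on random data, for shape-checking only.
x_real = np.random.default_rng(1).normal(size=(1000, 8))
x_fake = np.random.default_rng(2).normal(loc=0.3, size=(1000, 8))
print(density_alignment_loss(x_real, x_fake))
```

In an actual training loop the indicator counts would have to be smoothed (e.g. with a soft kernel) so that the criterion is differentiable with respect to the generator's parameters.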
Notes
1. The code and the supplementary material (SM) are publicly available at https://github.com/ablacan/gmda.
2. A small quantity is added to the denominator of \(\mathcal{L}(D_H)\) to prevent numerical instabilities.
3. Although such metrics are sensitive to outliers, we argue that they remain empirically more stable and interpretable than the alternative formulations suggested in [1].
4. Along the same line, using refined heuristics to select the seeds, accounting for how often they were selected in previous epochs, did not improve performance.
References
Alaa, A., van Breugel, B., Saveliev, E.S., van der Schaar, M.: How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In: ICML (2022)
Bhanot, K., Qi, M., Erickson, J.S., Guyon, I., Bennett, K.P.: The problem of fairness in synthetic healthcare data. Entropy 23(9), 1165 (2021)
Brown, T., Mann, B., et al.: Language models are few-shot learners. In: NeurIPS (2020)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
Engelmann, J., Lessmann, S.: Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning. Expert Syst. Appl. 174, 114582 (2021)
Gogoshin, G., Branciamore, S., Rodin, A.S.: Synthetic data generation with probabilistic Bayesian networks. Math. Biosci. Eng. 18(6), 8603–8621 (2021)
Goodfellow, I., et al.: Generative adversarial nets. In: NeurIPS (2014)
Gorishniy, Y., Rubachev, I., Khrulkov, V., Babenko, A.: Revisiting deep learning models for tabular data. In: NeurIPS (2021)
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of Wasserstein GANs. In: NeurIPS (2017)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)
Jiralerspong, M., Bose, J., Gemp, I., Qin, C., Bachrach, Y., Gidel, G.: Feature likelihood score: evaluating the generalization of generative models using samples. In: NeurIPS (2023)
Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., Saul, L.K.: An introduction to variational methods for graphical models. Mach. Learn. 37(2), 183–233 (1999)
Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4), 1–14 (2023)
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: ICLR (2014)
Kotelnikov, A., Baranchuk, D., Rubachev, I., Babenko, A.: TabDDPM: modelling tabular data with diffusion models. In: ICML (2023)
Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., Aila, T.: Improved precision and recall metric for assessing generative models. In: NeurIPS (2019)
Lonsdale, J., et al.: The genotype-tissue expression (GTEx) project. Nat. Genet. 45(6), 580–585 (2013)
Nguyen, H.M., Cooper, E.W., Kamei, K.: Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradigms 3, 4–21 (2009)
Onishi, S., Meguro, S.: Rethinking data augmentation for tabular data in deep learning. arXiv (2023)
Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on generative adversarial networks. Proc. VLDB Endow. 11(10), 1071–1083 (2018)
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A.V., Gulin, A.: CatBoost: unbiased boosting with categorical features. In: NeurIPS (2018)
Radford, A., Kim, J.W., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Sajjadi, M.S.M., Bachem, O., Lucic, M., Bousquet, O., Gelly, S.: Assessing generative models via precision and recall. In: NeurIPS (2018)
Schultz, K., Bej, S., Hahn, W., Wolfien, M., Srivastava, P., Wolkenhauer, O.: ConvGeN: a convex space learning approach for deep-generative oversampling and imbalanced classification of small tabular datasets. Pattern Recogn. 147, 110138 (2024)
Schwartz, R., Dodge, J., Smith, N., Etzioni, O.: Green AI. Commun. ACM 63, 54–63 (2020)
Sun, Y., Cuesta-Infante, A., Veeramachaneni, K.: Learning vine copula models for synthetic data generation. Proc. AAAI 33(01), 5049–5057 (2019)
Verine, A., Negrevergne, B., Pydi, M.S., Chevaleyre, Y.: Precision-recall divergence optimization for generative modeling with GANs and normalizing flows. In: NeurIPS (2023)
Weinstein, J.N., et al.: The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45(10), 1113–1120 (2013)
Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: NeurIPS (2019)
Yoon, J., Jordon, J., van der Schaar, M.: PATE-GAN: generating synthetic data with differential privacy guarantees. In: ICLR (2019)
Zhang, H., et al.: Mixed-type tabular data synthesis with score-based diffusion in latent space. In: ICLR (2024)
Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. ACM Trans. Database Syst. 42(4), 1–41 (2017)
Acknowledgements
This research was supported by the Labex DigiCosme (University Paris-Saclay) and by a public grant overseen by the French National Research Agency (ANR) through the UDOPIA program (project ANR-20-THIA-0013-01).
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lacan, A., Hanczar, B., Sebag, M. (2024). Frugal Generative Modeling for Tabular Data. In: Bifet, A., et al. (eds.) Machine Learning and Knowledge Discovery in Databases. Research Track and Demo Track. ECML PKDD 2024. Lecture Notes in Computer Science, vol. 14948. Springer, Cham. https://doi.org/10.1007/978-3-031-70371-3_4
DOI: https://doi.org/10.1007/978-3-031-70371-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70370-6
Online ISBN: 978-3-031-70371-3
eBook Packages: Computer Science; Computer Science (R0)