MeTGAN: Memory Efficient Tabular GAN for High Cardinality Categorical Datasets

Singh, Shreyansh; Kayathwal, Kanishka; Wadhwa, Hardik; Dhama, Gaurav

doi:10.1007/978-3-030-92310-5_60

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1517)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

The use of Generative Adversarial Networks (GANs) for generating synthetic data has expanded from unstructured data, such as images, to structured tabular data. CTGAN, one of the recently proposed models for tabular data generation, demonstrated state-of-the-art performance on this task even in the presence of high class imbalance in categorical columns or multiple modes in continuous columns, and many later methods have derived ideas from it. However, training CTGAN has a high memory footprint when the dataset contains high cardinality categorical columns. In this paper, we propose MeTGAN, a memory-efficient version of CTGAN, which reduces memory usage by roughly 80% with a minimal effect on performance. MeTGAN uses sparse linear layers to overcome the memory bottlenecks of CTGAN. We compare the performance of MeTGAN with that of other models on publicly available datasets, considering three metrics: quality of data generation, memory requirements, and the privacy guarantees of the models. This paper also aims to draw the attention of the research community to the computational footprint of tabular data generation methods, so that they can be applied to larger datasets, especially ones with high cardinality categorical variables.
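
The abstract does not describe how the sparse linear layers are built, so the following is only a rough sketch of the general idea and not the authors' implementation. It assumes a single one-hot encoded categorical column, 4-byte values, and 8-byte indices in a COO-style sparse layout; the function names and the batch and cardinality figures are hypothetical. It compares the storage a dense one-hot batch would need with a sparse encoding, and checks that multiplying a one-hot row by a weight matrix gives the same result as selecting one weight row.

```python
import random


def dense_onehot_bytes(batch_size, cardinality, bytes_per_value=4):
    """Bytes needed to store a dense one-hot batch (one value per cell)."""
    return batch_size * cardinality * bytes_per_value


def sparse_onehot_bytes(batch_size, bytes_per_index=8, bytes_per_value=4):
    """Bytes for a COO-style encoding: one (row, col) pair and one value per row."""
    return batch_size * (2 * bytes_per_index + bytes_per_value)


def dense_linear(onehot_rows, weight):
    """Compute x @ W for dense one-hot rows; weight is a list of rows."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns]
            for row in onehot_rows]


def sparse_linear(category_ids, weight):
    """Same product without materialising one-hot rows: select weight rows."""
    return [list(weight[c]) for c in category_ids]


if __name__ == "__main__":
    # Hypothetical sizes: 500 rows per batch, 100,000 distinct categories.
    print("dense bytes: ", dense_onehot_bytes(500, 100_000))
    print("sparse bytes:", sparse_onehot_bytes(500))

    # Small equivalence check with 5 categories and 3 output units.
    random.seed(0)
    weight = [[random.random() for _ in range(3)] for _ in range(5)]
    ids = [4, 0, 2]
    onehot = [[1.0 if j == c else 0.0 for j in range(5)] for c in ids]
    print(dense_linear(onehot, weight) == sparse_linear(ids, weight))
```

In this sketch the dense cost grows with the number of categories while the sparse cost depends only on the batch size, which is one way to read the memory bottleneck the abstract refers to.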

Author information

Authors and Affiliations

AI Garage, Mastercard, Gurugram, India
Shreyansh Singh, Kanishka Kayathwal, Hardik Wadhwa & Gaurav Dhama

Corresponding author

Correspondence to Shreyansh Singh.

Editor information

Editors and Affiliations

Sampoerna University, Jakarta, Indonesia
Teddy Mantoro
Kyungpook National University, Daegu, Korea (Republic of)
Minho Lee
Sampoerna University, Jakarta, Indonesia
Media Anugerah Ayu
Murdoch University, Murdoch, WA, Australia
Kok Wai Wong
Universitas Indonesia, Depok, Indonesia
Achmad Nizar Hidayanto

About this paper

Cite this paper

Singh, S., Kayathwal, K., Wadhwa, H., Dhama, G. (2021). MeTGAN: Memory Efficient Tabular GAN for High Cardinality Categorical Datasets. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Communications in Computer and Information Science, vol 1517. Springer, Cham. https://doi.org/10.1007/978-3-030-92310-5_60

DOI: https://doi.org/10.1007/978-3-030-92310-5_60
Published: 02 December 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92309-9
Online ISBN: 978-3-030-92310-5
eBook Packages: Computer Science, Computer Science (R0)
