An Evaluation Framework for Synthetic Data Generation Models

  • Conference paper
Artificial Intelligence Applications and Innovations (AIAI 2024)

Abstract

The use of synthetic data has gained popularity as a cost-efficient strategy for data augmentation, both to improve the performance of machine learning models and to address privacy concerns around sensitive data. Ensuring the quality of the generated synthetic data, in terms of how accurately it represents the real data, is therefore of primary importance. In this work, we present a new framework for evaluating the ability of synthetic data generation models to produce high-quality synthetic data. The proposed approach provides strong statistical and theoretical information about the evaluation framework and the ranking of the compared models. Two use-case scenarios demonstrate the applicability of the proposed framework for evaluating the ability of synthetic data generation models to generate high-quality data.

Notes

  1. https://doi.org/10.24432/C5C31Q.

  2. https://doi.org/10.34740/kaggle/dsv/7009925.

  3. The model parameters for each use case, as well as the implementation code, can be found at https://github.com/novelcore/synthetic_data_evaluation_framework.

Acknowledgements

This work has been conducted under the auspices of the General Secretariat for Research Innovation, for the certification of research and development expenditure programme, under No. 82216, project Synthetic data synthesis - Datafication of social data.

Author information

Corresponding author

Correspondence to I. E. Livieris.

Copyright information

© 2024 IFIP International Federation for Information Processing

About this paper

Cite this paper

Livieris, I.E., Alimpertis, N., Domalis, G., Tsakalidis, D. (2024). An Evaluation Framework for Synthetic Data Generation Models. In: Maglogiannis, I., Iliadis, L., Macintyre, J., Avlonitis, M., Papaleonidas, A. (eds) Artificial Intelligence Applications and Innovations. AIAI 2024. IFIP Advances in Information and Communication Technology, vol 713. Springer, Cham. https://doi.org/10.1007/978-3-031-63219-8_24

  • DOI: https://doi.org/10.1007/978-3-031-63219-8_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-63218-1

  • Online ISBN: 978-3-031-63219-8

  • eBook Packages: Computer Science (R0)
