An Evaluation Framework for Synthetic Data Generation Models

Livieris, I. E.; Alimpertis, N.; Domalis, G.; Tsakalidis, D.

doi:10.1007/978-3-031-63219-8_24

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 713)

Included in the following conference series:

IFIP International Conference on Artificial Intelligence Applications and Innovations

Abstract

Synthetic data has gained popularity as a cost-efficient strategy for data augmentation, both to improve the performance of machine learning models and to address privacy concerns around sensitive data. Ensuring the quality of generated synthetic data, in terms of how accurately it represents the real data, is therefore of primary importance. In this work, we present a new framework for evaluating the ability of synthetic data generation models to produce high-quality synthetic data. The proposed approach provides strong statistical and theoretical information about the evaluation framework and the ranking of the compared models. Two use-case scenarios demonstrate the applicability of the proposed framework for evaluating the ability of synthetic data generation models to generate high-quality data.

Notes

1. https://doi.org/10.24432/C5C31Q.
2. https://doi.org/10.34740/kaggle/dsv/7009925.
3. The model parameters for each use case, as well as the implementation code, can be found at https://github.com/novelcore/synthetic_data_evaluation_framework.

Acknowledgements

This work has been conducted under the auspices of the General Secretariat for Research Innovation, for the certification of research and development expenditure programme, under No. 82216, project Synthetic data synthesis - Datafication of social data.

Author information

Authors and Affiliations

Novelcore, 10436, Athens, Greece
I. E. Livieris, N. Alimpertis, G. Domalis & D. Tsakalidis
Department of Statistics and Insurance Science, University of Piraeus, Piraeus, Greece
I. E. Livieris

Corresponding author

Correspondence to I. E. Livieris.

Editor information

Editors and Affiliations

University of Piraeus, Piraeus, Greece
Ilias Maglogiannis
Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis
University of Abertay, Dundee, UK
John Macintyre
Ionian University, Corfu, Greece
Markos Avlonitis
Democritus University of Thrace, Xanthi, Greece
Antonios Papaleonidas

Cite this paper

Livieris, I.E., Alimpertis, N., Domalis, G., Tsakalidis, D. (2024). An Evaluation Framework for Synthetic Data Generation Models. In: Maglogiannis, I., Iliadis, L., Macintyre, J., Avlonitis, M., Papaleonidas, A. (eds) Artificial Intelligence Applications and Innovations. AIAI 2024. IFIP Advances in Information and Communication Technology, vol 713. Springer, Cham. https://doi.org/10.1007/978-3-031-63219-8_24

DOI: https://doi.org/10.1007/978-3-031-63219-8_24
Published: 22 June 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-63218-1
Online ISBN: 978-3-031-63219-8
eBook Packages: Computer Science, Computer Science (R0)
