Abstract
The use of synthetic data has gained popularity as a cost-efficient strategy for data augmentation, both to improve the performance of machine learning models and to address privacy concerns related to sensitive data. Ensuring the quality of generated synthetic data, in terms of how accurately it represents the real data, is therefore of primary importance. In this work, we present a new framework for evaluating the ability of synthetic data generation models to produce high-quality synthetic data. The proposed approach provides strong statistical and theoretical information about the evaluation framework and the ranking of the compared models. Two use-case scenarios demonstrate the applicability of the proposed framework for evaluating the ability of synthetic data generation models to generate high-quality data.
Notes
The model parameters for each use case, as well as the implementation code, can be found at https://github.com/novelcore/synthetic_data_evaluation_framework.
Acknowledgements
This work has been conducted under the auspices of the General Secretariat for Research Innovation, for the certification of research and development expenditure programme, under No. 82216, project Synthetic data synthesis - Datafication of social data.
Copyright information
© 2024 IFIP International Federation for Information Processing
About this paper
Cite this paper
Livieris, I.E., Alimpertis, N., Domalis, G., Tsakalidis, D. (2024). An Evaluation Framework for Synthetic Data Generation Models. In: Maglogiannis, I., Iliadis, L., Macintyre, J., Avlonitis, M., Papaleonidas, A. (eds) Artificial Intelligence Applications and Innovations. AIAI 2024. IFIP Advances in Information and Communication Technology, vol 713. Springer, Cham. https://doi.org/10.1007/978-3-031-63219-8_24
DOI: https://doi.org/10.1007/978-3-031-63219-8_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-63218-1
Online ISBN: 978-3-031-63219-8
eBook Packages: Computer Science, Computer Science (R0)