Abstract
Differentially private algorithmic synthetic data generation (SDG) solutions take input datasets \(D_p\) consisting of sensitive, private data and generate synthetic data \(D_s\) with similar qualities. Such solutions are growing in importance, both because more people realize how much data is collected about them and used in machine learning contexts and because of newly introduced data privacy regulations, e.g. the EU’s General Data Protection Regulation (GDPR). We aim to develop a novel, composite SDG evaluation metric that weighs macro-statistical dataset similarity and data utility in machine learning tasks against the privacy boundaries of the synthetic data. We formalize the mathematical foundations for quantitatively measuring both the statistical similarity and the data utility of synthetic data. We use two well-known datasets containing (potentially) personally identifiable information as inputs (\(D_p\)) and the existing SDG algorithms PrivBayes and DPGroupFields to generate synthetic data (\(D_s\)) from them. We then test our evaluation metric for different values of the privacy budget \(\epsilon\). Based on our experiments, we conclude that the proposed composite evaluation metric is appropriate for quantitatively measuring the quality of synthetic data generated by different SDG solutions and shows the expected sensitivity to the privacy budget.
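To make the role of the privacy budget concrete, the following is a minimal sketch of how a similarity score can react to \(\epsilon\). It is not the composite metric proposed in the paper, and it does not use PrivBayes or DPGroupFields: it releases a single histogram with the standard Laplace mechanism and measures the total variation distance to the original distribution. The function names, the example counts, and the assumption of sensitivity 1 (neighbouring datasets differ by adding or removing one record) are illustrative choices only.

```python
import random


def laplace_noise(scale, rng):
    # A Laplace(0, scale) sample is the difference of two independent
    # exponential samples with mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(counts, epsilon, rng):
    # Laplace mechanism on histogram counts.
    # Assumption: L1 sensitivity is 1 (add/remove one record).
    scale = 1.0 / epsilon
    noisy = [max(c + laplace_noise(scale, rng), 0.0) for c in counts]
    total = sum(noisy)
    if total == 0.0:
        # Degenerate case: fall back to a uniform distribution.
        return [1.0 / len(counts)] * len(counts)
    # Clamping and normalizing are post-processing steps.
    return [v / total for v in noisy]


def total_variation(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mean_tv(counts, epsilon, trials=200, seed=0):
    # Average distance between the true and the noisy distribution.
    rng = random.Random(seed)
    total = sum(counts)
    true_dist = [c / total for c in counts]
    distances = [
        total_variation(true_dist, dp_histogram(counts, epsilon, rng))
        for _ in range(trials)
    ]
    return sum(distances) / trials


if __name__ == "__main__":
    example_counts = [120, 300, 80, 500]  # hypothetical category counts
    for eps in (0.01, 0.1, 1.0, 10.0):
        print(eps, mean_tv(example_counts, eps))
```

In this toy setting a smaller \(\epsilon\) means more noise and a larger average distance, which is the kind of sensitivity to the privacy budget that an evaluation metric is expected to show.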
Notes
1. Because this paper focuses on evaluation, the most popular SDG approaches are only listed here; their detailed description is outside its scope.
2. The important role of \(\epsilon\) in DP justifies its presence as a subscript of G in our evaluation metric definition, since we evaluate G at varying values of \(\epsilon\).
References
Acs, G., Castelluccia, C., Chen, R.: Differentially private histogram publishing through lossy compression. In: 2012 IEEE 12th International Conference on Data Mining, pp. 1–10. IEEE (2012)
Asghar, H.J., Ding, M., Rakotoarivelo, T., Mrabet, S., Kaafar, D.: Differentially private release of datasets using Gaussian copula. J. Priv. Confidentiality 10(2) (2020)
Baak, M., Koopman, R., Snoek, H., Klous, S.: A new correlation coefficient between categorical, ordinal and interval variables with Pearson characteristics. Comput. Stat. Data Anal. 152, 107043 (2020)
Barak, B., Chaudhuri, K., Dwork, C., Kale, S., McSherry, F., Talwar, K.: Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 273–282 (2007)
Bowen, C.M., Snoke, J.: Comparative study of differentially private synthetic data algorithms and evaluation standards (2019). arXiv preprint arXiv:1911.12704
Cormode, G., Procopiuc, C., Srivastava, D., Shen, E., Yu, T.: Differentially private spatial decompositions. In: 2012 IEEE 28th International Conference on Data Engineering, pp. 20–31. IEEE (2012)
Cormode, G., Procopiuc, C., Srivastava, D., Tran, T.T.: Differentially private summaries for sparse data. In: Proceedings of the 15th International Conference on Database Theory, pp. 299–311 (2012)
Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14
Dwork, C., Roth, A., et al.: The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9(3–4), 211–407 (2014)
Hittmeir, M., Ekelhart, A., Mayer, R.: On the utility of synthetic data: an empirical evaluation on machine learning tasks. In: Proceedings of the 14th International Conference on Availability, Reliability and Security, pp. 1–6 (2019)
Howe, B., Stoyanovich, J., Ping, H., Herman, B., Gee, M.: Synthetic data for social good (2017). arXiv preprint arXiv:1710.08874
Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge (2009)
Li, H., Xiong, L., Jiang, X.: Differentially private synthesization of multi-dimensional data using copula functions. In: Advances in Database Technology: Proceedings. International Conference on Extending Database Technology, vol. 2014, p. 475. NIH Public Access (2014)
Li, H., Xiong, L., Zhang, L., Jiang, X.: Dpsynthesizer: differentially private data synthesizer for privacy preserving data sharing. In: Proceedings of the VLDB Endowment International Conference on Very Large Data Bases, vol. 7, p. 1677. NIH Public Access (2014)
McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pp. 94–103. IEEE (2007)
Patki, N., Wedge, R., Veeramachaneni, K.: The synthetic data vault. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 399–410. IEEE (2016)
Ping, H., Stoyanovich, J., Howe, B.: Datasynthesizer: privacy-preserving synthetic datasets. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management, pp. 1–5 (2017)
Sklar, A.: Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 8, 229–231 (1959)
Tsybakov, A.B.: Introduction to Nonparametric Estimation. Springer Science & Business Media, Berlin (2008)
Xiao, X., Wang, G., Gehrke, J.: Differential privacy via wavelet transforms. IEEE Trans. Knowl. Data Eng. 23(8), 1200–1214 (2010)
Zhang, J., Zheng, K., Mou, W., Wang, L.: Efficient private ERM for smooth objectives. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI 2017, pp. 3922–3928. AAAI Press (2017)
Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. ACM Trans. Database Syst. (TODS) 42(4), 1–41 (2017)
Zhang, J., Xiao, X., Yang, Y., Zhang, Z., Winslett, M.: Privgene: differentially private model fitting using genetic algorithms. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 665–676 (2013)
Acknowledgments
This research was co-funded by EIT Digital Industrial Doctorate and Ericsson Hungary. Project no. ED_18-1-2019-0030 (Application domain specific highly reliable IT solutions subprogramme) has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the Thematic Excellence Programme funding scheme. We are thankful to Gian Marco Canneori for the fruitful discussions leading to the final mathematical foundations presented in this paper.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Galloni, A., Lendák, I., Horváth, T. (2020). A Novel Evaluation Metric for Synthetic Data Generation. In: Analide, C., Novais, P., Camacho, D., Yin, H. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2020. IDEAL 2020. Lecture Notes in Computer Science, vol 12490. Springer, Cham. https://doi.org/10.1007/978-3-030-62365-4_3
DOI: https://doi.org/10.1007/978-3-030-62365-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62364-7
Online ISBN: 978-3-030-62365-4
eBook Packages: Computer Science, Computer Science (R0)