A Novel Evaluation Metric for Synthetic Data Generation

Galloni, Andrea; Lendák, Imre; Horváth, Tomáš

doi:10.1007/978-3-030-62365-4_3

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12490)

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning

Abstract

Differentially private algorithmic synthetic data generation (SDG) solutions take input datasets \(D_p\) consisting of sensitive, private data and generate synthetic data \(D_s\) with similar qualities. The importance of such solutions is increasing both because more and more people realize how much data is collected about them and used in machine learning contexts, and because of newly introduced data privacy regulations, e.g. the EU’s General Data Protection Regulation (GDPR). We aim to develop a novel composite SDG evaluation metric which weighs macro-statistical dataset similarity and data utility in machine learning tasks against the privacy boundaries of the synthetic data. We formalize the mathematical foundations for quantitatively measuring both the statistical similarity and the data utility of synthetic data. We use two well-known datasets containing (potentially) personally identifiable information as inputs (\(D_p\)) and the existing SDG algorithms PrivBayes and DPGroupFields to generate synthetic data (\(D_s\)) based on them. We then test our evaluation metric for different values of the privacy budget \(\epsilon\). Based on our experiments we conclude that the proposed composite evaluation metric is appropriate for quantitatively measuring the quality of synthetic data generated by different SDG solutions and shows the expected sensitivity to the privacy budget value.
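
The idea of a composite score can be illustrated with a minimal sketch. It is not the metric defined in the paper, whose formulas are not reproduced on this page. The function names, the correlation-based similarity, the nearest-centroid utility test, the equal weighting and the Laplace-noise "synthesizer" are all assumptions chosen for illustration; the noise step stands in for a real SDG algorithm such as PrivBayes and carries no differential privacy guarantee.

```python
import numpy as np

def correlation_similarity(real, synth):
    """Assumed macro-statistical similarity: 1 minus the scaled mean
    absolute difference between the two Pearson correlation matrices."""
    c_real = np.corrcoef(real, rowvar=False)
    c_synth = np.corrcoef(synth, rowvar=False)
    # Correlations lie in [-1, 1], so absolute differences lie in [0, 2].
    return 1.0 - float(np.mean(np.abs(c_real - c_synth))) / 2.0

def centroid_accuracy(train_x, train_y, test_x, test_y):
    """Accuracy of a nearest-centroid classifier (a stand-in ML task)."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[np.argmin(dists, axis=1)]
    return float(np.mean(pred == test_y))

def composite_score(real_x, real_y, synth_x, synth_y, weight=0.5):
    """Hypothetical composite: weighted mean of similarity and utility.
    Utility is the accuracy of a model trained on synthetic data and
    tested on real data, relative to a model trained on real data."""
    sim = correlation_similarity(real_x, synth_x)
    acc_real = centroid_accuracy(real_x, real_y, real_x, real_y)
    acc_synth = centroid_accuracy(synth_x, synth_y, real_x, real_y)
    utility = min(acc_synth / acc_real, 1.0) if acc_real > 0 else 0.0
    return weight * sim + (1.0 - weight) * utility

# Toy "private" dataset: three numeric features and a binary label.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
x = rng.normal(size=(500, 3)) + 2.0 * y[:, None]

# Toy synthesizer (assumption): perturb the real features with Laplace
# noise of scale 1/epsilon, so a smaller epsilon means more distortion.
for eps in (0.1, 1.0, 10.0):
    noisy = x + rng.laplace(scale=1.0 / eps, size=x.shape)
    print(f"epsilon={eps}: score={composite_score(x, y, noisy, y):.3f}")
```

Under these assumptions the score is bounded in [0, 1] and equals 1 when the synthetic data is identical to the input, which makes it easy to observe how it reacts as \(\epsilon\) changes.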

Notes

1. Because this paper focuses on evaluation, the most popular SDG approaches are only listed here; their detailed description is out of the scope of this paper.
2. The important role of \(\epsilon\) in DP justifies its presence as a subscript of G in our evaluation metric definition, since we evaluate G for varying values of \(\epsilon\).

Acknowledgments

This research was co-funded by EIT Digital Industrial Doctorate and Ericsson Hungary. Project no. ED_18-1-2019-0030 (Application domain specific highly reliable IT solutions subprogramme) has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the Thematic Excellence Programme funding scheme. We are thankful to Gian Marco Canneori for the fruitful discussions leading to the final mathematical foundations presented in this paper.

Author information

Authors and Affiliations

Faculty of Informatics, Department of Data Science and Engineering, ELTE – Eötvös Loránd University, Budapest, Hungary
Andrea Galloni, Imre Lendák & Tomáš Horváth
Faculty of Science, Institute of Computer Science, Pavol Jozef Šafárik University, Košice, Slovakia
Tomáš Horváth
Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia
Imre Lendák

Authors

Andrea Galloni
Imre Lendák
Tomáš Horváth

Corresponding author

Correspondence to Andrea Galloni.

Editor information

Editors and Affiliations

University of Minho, Braga, Portugal
Cesar Analide
University of Minho, Braga, Portugal
Paulo Novais
Technical University of Madrid, Madrid, Spain
David Camacho
University of Manchester, Manchester, UK
Hujun Yin

About this paper

Cite this paper

Galloni, A., Lendák, I., Horváth, T. (2020). A Novel Evaluation Metric for Synthetic Data Generation. In: Analide, C., Novais, P., Camacho, D., Yin, H. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2020. IDEAL 2020. Lecture Notes in Computer Science, vol 12490. Springer, Cham. https://doi.org/10.1007/978-3-030-62365-4_3

DOI: https://doi.org/10.1007/978-3-030-62365-4_3
Published: 27 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62364-7
Online ISBN: 978-3-030-62365-4
eBook Packages: Computer Science, Computer Science (R0)
