A Novel Evaluation Metric for Synthetic Data Generation

  • Conference paper
Intelligent Data Engineering and Automated Learning – IDEAL 2020 (IDEAL 2020)

Abstract

Differentially private algorithmic synthetic data generation (SDG) solutions take an input dataset \(D_p\) consisting of sensitive, private data and generate synthetic data \(D_s\) with similar qualities. The importance of such solutions is increasing, both because more and more people realize how much data is collected about them and used in machine learning contexts, and as a consequence of newly introduced data privacy regulations, e.g. the EU’s General Data Protection Regulation (GDPR). We aim to develop a novel composite SDG evaluation metric that weighs macro-statistical dataset similarities and data utility in machine learning tasks against the privacy boundaries of the synthetic data. We formalize the mathematical foundations for quantitatively measuring both the statistical similarities and the data utility of synthetic data. We use two well-known datasets containing (potentially) personally identifiable information as inputs (\(D_p\)) and the existing SDG algorithms PrivBayes and DPGroupFields to generate synthetic data (\(D_s\)) based on them. We then test our evaluation metric for different values of the privacy budget \(\epsilon \). Based on our experiments, we conclude that the proposed composite evaluation metric is appropriate for quantitatively measuring the quality of synthetic data generated by different SDG solutions and shows the expected sensitivity to various privacy budget values.

Notes

  1. Since this paper focuses on evaluation, the most popular SDG approaches are only listed here; their detailed description is out of the scope of this paper.

  2. The important role of \(\epsilon \) in DP justifies its presence as a subscript of G in our evaluation metric definition, since we evaluate G at varying values of \(\epsilon \).

Acknowledgments

This research was co-funded by EIT Digital Industrial Doctorate and Ericsson Hungary. Project no. ED_18-1-2019-0030 (Application domain specific highly reliable IT solutions subprogramme) has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the Thematic Excellence Programme funding scheme. We are thankful to Gian Marco Canneori for the fruitful discussions leading to the final mathematical foundations presented in this paper.

Author information

Corresponding author

Correspondence to Andrea Galloni.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Galloni, A., Lendák, I., Horváth, T. (2020). A Novel Evaluation Metric for Synthetic Data Generation. In: Analide, C., Novais, P., Camacho, D., Yin, H. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2020. IDEAL 2020. Lecture Notes in Computer Science, vol 12490. Springer, Cham. https://doi.org/10.1007/978-3-030-62365-4_3

  • DOI: https://doi.org/10.1007/978-3-030-62365-4_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-62364-7

  • Online ISBN: 978-3-030-62365-4

  • eBook Packages: Computer Science, Computer Science (R0)
