
Towards Improving Privacy of Synthetic DataSets

  • Conference paper
  • First Online:
Privacy Technologies and Policy (APF 2021)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12703)


Abstract

Recent growth in domain-specific applications of machine learning can be attributed to the availability of realistic public datasets. Real-world datasets may contain sensitive information about users, which makes them hard to share freely with other stakeholders and researchers due to regulatory and compliance requirements. Synthesising datasets from real data by leveraging generative techniques is gaining popularity. However, the privacy analysis of these datasets is still an open research problem. In this work, we fill this gap by investigating the privacy issues of generated datasets from the attacker's and auditor's points of view. We propose an instance-level Privacy Score (PS) for each synthetic sample, obtained by measuring the memorisation coefficient \(\boldsymbol{\alpha _{m}}\) per sample. Leveraging PS, we empirically show that the accuracy of membership inference attacks on synthetic data drops significantly. PS is a model-agnostic, post-training measure that not only gives the data sharer guidance about the privacy properties of a given sample, but also lets third-party data auditors run privacy checks without access to model internals. We tested our method on two real-world datasets and show that attack accuracy is reduced by PS-based filtering.
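The PS-based filtering described in the abstract can be sketched as follows. The paper's exact definition of the memorisation coefficient \(\alpha_m\) is not given in this excerpt, so the sketch substitutes a hypothetical proxy: the distance from each synthetic sample to its nearest real record, so that samples that nearly copy a real record receive a low privacy score and are filtered out before release. The function names, the normalisation, the threshold, and the proxy itself are illustrative assumptions, not the authors' method.

```python
import numpy as np

def privacy_scores(synthetic: np.ndarray, real: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Assign each synthetic sample a score in [0, 1]; higher = safer to release.

    Hypothetical proxy for the memorisation coefficient: a synthetic
    sample lying very close to some real record is treated as likely
    memorised and scores near 0.
    """
    # Pairwise Euclidean distances, shape (n_synth, n_real)
    d = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=2)
    nearest = d.min(axis=1)          # distance to closest real record
    return nearest / (nearest.max() + eps)  # normalise to [0, 1]

def filter_by_privacy_score(synthetic: np.ndarray, real: np.ndarray,
                            threshold: float = 0.1) -> np.ndarray:
    """Drop synthetic samples whose privacy score falls below the threshold."""
    ps = privacy_scores(synthetic, real)
    return synthetic[ps >= threshold]
```

Because the score is computed on released samples only, a third-party auditor could apply such a check without access to the generative model, which is the model-agnostic, post-training property the abstract claims for PS.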





Author information


Corresponding author

Correspondence to Aditya Kuppa.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Kuppa, A., Aouad, L., Le-Khac, N.-A. (2021). Towards Improving Privacy of Synthetic DataSets. In: Gruschka, N., Antunes, L.F.C., Rannenberg, K., Drogkaris, P. (eds) Privacy Technologies and Policy. APF 2021. Lecture Notes in Computer Science, vol 12703. Springer, Cham. https://doi.org/10.1007/978-3-030-76663-4_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-76663-4_6


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-76662-7

  • Online ISBN: 978-3-030-76663-4

  • eBook Packages: Computer Science (R0)
