
Towards Improving Privacy of Synthetic DataSets

  • Conference paper
  • First Online:
Privacy Technologies and Policy (APF 2021)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12703)


Abstract

Recent growth in domain-specific applications of machine learning can be attributed to the availability of realistic public datasets. Real-world datasets may contain sensitive information about users, which makes them hard to share freely with other stakeholders and researchers due to regulatory and compliance requirements. Synthesising datasets from real data by leveraging generative techniques is gaining popularity. However, the privacy analysis of these datasets is still an open research problem. In this work, we fill this gap by investigating the privacy issues of generated datasets from the attacker's and auditor's points of view. We propose an instance-level Privacy Score (PS) for each synthetic sample, obtained by measuring the memorisation coefficient \(\boldsymbol{\alpha _{m}}\) per sample. Leveraging PS, we empirically show that the accuracy of membership inference attacks on synthetic data drops significantly. PS is a model-agnostic, post-training measure that not only gives the data sharer guidance about the privacy properties of a given sample, but also lets third-party data auditors run privacy checks without access to model internals. We tested our method on two real-world datasets and show that attack accuracy is reduced by PS-based filtering.
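The PS-based filtering described in the abstract can be sketched as follows. The paper's exact definition of the memorisation coefficient \(\alpha_m\) is not given in this excerpt, so the sketch substitutes a hypothetical proxy: the distance from each synthetic sample to its nearest real record, so that samples that nearly copy a real record receive a low privacy score and are filtered out before release. The function names, the normalisation, the threshold, and the proxy itself are illustrative assumptions, not the authors' method.

```python
import numpy as np

def privacy_scores(synthetic: np.ndarray, real: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Assign each synthetic sample a score in [0, 1]; higher = safer to release.

    Hypothetical proxy for the memorisation coefficient: a synthetic
    sample lying very close to some real record is treated as likely
    memorised and scores near 0.
    """
    # Pairwise Euclidean distances, shape (n_synth, n_real)
    d = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=2)
    nearest = d.min(axis=1)          # distance to closest real record
    return nearest / (nearest.max() + eps)  # normalise to [0, 1]

def filter_by_privacy_score(synthetic: np.ndarray, real: np.ndarray,
                            threshold: float = 0.1) -> np.ndarray:
    """Drop synthetic samples whose privacy score falls below the threshold."""
    ps = privacy_scores(synthetic, real)
    return synthetic[ps >= threshold]
```

Because the score is computed on released samples only, a third-party auditor could apply such a check without access to the generative model, which is the model-agnostic, post-training property the abstract claims for PS.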





Author information


Corresponding author

Correspondence to Aditya Kuppa.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Kuppa, A., Aouad, L., Le-Khac, N.-A. (2021). Towards Improving Privacy of Synthetic DataSets. In: Gruschka, N., Antunes, L.F.C., Rannenberg, K., Drogkaris, P. (eds) Privacy Technologies and Policy. APF 2021. Lecture Notes in Computer Science, vol 12703. Springer, Cham. https://doi.org/10.1007/978-3-030-76663-4_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-76663-4_6


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-76662-7

  • Online ISBN: 978-3-030-76663-4

  • eBook Packages: Computer Science (R0)
