Abstract
Recent growth in domain-specific applications of machine learning can be attributed to the availability of realistic public datasets. Real-world datasets often contain sensitive information about users, which makes them hard to share freely with other stakeholders and researchers due to regulatory and compliance requirements. Synthesising datasets from real data by leveraging generative techniques is gaining popularity. However, the privacy analysis of these datasets is still an open research problem. In this work, we fill this gap by investigating the privacy issues of generated datasets from an attacker's and an auditor's point of view. We propose an instance-level Privacy Score (PS) for each synthetic sample, obtained by measuring the memorisation coefficient \(\boldsymbol{\alpha _{m}}\) per sample. Leveraging PS, we empirically show that the accuracy of membership inference attacks on synthetic data drops significantly. PS is a model-agnostic, post-training measure which not only gives the data sharer guidance about the privacy properties of a given sample but also lets third-party data auditors run privacy checks without access to model internals. We tested our method on two real-world datasets and show that PS-based filtering reduces attack accuracy.
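The abstract does not reproduce the definition of the memorisation coefficient \(\alpha _{m}\), so the following is only a loose illustration of the release pipeline it describes: score each synthetic sample, then filter out the riskiest ones before sharing. The nearest-neighbour proxy used here (a sample is suspicious when it sits much closer to the real training set than to other synthetic samples) and the names `privacy_scores` and `filter_by_score` are our own assumptions, not the paper's method.

```python
import numpy as np

def privacy_scores(synthetic, train, eps=1e-8):
    """Hypothetical per-sample privacy score (illustrative proxy only).

    For each synthetic sample, compare its distance to the nearest real
    training record with its distance to the nearest *other* synthetic
    record. A sample far closer to the training set than to its synthetic
    neighbours behaves like a memorised copy and receives a low score.
    """
    scores = []
    for i, s in enumerate(synthetic):
        d_train = np.min(np.linalg.norm(train - s, axis=1))
        others = np.delete(synthetic, i, axis=0)
        d_syn = np.min(np.linalg.norm(others - s, axis=1))
        alpha = d_syn / (d_train + eps)   # large => suspiciously close to train
        scores.append(1.0 / (1.0 + alpha))  # higher score = safer to release
    return np.array(scores)

def filter_by_score(synthetic, scores, threshold=0.5):
    """Release only the samples whose privacy score clears the threshold."""
    return synthetic[scores >= threshold]
```

Because the score is computed post hoc from the released samples alone, an auditor can apply the same check without any access to the generative model, which is the property the abstract highlights.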
© 2021 Springer Nature Switzerland AG
Cite this paper
Kuppa, A., Aouad, L., Le-Khac, NA. (2021). Towards Improving Privacy of Synthetic DataSets. In: Gruschka, N., Antunes, L.F.C., Rannenberg, K., Drogkaris, P. (eds) Privacy Technologies and Policy. APF 2021. Lecture Notes in Computer Science(), vol 12703. Springer, Cham. https://doi.org/10.1007/978-3-030-76663-4_6
Print ISBN: 978-3-030-76662-7
Online ISBN: 978-3-030-76663-4