Integrating Differential Privacy in the Statistical Disclosure Control Tool-Kit for Synthetic Data Production

Shlomo, Natalie

doi:10.1007/978-3-030-57521-2_19

Natalie Shlomo (ORCID: orcid.org/0000-0003-0701-5080)

Conference paper
First Online: 16 September 2020

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12276)

Abstract

A standard approach in the statistical disclosure control tool-kit for producing synthetic data is the procedure based on multivariate sequential chained-equation regression models, where each successive regression includes variables from the preceding regressions. The models depend on conditional Bayesian posterior distributions and can handle continuous, binary and categorical variables. Synthetic data are generated by drawing random values from the corresponding predictive distributions. Multiple copies of the synthetic data are generated, inference is carried out on each data set, and the results are combined for point and variance estimates under well-established combination rules. In this paper, we investigate whether algorithms and mechanisms found in the differential privacy literature can be added to the synthetic data production process to raise the privacy standards used at National Statistical Institutes. In particular, we focus on a differentially private functional mechanism that adds random noise to the estimating equations of the regression models. We also incorporate regularization in the OLS linear models (ridge regression) to compensate for noisy estimating equations and to bound the global sensitivity. We evaluate the standard and modified multivariate sequential chained-equation regression approaches for producing synthetic data in a small-scale simulation study.
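As a rough illustration of the idea in the abstract, the sketch below perturbs the ridge-regression estimating equations with Laplace noise before solving them. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`laplace`, `solve`, `noisy_ridge`), the toy data, and the `lam` and `noise_scale` values are all hypothetical. In particular, `noise_scale` is not calibrated to any sensitivity bound or privacy budget, so the code as written carries no differential privacy guarantee.

```python
import random


def laplace(scale, rng):
    """Draw one Laplace(0, scale) variate as a difference of two exponentials."""
    if scale == 0:
        return 0.0
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def noisy_ridge(rows, y, lam, noise_scale, rng):
    """Solve (X'X + E1 + lam*I) beta = X'y + e2, with E1 and e2 Laplace noise.

    noise_scale is a hypothetical, uncalibrated parameter (assumption).
    """
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    for j in range(p):
        for k in range(j, p):
            e = laplace(noise_scale, rng)  # same draw on both sides keeps X'X symmetric
            xtx[j][k] += e
            if k != j:
                xtx[k][j] += e
        xtx[j][j] += lam  # ridge penalty on the diagonal
        xty[j] += laplace(noise_scale, rng)
    return solve(xtx, xty)


# Toy data (assumption): y = 1 + 2x with an intercept column.
rng = random.Random(0)
x = [i / 10 for i in range(11)]
rows = [[1.0, xi] for xi in x]
y = [1.0 + 2.0 * xi for xi in x]

beta_exact = noisy_ridge(rows, y, lam=0.0, noise_scale=0.0, rng=rng)
beta_noisy = noisy_ridge(rows, y, lam=0.5, noise_scale=0.1, rng=rng)
print(beta_exact, beta_noisy)
```

With `lam=0` and `noise_scale=0` the function reduces to ordinary least squares, which gives a quick check that the perturbation and the penalty are the only sources of distortion in the noisy fit.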

Author information

Authors and Affiliations

Social Statistics Department, School of Social Sciences, University of Manchester, Humanities Bridgeford Street G17A, Manchester, M13 9PL, UK
Natalie Shlomo

Corresponding author

Correspondence to Natalie Shlomo.

Editor information

Editors and Affiliations

Rovira i Virgili University, Tarragona, Catalonia, Spain
Josep Domingo-Ferrer
University of Oklahoma, Norman, OK, USA
Krishnamurty Muralidhar

About this paper

Cite this paper

Shlomo, N. (2020). Integrating Differential Privacy in the Statistical Disclosure Control Tool-Kit for Synthetic Data Production. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_19

DOI: https://doi.org/10.1007/978-3-030-57521-2_19
Published: 16 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57520-5
Online ISBN: 978-3-030-57521-2
eBook Packages: Computer Science, Computer Science (R0)
