Abstract
A standard approach in the statistical disclosure control tool-kit for producing synthetic data is based on multivariate sequential chained-equation regression models, in which each successive regression includes the variables from the preceding regressions. The models depend on conditional Bayesian posterior distributions and can handle continuous, binary and categorical variables. Synthetic data are generated by drawing random values from the corresponding predictive distributions. Multiple copies of the synthetic data are generated, inference is carried out on each data set, and the results are combined into point and variance estimates under well-established combination rules. In this paper, we investigate whether algorithms and mechanisms from the differential privacy literature can be added to the synthetic data production process to raise the privacy standards used at National Statistical Institutes. In particular, we focus on a differentially private functional mechanism that adds random noise to the estimating equations of the regression models. We also incorporate regularization in the OLS linear models (ridge regression) to compensate for the noisy estimating equations and to bound the global sensitivity. We evaluate the standard and modified multivariate sequential chained-equation regression approaches for producing synthetic data in a small-scale simulation study.
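As a rough illustration of what adding noise to the estimating equations and compensating with ridge regularization could look like for one linear model in the sequence, the following Python sketch perturbs the normal equations X'X b = X'y with Laplace noise and then adds a ridge penalty before solving. It is not the paper's implementation. The function name, the parameters `sensitivity` and `ridge`, and the data scaling in the usage lines are assumptions made for illustration; in particular, `sensitivity` must be a valid bound supplied by the caller for the chosen scaling of the data, and the sketch does not derive or verify it.

```python
import numpy as np


def noisy_ridge_coefficients(X, y, epsilon, sensitivity, ridge, rng):
    """Solve ridge normal equations after perturbing them with Laplace noise.

    Hypothetical helper for illustration only. `sensitivity` is assumed to be
    a valid bound for the scaling of X and y; it is not computed here.
    """
    d = X.shape[1]
    scale = sensitivity / epsilon  # Laplace scale for every perturbed term

    xtx = X.T @ X  # left-hand side of the normal equations
    xty = X.T @ y  # right-hand side of the normal equations

    # Draw noise for the upper triangle and mirror it so that the
    # perturbed X'X stays symmetric.
    upper = np.triu(rng.laplace(0.0, scale, size=(d, d)))
    sym_noise = upper + np.triu(upper, 1).T

    noisy_xtx = xtx + sym_noise
    noisy_xty = xty + rng.laplace(0.0, scale, size=d)

    # The ridge term pushes the noisy matrix back toward positive
    # definiteness; it has to be chosen large enough for that to hold.
    return np.linalg.solve(noisy_xtx + ridge * np.eye(d), noisy_xty)


# Usage on simulated data (all values below are illustrative assumptions).
rng = np.random.default_rng(2020)
X_demo = rng.uniform(-1.0, 1.0, size=(1000, 3)) / np.sqrt(3.0)  # row norm <= 1
beta_demo = np.array([0.5, -0.2, 0.1])
y_demo = X_demo @ beta_demo + rng.normal(0.0, 0.05, size=1000)

coef_demo = noisy_ridge_coefficients(
    X_demo, y_demo, epsilon=1.0, sensitivity=32.0, ridge=50.0, rng=rng
)
print(coef_demo)
```

In a sequential synthesis, a function of this kind would be called once per regression in the chain, and synthetic values would then be drawn from the predictive distribution implied by the noisy coefficients. The privacy budget would have to be split across the regressions, which the sketch leaves out.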
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Shlomo, N. (2020). Integrating Differential Privacy in the Statistical Disclosure Control Tool-Kit for Synthetic Data Production. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_19
DOI: https://doi.org/10.1007/978-3-030-57521-2_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57520-5
Online ISBN: 978-3-030-57521-2
eBook Packages: Computer Science, Computer Science (R0)