
Integrating Differential Privacy in the Statistical Disclosure Control Tool-Kit for Synthetic Data Production

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12276)

Abstract

A standard approach in the statistical disclosure control tool-kit for producing synthetic data is a procedure based on multivariate sequential chained-equation regression models, in which each successive regression includes the variables from the preceding regressions. The models depend on conditional Bayesian posterior distributions and can handle continuous, binary and categorical variables. Synthetic data are generated by drawing random values from the corresponding predictive distributions. Multiple copies of the synthetic data are generated, inference is carried out on each data set, and the results are combined into point and variance estimates under well-established combination rules. In this paper, we investigate whether algorithms and mechanisms from the differential privacy literature can be added to the synthetic data production process to raise the privacy standards used at National Statistical Institutes. In particular, we focus on a differentially private functional mechanism that adds random noise to the estimating equations of the regression models. We also incorporate regularization in the OLS linear models (ridge regression) to compensate for the noisy estimating equations and to bound the global sensitivity. We evaluate the standard and modified multivariate sequential chained-equation regression approaches for producing synthetic data in a small-scale simulation study.
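To make the idea of noisy estimating equations with a ridge penalty concrete, the following is a minimal sketch and not the paper's implementation. It assumes every covariate and the response have been scaled to [-1, 1], perturbs the entries of X'X and X'y with Laplace noise, and then solves the ridge system. The function names (`laplace`, `solve`, `dp_ridge_coefficients`), the replace-one-record neighbouring relation and the conservative sensitivity bound 2(d + 1)^2 are assumptions made for this illustration.

```python
import random


def laplace(scale, rng):
    # The difference of two iid exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; try a larger ridge penalty")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def dp_ridge_coefficients(X, y, epsilon, lam, rng):
    """Ridge coefficients from Laplace-perturbed estimating equations.

    Assumes (illustration only) that all entries of X and y lie in [-1, 1].
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if any(abs(v) > 1 for row in X for v in row) or any(abs(v) > 1 for v in y):
        raise ValueError("X and y must be scaled to [-1, 1]")
    d = len(X[0])
    # Replacing one record changes each entry of X'X and X'y by at most 2,
    # so the L1 sensitivity is at most 2(d^2 + d) <= 2(d + 1)^2 (assumed bound).
    scale = 2.0 * (d + 1) ** 2 / epsilon
    xtx = [[sum(r[j] * r[k] for r in X) for k in range(d)] for j in range(d)]
    xty = [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(d)]
    noisy = [[xtx[j][k] + laplace(scale, rng) for k in range(d)] for j in range(d)]
    # Symmetrising and adding the penalty are post-processing steps, so they
    # do not affect the privacy guarantee.
    a = [[(noisy[j][k] + noisy[k][j]) / 2.0 + (lam if j == k else 0.0)
          for k in range(d)] for j in range(d)]
    b = [xty[j] + laplace(scale, rng) for j in range(d)]
    return solve(a, b)
```

In this sketch the ridge penalty `lam` plays the role described in the abstract: it stabilises the solution when the perturbed matrix is poorly conditioned, which becomes more likely as `epsilon` decreases and the noise scale grows.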

Author information

Corresponding author

Correspondence to Natalie Shlomo.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Shlomo, N. (2020). Integrating Differential Privacy in the Statistical Disclosure Control Tool-Kit for Synthetic Data Production. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_19

  • DOI: https://doi.org/10.1007/978-3-030-57521-2_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57520-5

  • Online ISBN: 978-3-030-57521-2

  • eBook Packages: Computer Science, Computer Science (R0)
