Abstract
There has been some confusion in recent years in which circumstances datasets generated using the synthetic data approach should be considered fully synthetic and which estimator to use for obtaining valid variance estimates based on the synthetic data. This paper aims at providing some guidance to overcome this confusion. It offers a review of the different approaches for generating synthetic datasets and discusses their similarities and differences. It also presents the different variance estimators that have been proposed for analyzing the synthetic data. Based on two simulation studies the advantages and limitations of the different estimators are discussed. The paper concludes with some general recommendations how to judge which synthesis strategy and which variance estimator is most suitable in which situation.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Drechsler, J.: Improved variance estimation for fully synthetic datasets. In: Proceedings of the Joint UNECE/EUROSTAT Work Session on Statistical Data Confidentiality (2011)
Drechsler, J.: Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation. LNS, vol. 201. Springer, New York (2011). https://doi.org/10.1007/978-1-4614-0326-5
Drechsler, J., Reiter, J.P.: Disclosure risk and data utility for partially synthetic data: an empirical study using the German IAB establishment survey. J. Off. Stat. 25, 589–603 (2009)
Drechsler, J., Reiter, J.P.: Sampling with synthesis: a new approach for releasing public use census microdata. J. Am. Stat. Assoc. 105(492), 1347–1357 (2010)
Drechsler, J., Reiter, J.P.: Combining synthetic data with subsampling to create public use microdata files for large scale surveys. Surv. Methodol. 38, 73–79 (2012)
Kinney, S.K., Reiter, J.P., Reznek, A.P., Miranda, J., Jarmin, R.S., Abowd, J.M.: Towards unrestricted public use business microdata: the synthetic longitudinal business database. Int. Stat. Rev. 79, 362–384 (2011)
Little, R.J.A.: Statistical analysis of masked data. J. Off. Stat. 9, 407–426 (1993)
Raab, G.M., Nowok, B., Dibben, C.: Practical data synthesis for large samples. J. Priv. Confid. 7(3), 4 (2017)
Raghunathan, T.E., Reiter, J.P., Rubin, D.B.: Multiple imputation for statistical disclosure limitation. J. Off. Stat. 19, 1–16 (2003)
Reiter, J.P.: Satisfying disclosure restrictions with synthetic data sets. J. Off. Stat. 18, 531–544 (2002)
Reiter, J.P.: Inference for partially synthetic, public use microdata sets. Surv. Methodol. 29, 181–189 (2003)
Reiter, J.P., Drechsler, J.: Releasing multiply-imputed, synthetic data generated in two stages to protect confidentiality. Stat. Sin. 20, 405–421 (2010)
Reiter, J.P., Kinney, S.K.: Inferentially valid, partially synthetic data: generating from posterior predictive distributions not necessary. J. Off. Stat. 28(4), 583–590 (2012)
Rubin, D.B.: Discussion: statistical disclosure limitation. J. Off. Stat. 9, 462–468 (1993)
Si, Y., Reiter, J.P.: A comparison of posterior simulation and inference by combining rules for multiple imputation. J. Stat. Theory Pract. 5(2), 335–347 (2011)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Drechsler, J. (2018). Some Clarifications Regarding Fully Synthetic Data. In: Domingo-Ferrer, J., Montes, F. (eds) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science(), vol 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_8
Download citation
DOI: https://doi.org/10.1007/978-3-319-99771-1_8
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99770-4
Online ISBN: 978-3-319-99771-1
eBook Packages: Computer ScienceComputer Science (R0)