Abstract
Unrestricted availability of the datasets is important for the researchers to evaluate their strategies to solve the research problems. While publicly releasing the datasets, it is equally important to protect the privacy of the respective data owners. Synthetic datasets that preserve the utility while protecting the privacy of the data owners stands as a midway.
There are two ways to synthetically generate the data. Firstly, one can generate a fully synthetic dataset by subsampling it from a synthetically generated population. This technique is known as fully synthetic dataset generation. Secondly, one can generate a partially synthetic dataset by synthesizing the values of sensitive attributes. This technique is known as partially synthetic dataset generation. The datasets generated by these two techniques vary in their utilities as well as in their risks of disclosure.
We perform a comparative study of these techniques with the use of different dataset synthesisers such as linear regression, decision tree, random forest and neural network. We evaluate the effectiveness of these techniques towards the amounts of utility that they preserve and the risks of disclosure that they suffer. We find decision tree to be an efficient and a competitively effective dataset synthesiser.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Caiola, G., Reiter, J.P.: Random forests for generating partially synthetic, categorical data. Trans. Data Priv. 3(1), 27–42 (2010)
Dandekar, A., Zen, R.A., Bressan, S.: A comparative study of synthetic dataset generation techniques. Technical report TRA6/18, National University of Singapore, June 2018. https://dl.comp.nus.edu.sg/handle/1900.100/7050
Drechsler, J.: Using support vector machines for generating synthetic datasets. In: Domingo-Ferrer, J., Magkos, E. (eds.) PSD 2010. LNCS, vol. 6344, pp. 148–161. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15838-4_14
Drechsler, J.: Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation, vol. 201. Springer, Heidelberg (2011). https://doi.org/10.1007/978-1-4614-0326-5
Drechsler, J., Bender, S., Rässler, S.: Comparing fully and partially synthetic datasets for statistical disclosure control in the german iab establishment panel. Trans. Data Priv. 1(3), 105–130 (2008)
Drechsler, J., Reiter, J.P.: Accounting for intruder uncertainty due to sampling when estimating identification disclosure risks in partially synthetic data. In: Domingo-Ferrer, J., Saygın, Y. (eds.) PSD 2008. LNCS, vol. 5262, pp. 227–238. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87471-3_19
Drechsler, J., Reiter, J.P.: An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Comput. Stat. Data Anal. 55(12), 3232–3243 (2011)
Karr, A.F., Kohnen, C.N., Oganian, A., Reiter, J.P., Sanil, A.P.: A framework for evaluating the utility of data altered to protect confidentiality. Am. Stat. 60(3), 224–232 (2006)
Little, R.J.: Statistical analysis of masked data. J. Off. Stat. 9(2), 407 (1993)
Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)
Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: 2008 IEEE Symposium on Security and Privacy. SP 2008, pp. 111–125. IEEE (2008)
Nowok, B., Raab, G., Dibben, C.: synthpop: bespoke creation of synthetic data in R. J. Stat. Softw. 74(11), 1–26 (2016)
Raghunathan, T.E., Reiter, J.P., Rubin, D.B.: Multiple imputation for statistical disclosure limitation. J. Off. Stat. 19(1), 1 (2003)
Reiter, J.P.: Inference for partially synthetic, public use microdata sets. Surv. Methodol. 29(2), 181–188 (2003)
Reiter, J.P.: Estimating risks of identification disclosure in microdata. J. Am. Stat. Assoc. 100(472), 1103–1112 (2005)
Reiter, J.P.: Using cart to generate partially synthetic public use microdata. J. Off. Stat. 21(3), 441 (2005)
Rubin, D.B.: Basic ideas of multiple imputation for nonresponse. Surv. Methodol. 12(1), 37–47 (1986)
Rubin, D.B.: Discussion statistical disclosure limitation. J. Off. Stat. 9(2), 461 (1993)
Ruggles, S., Genadek, K., Goeken, R., Grover, J., Sobek, M.: Integrated public use microdata series: Version 6.0 [dataset] (2015). https://doi.org/10.18128/D010.V6.0
Sweeney, L.: Computational disclosure control for medical microdata: the datafly system. In: Record Linkage Techniques 1997: Proceedings of an International Workshop and Exposition, pp. 442–453 (1997)
Woo, M.J., Reiter, J.P., Oganian, A., Karr, A.F.: Global measures of data utility for microdata masked for disclosure limitation. J. Priv. Confid. 1(1), 7 (2009)
Acknowledgement
This research is supported by the National Research Foundation, Prime Ministers Office, Singapore, under its Corporate Laboratory@University Scheme, National University of Singapore, and Singapore Telecommunications Ltd.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Dandekar, A., Zen, R.A.M., Bressan, S. (2018). A Comparative Study of Synthetic Dataset Generation Techniques. In: Hartmann, S., Ma, H., Hameurlain, A., Pernul, G., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2018. Lecture Notes in Computer Science(), vol 11030. Springer, Cham. https://doi.org/10.1007/978-3-319-98812-2_35
Download citation
DOI: https://doi.org/10.1007/978-3-319-98812-2_35
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98811-5
Online ISBN: 978-3-319-98812-2
eBook Packages: Computer ScienceComputer Science (R0)