A Comparative Study of Synthetic Dataset Generation Techniques

  • Conference paper
Database and Expert Systems Applications (DEXA 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11030)

Abstract

Unrestricted availability of datasets is important for researchers to evaluate their strategies for solving research problems. When releasing datasets publicly, it is equally important to protect the privacy of the respective data owners. Synthetic datasets that preserve utility while protecting the privacy of the data owners offer a middle ground.

There are two ways to synthetically generate the data. Firstly, one can generate a fully synthetic dataset by subsampling it from a synthetically generated population. This technique is known as fully synthetic dataset generation. Secondly, one can generate a partially synthetic dataset by synthesizing the values of sensitive attributes. This technique is known as partially synthetic dataset generation. The datasets generated by these two techniques vary in their utilities as well as in their risks of disclosure.
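The difference between the two techniques can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy dataset with one non-sensitive attribute `x` and one sensitive attribute `y`, uses a simple linear regression as the synthesiser, and bootstraps the `x` values to build the synthetic population. The function names `fit_linear`, `partially_synthetic` and `fully_synthetic` are hypothetical.

```python
import random
import statistics


def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, statistics.pstdev(residuals)


def partially_synthetic(records, rng):
    """Keep the original x values; replace only the sensitive y values."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    a, b, sd = fit_linear(xs, ys)
    # Each sensitive value is redrawn from the fitted model plus noise.
    return [(x, rng.gauss(a + b * x, sd)) for x in xs]


def fully_synthetic(records, population_size, sample_size, rng):
    """Generate a synthetic population, then subsample it."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    a, b, sd = fit_linear(xs, ys)
    # Assumption: population x values are bootstrapped from the observed ones.
    pop_x = rng.choices(xs, k=population_size)
    population = [(x, rng.gauss(a + b * x, sd)) for x in pop_x]
    # The released dataset is a random subsample of the synthetic population.
    return rng.sample(population, sample_size)


# Toy data (an assumption for illustration): y depends linearly on x plus noise.
rng = random.Random(0)
records = [(float(x), 2.0 * x + 1.0 + rng.gauss(0.0, 0.5)) for x in range(50)]

partial = partially_synthetic(records, rng)
full = fully_synthetic(records, population_size=500, sample_size=50, rng=rng)
```

In the partially synthetic output, every record still corresponds to an original record and only the sensitive value differs. In the fully synthetic output, no released record is tied to a specific original record, which is one reason the two techniques differ in utility and disclosure risk.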

We perform a comparative study of these techniques using different dataset synthesisers such as linear regression, decision tree, random forest and neural network. We evaluate the effectiveness of these techniques in terms of the utility that they preserve and the risks of disclosure that they incur. We find decision tree to be an efficient and competitively effective dataset synthesiser.

Acknowledgement

This research is supported by the National Research Foundation, Prime Minister's Office, Singapore, under its Corporate Laboratory@University Scheme, National University of Singapore, and Singapore Telecommunications Ltd.

Author information

Corresponding author

Correspondence to Ashish Dandekar.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Dandekar, A., Zen, R.A.M., Bressan, S. (2018). A Comparative Study of Synthetic Dataset Generation Techniques. In: Hartmann, S., Ma, H., Hameurlain, A., Pernul, G., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2018. Lecture Notes in Computer Science, vol 11030. Springer, Cham. https://doi.org/10.1007/978-3-319-98812-2_35

  • DOI: https://doi.org/10.1007/978-3-319-98812-2_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98811-5

  • Online ISBN: 978-3-319-98812-2

  • eBook Packages: Computer Science, Computer Science (R0)
