Using Support Vector Machines for Generating Synthetic Datasets

  • Conference paper
Privacy in Statistical Databases (PSD 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6344)

Abstract

Generating synthetic datasets is an innovative approach for data dissemination. Values at risk of disclosure, or even the entire dataset, are replaced with multiple draws from statistical models. The quality of the released data strongly depends on the ability of these models to capture important relationships found in the original data. Defining useful models for complex survey data can be difficult and cumbersome. One possible approach to reduce the modeling burden for data-disseminating agencies is to rely on machine learning tools to reveal important relationships in the data.

This paper presents an initial investigation of whether support vector machines can be used to generate synthetic datasets. The application is limited to categorical data, but extensions to continuous data should be straightforward. I briefly describe the concept of support vector machines and the adjustments necessary for synthetic data generation. I evaluate the performance of the suggested algorithm using a real dataset, the IAB Establishment Panel. The results indicate that some data utility improvements might be achievable using support vector machines. However, these improvements come at the price of an increased disclosure risk compared to standard parametric modeling, and more research is needed to find ways of reducing this risk. Some ideas for achieving this goal are provided in the discussion at the end of the paper.




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Drechsler, J. (2010). Using Support Vector Machines for Generating Synthetic Datasets. In: Domingo-Ferrer, J., Magkos, E. (eds) Privacy in Statistical Databases. PSD 2010. Lecture Notes in Computer Science, vol 6344. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15838-4_14

  • DOI: https://doi.org/10.1007/978-3-642-15838-4_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15837-7

  • Online ISBN: 978-3-642-15838-4

  • eBook Packages: Computer Science (R0)
