Software Development for SDC in R

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4302)

Abstract

The production of scientific-use files from economic microdata is a major problem. Many common methods change the data in such a way that the univariate distribution of each variable remains almost the same as in the original data, while the multivariate structure of the data is often ruined.

Which methods are suitable depends strongly on the underlying data. What is needed is a program system with which one can apply different methods and evaluate and compare the results of different algorithms in a flexible way. Using methods for protecting microdata as an exploratory data analysis tool requires a powerful program system that can present the results in a number of easy-to-grasp graphics. For this purpose, some of the most popular procedures for anonymising microdata are implemented in a flexible R package. The R system provides flexible data import/export facilities and advanced development tools for building such disclosure control software.

In addition to algorithms already available in other software (the MDAV algorithm for microaggregation, ...), some new algorithms for anonymising microdata are implemented, e.g. a fast algorithm for microaggregation with a projection pursuit approach. This algorithm outperforms the existing algorithms on most real data sets.
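
As a rough illustration of what microaggregation does, the following sketch sorts records on a single variable, forms groups of at least k consecutive records, and replaces every value by its group mean. It is a minimal single-axis variant written in Python, not the MDAV algorithm and not the projection pursuit algorithm of the paper; the function name `microaggregate`, its parameters, and the toy data are assumptions made for this example.

```python
from statistics import mean


def microaggregate(records, k=3, sort_col=0):
    """Single-axis microaggregation (illustrative sketch only).

    records  : list of equal-length numeric tuples
    k        : minimum group size
    sort_col : index of the variable used for sorting
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(records)
    if n == 0:
        return []
    # Order the record indices by the chosen sorting variable.
    order = sorted(range(n), key=lambda i: records[i][sort_col])
    # Cut the ordering into consecutive groups of size k.
    groups = [order[i:i + k] for i in range(0, n, k)]
    # Merge a short last group into its predecessor so that every
    # group holds at least k records (whenever n >= k).
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    result = [None] * n
    for group in groups:
        # The column-wise means of the group replace the original values.
        centroid = tuple(mean(col) for col in zip(*(records[i] for i in group)))
        for i in group:
            result[i] = centroid
    return result


# Hypothetical toy data: two numeric variables, seven records.
data = [(1.0, 10.0), (2.0, 30.0), (3.0, 20.0), (10.0, 5.0),
        (11.0, 7.0), (12.0, 9.0), (50.0, 1.0)]
masked = microaggregate(data, k=3)
```

Each variable keeps its overall mean, because group means are weighted by group size. Relations between variables inside a group are flattened, which is why the choice of sorting direction matters for the multivariate structure.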

For all these algorithms and methods, print, summary and plot methods as well as methods for validation are implemented.

In the field of economics, suppression of cells in marginal tables is likely to be the most popular method used by statistical agencies to protect tables. The use of linear programming for cell suppression seems to be the best way of protecting tables and hierarchical tables.
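
The following Python sketch, under stated assumptions, shows why suppressing sensitive cells alone does not protect a table with published margins: a cell that is the only suppressed one in its row or column can be recovered by subtraction. The minimum-frequency rule, the function names `primary_suppress` and `recoverable_cells`, and the example counts are all hypothetical and are not taken from the paper.

```python
def primary_suppress(table, threshold=3):
    """Replace every cell whose count is below `threshold` by None."""
    return [[None if v < threshold else v for v in row] for row in table]


def recoverable_cells(masked, row_totals, col_totals):
    """Return {(row, col): value} for suppressed cells that are the only
    suppressed cell in their row or column and so follow from a margin."""
    found = {}
    for i, row in enumerate(masked):
        missing = [j for j, v in enumerate(row) if v is None]
        if len(missing) == 1:
            j = missing[0]
            found[(i, j)] = row_totals[i] - sum(v for v in row if v is not None)
    for j, col in enumerate(zip(*masked)):
        missing = [i for i, v in enumerate(col) if v is None]
        if len(missing) == 1:
            i = missing[0]
            found[(i, j)] = col_totals[j] - sum(v for v in col if v is not None)
    return found


# Hypothetical 3 x 3 frequency table with published margins.
table = [[20, 2, 15],
         [12, 9, 30],
         [18, 25, 11]]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

masked = primary_suppress(table, threshold=3)
leaks = recoverable_cells(masked, row_totals, col_totals)
```

Here the single suppressed cell is disclosed exactly by its row total. Secondary suppression selects additional cells so that no such subtraction succeeds; choosing them with linear programming, as the abstract describes, is not shown in this sketch.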

Some R packages for various fields of disclosure control are currently being developed. Their integrated online help, with examples ready to be executed, makes it easy to learn the applications of disclosure control even with little previous knowledge.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Templ, M. (2006). Software Development for SDC in R. In: Domingo-Ferrer, J., Franconi, L. (eds) Privacy in Statistical Databases. PSD 2006. Lecture Notes in Computer Science, vol 4302. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11930242_29

  • DOI: https://doi.org/10.1007/11930242_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-49330-3

  • Online ISBN: 978-3-540-49332-7

  • eBook Packages: Computer Science, Computer Science (R0)
