Abstract
The production of scientific-use files from economic microdata is a major problem. Many common methods perturb the data so that the univariate distribution of each variable stays close to that of the original data, but the multivariate structure of the data is often destroyed.
Which methods are suitable depends strongly on the underlying data. What is needed is a software system in which different methods can be applied and the results of different algorithms evaluated and compared in a flexible way. Using microdata protection methods as an exploratory data analysis tool requires a powerful software system that can present the results in a number of easy-to-grasp graphics. For this purpose, some of the most popular procedures for anonymising microdata are implemented in a flexible R package. The R system provides flexible data import/export facilities and advanced development tools for building such disclosure control software.
In addition to algorithms already available in other software (the MDAV algorithm for microaggregation, ...), some new algorithms for anonymising microdata are implemented, e.g. a fast algorithm for microaggregation based on a projection pursuit approach. This algorithm outperforms the existing algorithms on most real data sets.
For all these algorithms and methods, print, summary and plot methods as well as validation methods are implemented.
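The basic idea behind microaggregation can be illustrated with a minimal sketch. It is written in Python rather than R, and it uses plain single-axis sorting to form the groups; it is not the MDAV algorithm or the projection pursuit approach mentioned above. The function name `microaggregate`, its parameters and the toy data are assumptions made only for this illustration.

```python
from statistics import mean

def microaggregate(records, k=3, sort_by=0):
    """Replace each record by the mean of its group of at least k records.

    Groups are formed by sorting on one variable (single-axis sorting).
    This is a simple textbook variant, not MDAV or projection pursuit.
    """
    if k < 1 or len(records) < k:
        raise ValueError("need k >= 1 and at least k records")
    # Indices of the records, ordered by the chosen sorting variable.
    order = sorted(range(len(records)), key=lambda i: records[i][sort_by])
    # Cut the ordering into consecutive groups of size k.
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # A trailing group smaller than k is merged into the previous group.
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    n_vars = len(records[0])
    masked = [None] * len(records)
    for group in groups:
        # Every member of a group is released as the group centroid.
        centroid = tuple(mean(records[i][j] for i in group)
                         for j in range(n_vars))
        for i in group:
            masked[i] = centroid
    return masked

# Hypothetical toy data: (turnover, employees) for seven enterprises.
data = [(10.0, 1.0), (12.0, 2.0), (11.0, 1.0), (50.0, 9.0),
        (52.0, 8.0), (49.0, 10.0), (300.0, 40.0)]
masked = microaggregate(data, k=3)
```

Each released record is shared by at least `k` enterprises, and the column means are unchanged. In this sketch the outlying enterprise is merged into the second group and pulls that group's centroid far from its other members, which is one way the structure of the data can be distorted.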
In economics, suppression of cells in marginal tables is likely the method most commonly used by statistical agencies to protect tables. Linear programming for cell suppression seems to be the best way of protecting both simple and hierarchical tables.
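Why suppressing a single sensitive cell is not enough when marginal totals are published can be shown with a small Python sketch. The function `recover_suppressed` and the table values are hypothetical and serve only as an illustration. The sketch checks exact recovery by subtraction only; it does not solve the linear programs used to choose suppressions.

```python
def recover_suppressed(table, row_totals, col_totals):
    """Fill in every suppressed cell (None) that is the only missing value
    in its row or column, repeating until nothing more can be derived."""
    table = [row[:] for row in table]  # work on a copy
    changed = True
    while changed:
        changed = False
        # A row with exactly one gap reveals it via the row total.
        for r, row in enumerate(table):
            missing = [c for c, v in enumerate(row) if v is None]
            if len(missing) == 1:
                known = sum(v for v in row if v is not None)
                row[missing[0]] = row_totals[r] - known
                changed = True
        # A column with exactly one gap reveals it via the column total.
        for c in range(len(col_totals)):
            col = [row[c] for row in table]
            missing = [r for r, v in enumerate(col) if v is None]
            if len(missing) == 1:
                known = sum(v for v in col if v is not None)
                table[missing[0]][c] = col_totals[c] - known
                changed = True
    return table

# Hypothetical 3x3 frequency table with published marginal totals.
row_totals = [40, 50, 30]
col_totals = [40, 40, 40]

# Primary suppression only: the sensitive cell (value 3) is hidden.
primary = [[20, None, 17], [15, 25, 10], [5, 12, 13]]
primary_only = recover_suppressed(primary, row_totals, col_totals)

# Primary plus secondary suppressions forming a rectangle of hidden cells.
secondary = [[20, None, None], [15, 25, 10], [5, None, None]]
protected = recover_suppressed(secondary, row_totals, col_totals)
```

With the primary suppression alone, the hidden value is recovered as 40 - 20 - 17 = 3. With the additional suppressions, no hidden cell can be recovered exactly by subtraction, although an attacker could still derive bounds on the hidden values from the totals.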
Some R packages for various fields of disclosure control are currently being developed. Their integrated online help, with ready-to-run examples, makes it easy to learn how to apply disclosure control even with little prior knowledge.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Templ, M. (2006). Software Development for SDC in R. In: Domingo-Ferrer, J., Franconi, L. (eds) Privacy in Statistical Databases. PSD 2006. Lecture Notes in Computer Science, vol 4302. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11930242_29
DOI: https://doi.org/10.1007/11930242_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49330-3
Online ISBN: 978-3-540-49332-7
eBook Packages: Computer Science, Computer Science (R0)